Content of the Invention
To solve the above technical problems, the present invention adopts the following technical solution:
A space-saving data de-duplication method in a cloud storage system, wherein the cloud storage system consists of clients that perform file operations, a metadata server that stores the metadata information of the file system, a secondary metadata server that synchronously backs up the image file and operation log of the metadata, and storage nodes that store the data blocks. The method comprises the following steps:
Step 1: Each client pre-processes the local files to be uploaded, performing local de-duplication at both the file level and the block level to prevent duplicate data from being uploaded again, and then uploads the metadata information of the files to be uploaded to the metadata server;
Step 2: The metadata server receives the metadata information from the different clients, reads the file fingerprints and data block fingerprints in turn, compares them against the fingerprint index information in memory, on hard disk and in the write buffer, and finally returns to each client the fingerprint values of the data that has not yet been uploaded.
Step 3: Each client uploads the data that has not yet been uploaded to the storage end; the storage end stores the new data and updates its metadata information table.
Step 4: A client sends a request for the data to be modified, obtains from the metadata server the number of the storage node where that data resides, then connects to the storage node and modifies the data at the storage end directly.
Step 5: The storage end inspects the modified data block. If, by comparing fingerprint values, the modified block is found to already exist on the local node, it is de-duplicated directly. If the modified block is not on the local node, it is first saved on the local node; if comparison at the metadata server then finds it on another node, it is de-duplicated with a delay. If the modified block is found, by comparing the fingerprint indexes on the local node and the metadata server, to exist neither on the local node nor on any other node, then besides saving the block on the local node, the metadata server also creates a replica for it.
The cloud storage system is characterized in that the metadata server further contains a filtering module and an update module. The filtering module filters the duplicate data information from the different clients; the update module updates the global metadata information of the storage end, that is, it updates the metadata information of duplicate data blocks immediately, while the metadata information of non-duplicate data blocks is updated only after feedback is received from the storage nodes.
The client has a file pre-processing module, a local de-duplication module, a metadata management module and a data transmission module. The file pre-processing module classifies files by type and hands them to the local de-duplication module for file-level de-duplication; the non-duplicate files left after file-level de-duplication are returned to the file pre-processing module for further filtering, which filters out non-duplicate files smaller than 64 MB; the remaining files are finally handed back to the local de-duplication module for block-level de-duplication. The metadata management module records the fingerprint values of the data blocks the client has already uploaded, so as to avoid uploading locally duplicated data. The data transmission module is the client's interface to the metadata server and the storage nodes: it uploads the metadata information of the files to be uploaded to the metadata server and uploads the non-duplicate data blocks to the storage nodes.
The storage node comprises a storage module, a metadata management module, a self-check reporting module and a delayed de-duplication module. The storage module is responsible for storing data blocks and allocating their physical addresses; the metadata management module records the metadata information of the data blocks on the local node; the self-check reporting module detects duplicate data produced by the modification of data blocks and hands it to the delayed de-duplication module, which judges whether a duplicate block is a hot-spot duplicate block, handles it accordingly, and feeds the modified metadata information back to the self-check reporting module, which then reports it to the metadata server.
File-level de-duplication in Step 1: the MD5 algorithm is used to compute file fingerprint values; fingerprints are compared only between files of equal size and type, and are then compared again with the local metadata information table to determine which files are duplicates and which are not;
Block-level de-duplication in Step 1 proceeds as follows: for the non-duplicate files remaining after those smaller than 64 MB have been filtered out, a fixed-length chunking algorithm is used with the block length set to 64 MB; the fingerprint value of each data block is computed with the MD5 algorithm, and blocks of equal length are compared to determine the duplicate data blocks.
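The file-level and block-level fingerprinting of Step 1 can be sketched as follows; this is a minimal illustration, and the function names are our own rather than part of the specification:

```python
import hashlib

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB fixed-length blocks, as in Step 1

def file_fingerprint(path):
    """MD5 fingerprint of a whole file (file-level de-duplication)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def block_fingerprints(path):
    """MD5 fingerprints of fixed-length 64 MB blocks (block-level de-duplication)."""
    fps = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            fps.append(hashlib.md5(block).hexdigest())
    return fps
```

Two files (or blocks) are then treated as duplicates when their fingerprint values are equal, which is why the method only compares blocks of equal length.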
When file fingerprints are compared in Step 2, if a fingerprint value is found to exist already, the fingerprints of that file's data blocks are no longer compared; otherwise the data block fingerprints of the file must also be compared.
In Step 3, each storage end keeps the mapping between the fingerprints of its data blocks and their storage addresses, so that the physical address at which a data block is stored can be determined from its fingerprint.
In Step 4, modifications of data blocks by the multiple users of the clients can introduce new duplicate data blocks, which existing storage systems do not take into account. In a backup system, a user modifies data locally and then backs it up again, and the unmodified parts are filtered out during the backup. Cloud storage, by contrast, gives the user an experience as if the data were local: the user obtains the address of the data to be modified and modifies it directly. This is precisely the difference between cloud storage and a backup system.
In Step 5, delayed de-duplication comprises operations on both hot-spot duplicate data blocks and non-hot-spot duplicate data blocks, judged by the following formula:

\bar{A}_{\bar{i}}(t_{p+1}) = \frac{1}{|Z|-1}\sum_{j \in Z,\, j \neq i}\left[A_j(t_{p+1}) - A_j(t_p)\right] \ge \alpha \quad (1)

In the formula, a certain data block has been modified on node i and found not to be duplicated on node i, while a duplicate of it exists on node j; \bar{A}_{\bar{i}}(t_{p+1}) denotes the average access count of the data block at the storage end, excluding node i, during the period t_{p+1} - t_p; \alpha is a threshold denoting the minimum access count per unit time for a block to become a hot-spot data block; A_j(t_p) and A_j(t_{p+1}) denote the access counts of the data block on node j at times t_p and t_{p+1} respectively; Z is the set of numbers of the nodes on which data block B resides.
Hot-spot duplicate data blocks are then de-duplicated with a delay, to reduce the access response time of the system; for a non-hot-spot duplicate data block, the copy on the node with relatively less remaining storage capacity among the nodes holding it is deleted, to achieve load balancing.
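The two decisions above, the formula (1) hot-spot judgement and the load-balancing choice of which copy to delete, can be sketched as follows. The dictionary layout and function names are illustrative assumptions, not part of the specification:

```python
def is_hotspot(access_counts_prev, access_counts_now, i, alpha):
    """Formula (1): average increase in access count on the nodes other
    than the modifying node i over the period [t_p, t_p+1], compared
    against the hot-spot threshold alpha."""
    others = [j for j in access_counts_now if j != i]
    if not others:
        return False
    avg = sum(access_counts_now[j] - access_counts_prev[j] for j in others) / len(others)
    return avg >= alpha

def node_to_delete_on(residual_capacity, candidate_nodes):
    """Greedy deletion for a non-hot-spot duplicate block: among the
    nodes holding the block, delete the copy on the node with the least
    remaining capacity, freeing space where it is scarcest."""
    return min(candidate_nodes, key=lambda k: residual_capacity[k])
```

For example, with counts rising by 10 and 20 on the two other holding nodes, the average increase is 15, so the block is a hot spot for any threshold alpha of 15 or below.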
Beneficial Effects
1. Existing data de-duplication mainly targets backup and archiving systems, where data is relatively static, and is not suited to cloud storage systems, in which data is shared by multiple users and multi-user modification makes the data increasingly dynamic. Aiming at this dynamic nature, the present invention takes the characteristics of the data itself into account, divides the data into hot-spot data and non-hot-spot data, and applies different de-duplication timings to different data, so as to better guarantee system performance.
2. Compared with existing de-duplication strategies in cloud storage, the present invention combines a replica management mechanism: on the premise of guaranteeing data availability, duplicate hot-spot data blocks are deleted with a delay (being temporarily treated as replicas), which relieves the users' access pressure on hot-spot data blocks within a certain period of time and therefore reduces the system response time more effectively.
3. The present invention also treats a duplicated non-hot-spot data block as a replica: comparing the storage of all the nodes holding copies, the copy on the more heavily loaded node is deleted, so that the storage load is better balanced.
Embodiments
For ease of description, the present invention gives the architecture diagram of the cloud storage de-duplication system, as shown in Figure 1. The system consists of m clients (Client), 1 metadata server (Metadata Server, MS), 1 secondary metadata server (Secondary Metadata Server, SMS) and n storage nodes (Storage Node, Snode). The clients are the parties that initiate operations such as file upload, access, modification and deletion; the metadata server mainly stores all the metadata information of the file system and provides access control and the global basis for de-duplication, being equivalent to the hub of the whole architecture; the secondary metadata server mainly undertakes the synchronous backup of the metadata image file and operation log; the storage nodes are responsible for storing the actual data blocks. In addition, the components of the system are closely connected and cooperate with one another.
Only metadata information is exchanged between the clients and the metadata server, to lighten the load on metadata transmission bandwidth. When a client is to upload data, the metadata server determines which data is non-duplicate; when a client is to access (including modify) data, the metadata server determines the node on which the data resides. Data transmission takes place between the clients and the storage nodes. The storage nodes also interact with the metadata server; for example, the metadata information of data modified on a storage node must be exchanged with the metadata server to determine whether the data is duplicated. Meanwhile, the metadata server can also create replicas for data according to the access pattern on the storage nodes, so as to reduce the access load. In an architecture with only one metadata server, the whole system is paralysed once that server fails; therefore the metadata server and the secondary metadata server stand in an active-standby relationship.
The client mainly has a file pre-processing module, a local de-duplication module, a metadata management module and a data transmission module. The file pre-processing module classifies files by type and, before block-level de-duplication, filters out the non-duplicate files smaller than 64 MB; the local de-duplication module performs de-duplication from the two angles of file level and block level; the metadata management module mainly records the fingerprint values of the data blocks the client has uploaded, to avoid uploading locally duplicated data; the data transmission module uploads the metadata information of the files to be uploaded to the metadata server and uploads the non-duplicate data blocks to the storage nodes. The modules are interconnected: files processed by the file pre-processing module are handed to the local de-duplication module for file-level de-duplication; the non-duplicate files left after file-level de-duplication are returned to the file pre-processing module for further filtering; finally the local de-duplication module performs block-level de-duplication. Whatever concerns metadata information in this process interacts with the metadata management module, and the data transmission module is the client's interface to the metadata server and the storage nodes.
The metadata server has a filtering module and an update module. The filtering module filters out the duplicate data information from the different clients by means of the metadata information in the index tables on the metadata server (distributed over memory and disk) and in the write buffer. For duplicate data blocks, the update module updates the metadata information of the corresponding blocks directly; for non-duplicate data blocks, the update module writes the updated metadata information into the on-disk index table only after feedback has been received from the storage nodes. When the data on a storage node is modified, the node also interacts with the metadata server, thereby triggering the update module to refresh the index table on the metadata server.
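The update module's two-phase behaviour, immediate updates for duplicate blocks versus updates deferred until the storage node confirms the write, can be sketched as follows. The class and method names are illustrative assumptions, not from the specification:

```python
class UpdateModule:
    """Sketch of the metadata server's update module: a duplicate block
    gets an immediate reference-count update, while a non-duplicate
    block is entered into the index only after the storage node's
    feedback arrives."""

    def __init__(self):
        self.index = {}    # fingerprint -> {"refs": count, "node": node id}
        self.pending = {}  # fingerprint -> node id, awaiting confirmation

    def on_upload(self, fingerprint, node):
        if fingerprint in self.index:        # duplicate: update at once
            self.index[fingerprint]["refs"] += 1
            return "duplicate"
        self.pending[fingerprint] = node     # new block: wait for the node
        return "new"

    def on_storage_ack(self, fingerprint):
        """Feedback from the storage node: the block is now safely stored."""
        node = self.pending.pop(fingerprint)
        self.index[fingerprint] = {"refs": 1, "node": node}
```

Deferring the index update for new blocks mirrors the write buffer described in Step 2: a fingerprint whose block is still in flight must not yet appear in the durable index.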
The storage node mainly comprises a storage module, a metadata management module, a self-check reporting module and a delayed de-duplication module. The storage module is mainly responsible for storing data blocks and recording their physical addresses; the metadata management module records the metadata information of the data blocks on the local node; the self-check reporting module mainly detects duplicate data produced by the modification of data blocks, hands it over to the delayed de-duplication module, and reports the modified metadata information to the metadata server; the delayed de-duplication module, upon detecting a duplicate data block, judges whether it is a hot-spot duplicate block: hot-spot duplicate blocks are de-duplicated with a delay, while for non-hot-spot duplicate blocks the identical block on a suitable node is selected for deletion. Whatever concerns metadata information in this module interacts with the metadata management module and the self-check reporting module.
The present invention performs data de-duplication according to the following steps:
Step 1: Each client pre-processes the local files to be uploaded, performing local de-duplication at the file level and block level to prevent duplicate data from being uploaded again, and then uploads the metadata information of the files to be uploaded (including the fingerprint value of each file and the fingerprint values of all its data blocks) to the metadata server. The fingerprint values of duplicate data blocks are uploaded so that the metadata server can update the reference counts of those blocks. The local de-duplication operations are described as follows:
1. file-level data de-duplication:Using MD5 algorithm calculation document fingerprint values, size and the equal text of type are compared
Part fingerprint value, is then compared with local metadata information table, determines duplicate file and non-duplicate file again;
2. block level data de-duplication:For non-duplicate file (having filtered out the file less than 64MB), using calmly
Long block algorithm carries out piecemeal, and block length is set to 64MB, and the fingerprint value of data block is calculated using MD5 algorithms, and it is equal to compare block length
Data block determines repeated data block.
Step 2: The metadata server receives the metadata information from the different clients, reads the file fingerprints and data block fingerprints in turn, compares them against the fingerprint index information in memory, on hard disk and in the write buffer, and finally returns to each client the fingerprint values of the data that has not yet been uploaded.
When file fingerprints are compared, if a fingerprint value is found to exist already, the fingerprints of that file's data blocks are no longer compared; otherwise the data block fingerprints of the file must also be compared. The fingerprint index table is distributed over memory and hard disk, chiefly because memory space is extremely limited, so most of the fingerprint index table is stored on hard disk. In addition, some data block fingerprint values reside in the write buffer, because the storage end has not yet finished storing the new data blocks sent by the clients, so the fingerprints of the new data cannot yet be written to hard disk.
During fingerprint comparison, the present invention sacrifices the time needed to sort files by type and size in order to exploit two observations, namely that "files of the same type and size are very likely similar files" and that "the identical blocks shared by files of different types are almost negligible", thereby continually narrowing the comparison scope.
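This narrowing of the comparison scope can be sketched as a simple bucketing step; the input layout (a list of name, type, size tuples) is an illustrative assumption:

```python
from collections import defaultdict

def group_candidates(files):
    """Narrow the fingerprint-comparison scope: only files sharing both
    type and size are candidate duplicates, so full fingerprints are
    compared only inside each (type, size) bucket."""
    buckets = defaultdict(list)
    for name, ftype, size in files:
        buckets[(ftype, size)].append(name)
    # Only buckets with at least two members need fingerprint comparison.
    return {k: v for k, v in buckets.items() if len(v) > 1}
```

Any file that lands alone in its bucket is known to be non-duplicate without a single fingerprint comparison, which is where the time saved comes from.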
Step 3: Each client uploads the data that has not yet been uploaded to the storage end; the storage end stores the new data and updates its metadata information table.
For duplicate data, the client has already updated its information on the metadata server through Step 1 and Step 2; non-duplicate data is uploaded by the client directly to the storage end. Each storage end keeps the mapping between the fingerprints of its data blocks and their storage addresses, so the physical address at which a data block is stored can be determined from its fingerprint.
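The fingerprint-to-address mapping kept at each storage end can be sketched as follows; the class and attribute names are illustrative assumptions:

```python
class StorageNode:
    """Sketch of a storage node's fingerprint-to-address mapping (Step 3):
    the physical address of a block is resolved directly from its
    fingerprint, and a fingerprint already present is not stored twice."""

    def __init__(self):
        self.blocks = []   # simulated physical storage
        self.addr_of = {}  # fingerprint -> physical address

    def store(self, fingerprint, data):
        if fingerprint not in self.addr_of:  # keep a single physical copy
            self.addr_of[fingerprint] = len(self.blocks)
            self.blocks.append(data)
        return self.addr_of[fingerprint]

    def read(self, fingerprint):
        return self.blocks[self.addr_of[fingerprint]]
```

Because lookup goes through the fingerprint, a repeated store of the same block resolves to the same physical address rather than consuming new space.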
Step 4: A client sends a request for the data to be modified, obtains from the metadata server the number of the storage node where that data resides, then connects to the storage node and modifies the data at the storage end directly.
Clients' modifications of data differ from user to user: users sharing the same data block modify it in different ways, and different data may even be modified into identical data. This is the dynamic nature of cloud storage data, and the difference between cloud storage and a backup system. In a backup system, a user modifies data locally and then backs it up again, filtering out the unmodified parts during the backup; cloud storage gives the user an experience as if the data were local: the user obtains the address of the data to be modified and modifies it directly.
Step 5: The storage end inspects the modified data block, judges which of the three cases in Table 1 the modified block belongs to, and takes the corresponding measures; the specific method and principle are shown in Figure 2.
Table 1: The three cases of a modified data block and the corresponding operations
For a modified data block, its fingerprint value must be recomputed and compared with the metadata information on the local node. If the block is found to exist on the local node, it is de-duplicated directly; if the modified block is not on the local node, it is first saved on the local node, and if comparison at the metadata server then finds it on another node, delayed de-duplication is performed; if, after comparing the fingerprint indexes on the local node and the metadata server, the modified block is found neither on the local node nor on any other node, the metadata server also creates a replica for it. Delayed de-duplication comprises operations on both hot-spot duplicate data blocks and non-hot-spot duplicate data blocks, judged by formula (1): hot-spot duplicate blocks are de-duplicated with a delay to reduce the access response time of the system, while for a non-hot-spot duplicate block the copy on the node with relatively less remaining storage capacity among the nodes holding it is deleted, to achieve load balancing.
For ease of understanding, some concepts are defined here:
Hot-spot data block: a data block whose average access frequency reaches a certain threshold within a period of time, i.e. satisfies formula (1). A data block that does not satisfy the condition is called a non-hot-spot data block.
Hot-spot duplicate data block: a modified data block A' that is not found on the local node, but for which an identical data block A is found on another node, where A is a hot-spot data block; A' is then called a hot-spot duplicate data block.
Non-hot-spot duplicate data block: a modified data block B' that is not found on the local node, but for which an identical data block B is found on another node, where B is a non-hot-spot data block; B' is then called a non-hot-spot duplicate data block.
With reference to Figure 3, the present invention further gives, for Step 5, the specific processing steps performed by the storage end when a user modifies a data block on storage node i (i = 1, 2, 3, ..., n):
1. Request (modification request): after node i receives a client's request to modify a certain data block (denoted A), it reads a copy of block A into memory;
2. Modify: node i modifies data block A in memory (the modified block is denoted B), decrements the reference count of A by 1, and computes the fingerprint value of B with the MD5 algorithm;
3. Check (duplicate detection): node i quickly searches locally for the fingerprint value of B, to avoid storing duplicate data. If it is absent, jump to step 5; otherwise denote the block in node i identical to B as B' and go to the next step;
4. Deduplicate: data block B is deleted and its storage is replaced by a pointer to data block B';
5. Store: the modified new data block B is stored on node i and the local metadata information table of node i is updated;
6. Check (duplicate detection): node i periodically sends the updated metadata information to the metadata server, which judges whether an identical block exists on another node j (j ≠ i). If one is found, jump to step 8; otherwise go to the next step;
7. Replica: the metadata server creates a replica for the new data block B;
8. Classification: the metadata server judges by formula (1) whether the duplicate data block B is a hot-spot duplicate block; if so, jump to step 10, otherwise go to the next step;
\bar{A}_{\bar{i}}(t_{p+1}) = \frac{1}{|Z|-1}\sum_{j \in Z,\, j \neq i}\left[A_j(t_{p+1}) - A_j(t_p)\right] \ge \alpha \quad (1)

In the formula, at time t_{p+1} a certain data block has been modified on node i and found not to be duplicated on node i, while a duplicate of it exists on node j; \bar{A}_{\bar{i}}(t_{p+1}) denotes the average access count of the data block at the storage end (excluding node i) during the period t_{p+1} - t_p; \alpha is a threshold denoting the minimum access count per unit time for a block to become a hot-spot data block; A_j(t_p) and A_j(t_{p+1}) denote the access counts of the data block on node j at times t_p and t_{p+1} respectively; Z is the set of numbers of the nodes where data block B resides.
9. Greedy deletion: at time t_{p+1}, compare the residual capacities S_k(t_{p+1}) of the nodes k (k ∈ Z) holding the non-hot-spot duplicate data block B with the average residual capacity \bar{S}(t_{p+1}), and always delete the copy of data block B on the node with relatively less residual capacity; then update the metadata server. The average residual capacity of the storage end at time t_{p+1} is obtained as in formula (2):

\bar{S}(t_{p+1}) = \frac{1}{n}\sum_{m=1}^{n} S_m(t_{p+1}) \quad (2)

where S_m(t_{p+1}) is the residual storage capacity of node m at time t_{p+1} and n is the total number of nodes at the storage end.
10. Delayed deletion: at time t_{p+1}, hot-spot data block B is not deleted; the metadata of block B is synchronized on node j, and at the next time t_{p+2} step 8 is performed again.
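The decision flow of steps 1 to 10 above can be condensed into a single sketch. This compresses the protocol under simplifying assumptions (the read of A and the decrement of its reference count are omitted, `node` maps fingerprints to entries, `metadata_server` maps fingerprints to the set of holding nodes, and the formula (1) judgement is supplied by the caller); all names are illustrative, not from the specification:

```python
import hashlib

def handle_modification(node, new_data, metadata_server, is_hotspot):
    """Sketch of the storage-end flow of Figure 3 for a modified block B."""
    fp = hashlib.md5(new_data).hexdigest()               # step 2: fingerprint of B
    if fp in node["blocks"]:                             # steps 3-4: local duplicate B'
        node["blocks"][fp]["refs"] += 1                  # replace B by a pointer to B'
        return "deduplicated locally"
    node["blocks"][fp] = {"data": new_data, "refs": 1}   # step 5: store B locally
    holders = metadata_server.get(fp, set())             # step 6: ask the metadata server
    if not holders:                                      # step 7: globally unique
        return "replica created"
    if is_hotspot(fp):                                   # steps 8 and 10: keep for now
        return "delayed deletion"
    return "greedy deletion"                             # step 9: delete on the fullest node
```

Each return value names the branch taken; in the full method, "greedy deletion" would then invoke the formula (2) capacity comparison to pick the node on which the copy is removed.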