CN104408154A

CN104408154A - Data deduplication method and device

Info

Publication number: CN104408154A
Application number: CN201410729944.2A
Authority: CN
Inventors: 余健; 钟延辉
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2014-12-04
Filing date: 2014-12-04
Publication date: 2015-03-11
Anticipated expiration: 2034-12-04
Also published as: CN104408154B

Abstract

An embodiment of the invention discloses a data deduplication method and device. The method includes: dividing received initial data into blocks; obtaining the fingerprint of each data block; generating a first check value of the initial data according to the fingerprints of the data blocks; searching for and deleting the duplicate data blocks in the initial data; storing the unique blocks that remain after the duplicate data blocks are deleted; generating, according to the fingerprints of the unique blocks and in the same manner as the first check value, a second check value of the data after deduplication; and comparing the first check value with the second check value, and determining that the initial data has undergone correct deduplication when the two are identical. The method and device can ensure data accuracy during deduplication.

Description

Data de-duplication method and device

Technical field

The present invention relates to the field of data processing, and in particular to a data deduplication method and device.

Background technology

Data deduplication is a data reduction technique applied in storage systems, and is especially common in disk-based backup systems. It works by dividing the data in the backup system into fixed-length or variable-length data blocks; for duplicate data blocks, only one copy is retained and the others are replaced with indicators, thereby eliminating redundant data.

In the prior art, to ensure the correctness of the stored data after deduplication, the underlying storage may use RAID 6 to guard against double-disk failure; alternatively, an additional copy of the current data may be made by remote replication; in addition, background inspection may periodically check the correctness of the data stored after deduplication.

However, if a program error occurs during deduplication, or the program is subjected to an external attack, blocks may be referenced incorrectly during deduplication, causing data errors to spread, and the correctness of the deduplication can only be confirmed by a subsequent inspection operation.

It can be seen that, even with background inspection, the prior art cannot guarantee the correctness of data processing during the deduplication process itself.

Summary of the invention

Embodiments of the present invention provide a data deduplication method and device that can safeguard the accuracy of data during the deduplication process.

To solve the above technical problem, embodiments of the present invention disclose the following technical solutions:

In one aspect, a data deduplication method is provided, comprising:

dividing received initial data into blocks;

obtaining a fingerprint of each data block;

generating a first check value of the initial data according to the fingerprints of all the data blocks;

searching for duplicate data blocks in the initial data, and deleting the duplicate data blocks;

storing the unique blocks that remain after the duplicate data blocks are deleted;

generating, according to the fingerprints of all the unique blocks and in the same manner as the first check value, a second check value of the data after deduplication has been completed;

comparing the first check value with the second check value, and when the first check value is consistent with the second check value, determining that the initial data has undergone correct deduplication.

With reference to the foregoing aspect, in a first possible implementation, dividing the received initial data into blocks comprises:

performing fixed-length or variable-length blocking on the initial data in each received equal-capacity data processing unit, wherein, before being divided into blocks, the initial data is first divided into a plurality of the equal-capacity data processing units.

With reference to the foregoing aspect and the first possible implementation, in a second possible implementation, generating the first check value of the initial data according to the fingerprints of all the data blocks comprises:

iteratively generating, according to the fingerprints of all the data blocks in each data processing unit, a first sub-check value corresponding to that data processing unit;

generating the first check value of the initial data from all the first sub-check values in the same iterative manner.

With reference to the foregoing aspect and the second possible implementation, in a third possible implementation, storing the unique blocks that remain after the duplicate data blocks are deleted comprises:

storing the unique blocks of each data processing unit that remain after the duplicate data blocks are deleted in a corresponding container, and storing in the container the fingerprint of each unique block it holds.

With reference to the foregoing aspect and the second possible implementation, in a fourth possible implementation, storing the unique blocks that remain after the duplicate data blocks are deleted comprises:

compressing the unique blocks of each data processing unit that remain after the duplicate data blocks are deleted;

storing the compressed unique blocks in the container corresponding to each data processing unit, and storing in the container the fingerprint of each compressed unique block it holds.

With reference to the foregoing aspect and the third or fourth possible implementation, in a fifth possible implementation, generating, according to the fingerprints of all the unique blocks and in the same manner as the first check value, the second check value of the data after deduplication has been completed comprises:

iteratively generating, according to the fingerprints of all the unique blocks in each container, a second sub-check value corresponding to that container;

generating the second check value of the data after deduplication has been completed from all the second sub-check values in the same iterative manner.

With reference to the foregoing aspect and the fifth possible implementation, in a sixth possible implementation, the method further comprises:

when a container-level inspection is performed, iteratively generating, according to the fingerprints of all the unique blocks in each container at the time of inspection, a third sub-check value corresponding to the current state of that container;

comparing the second sub-check value and the third sub-check value corresponding to each container, and when the second sub-check value is inconsistent with the third sub-check value, determining that the data in the corresponding container is corrupted.

With reference to the foregoing aspect and the fifth or sixth possible implementation, in a seventh possible implementation, the method further comprises:

when a space reclamation check is performed, iteratively generating, according to the fingerprints of all the unique blocks in each container at the time of the check, a fourth sub-check value corresponding to the current state of that container;

comparing the second sub-check value and the fourth sub-check value corresponding to each container, and when the second sub-check value is inconsistent with the fourth sub-check value, determining that a data block in the corresponding container space has been reclaimed by mistake.

With reference to the foregoing aspect and the second possible implementation, in an eighth possible implementation, iteratively generating, according to the fingerprints of all the data blocks in each data processing unit, the first sub-check value corresponding to that data processing unit comprises:

supposing that the data processing unit contains N (N ≥ 2) data blocks in the assembly order of the initial data, iteratively generating a first sub-checksum according to the first fingerprint of the first data block and the second fingerprint of the second data block;

iteratively generating a second sub-checksum according to the third fingerprint of the third data block and the first sub-checksum;

and so on, until the first sub-check value is iteratively generated according to the N-th fingerprint of the N-th data block and the (N-1)-th sub-checksum.

With reference to the foregoing aspect and the eighth possible implementation, in a ninth possible implementation, iteratively generating, according to the fingerprints of all the unique blocks in each container, the second sub-check value corresponding to that container comprises:

supposing that the container contains M (N > M ≥ 2) data blocks in the blocking order of the initial data, iteratively generating a first sub-checksum according to the first fingerprint of the first data block and the second fingerprint of the second data block;

and so on, until the second sub-check value is iteratively generated according to the M-th fingerprint of the M-th data block and the (M-1)-th sub-checksum.

In another aspect, a data deduplication device is provided, comprising:

a data blocking module, configured to divide received initial data into blocks;

a fingerprint acquisition module, configured to obtain a fingerprint of each data block;

a first check value generation module, configured to generate a first check value of the initial data according to the fingerprints of all the data blocks;

a deduplication module, configured to search for duplicate data blocks in the initial data and delete the duplicate data blocks;

a unique block storage module, configured to store the unique blocks that remain after the duplicate data blocks are deleted;

a second check value generation module, configured to generate, according to the fingerprints of all the unique blocks and in the same manner as the first check value, a second check value of the data after deduplication has been completed;

a check value comparison module, configured to compare the first check value with the second check value, and when the first check value is consistent with the second check value, determine that the initial data has undergone correct deduplication.

With reference to the foregoing other aspect, in a first possible implementation, the data blocking module is specifically configured to perform fixed-length or variable-length blocking on the initial data in each received equal-capacity data processing unit, wherein, before being divided into blocks, the initial data is first divided into a plurality of the equal-capacity data processing units.

With reference to the foregoing aspect and the first possible implementation, in a second possible implementation, the first check value generation module comprises:

a first sub-check value generation unit, configured to iteratively generate, according to the fingerprints of all the data blocks in each data processing unit, a first sub-check value corresponding to that data processing unit;

a first check value determining unit, configured to generate the first check value of the initial data from all the first sub-check values in the same iterative manner.

With reference to the foregoing aspect and the second possible implementation, in a third possible implementation, the unique block storage module comprises:

a first container storage unit, configured to store the unique blocks of each data processing unit that remain after the duplicate data blocks are deleted in a container;

a first fingerprint storage unit, configured to store in the container the fingerprint of each unique block it holds.

With reference to the foregoing aspect and the second possible implementation, in a fourth possible implementation, the unique block storage module comprises:

a compression unit, configured to compress the unique blocks of each data processing unit that remain after the duplicate data blocks are deleted;

a second container storage unit, configured to store the compressed unique blocks in the container corresponding to each data processing unit;

a second fingerprint storage unit, configured to store in the container the fingerprint of each compressed unique block it holds.

With reference to the foregoing aspect and the third or fourth possible implementation, in a fifth possible implementation, the second check value generation module comprises:

a second sub-check value generation unit, configured to iteratively generate, according to the fingerprints of all the unique blocks in each container, a second sub-check value corresponding to that container;

a second check value determining unit, configured to generate the second check value of the data after deduplication has been completed from all the second sub-check values in the same iterative manner.

An inspection module, configured to: when a container-level inspection is performed, iteratively generate, according to the fingerprints of all the unique blocks in each container at the time of inspection, a third sub-check value corresponding to the current state of that container; and compare the second sub-check value and the third sub-check value corresponding to each container, and when the second sub-check value is inconsistent with the third sub-check value, determine that the data in the corresponding container is corrupted.

A space reclamation check module, configured to: when a space reclamation check is performed, iteratively generate, according to the fingerprints of all the unique blocks in each container at the time of the check, a fourth sub-check value corresponding to the current state of that container; and compare the second sub-check value and the fourth sub-check value corresponding to each container, and when the second sub-check value is inconsistent with the fourth sub-check value, determine that a data block in the corresponding container space has been reclaimed by mistake.

In the embodiments of the present invention, during deduplication, once the initial data has been divided into blocks, the first check value of the initial data is generated according to the fingerprints of all the data blocks. After the duplicate data blocks are deleted, the current second check value of the data file can be generated according to the fingerprints of the current unique blocks. The second check value is then compared with the first check value; if the two check values are consistent, the deduplication process of the initial data is reliable. Comparing the check values of the data file before and after the duplicate data is deleted can greatly improve the efficiency of verifying the reliability of the deduplication flow, and safeguards the accuracy of data during deduplication.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, a person of ordinary skill in the art may derive other drawings from these accompanying drawings without creative effort.

Fig. 1 is a flowchart of an embodiment of a data deduplication method provided by the present invention;

Fig. 2 is a flowchart of an implementation of step 103 in Fig. 1;

Fig. 3 is a schematic diagram of generating the first check value according to data block fingerprints, provided by the present invention;

Fig. 4 is a schematic diagram of the manner of generating the second check value in an embodiment of the present invention;

Fig. 5 is a schematic diagram of storing unique blocks in containers in an embodiment of the present invention;

Fig. 6 is a schematic diagram of iteratively generating a check value in an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of an embodiment of a data deduplication device provided by the present invention;

Fig. 8 is a schematic structural diagram of the first check value generation module in Fig. 7;

Fig. 9 is a schematic structural diagram of the unique block storage module in Fig. 7;

Fig. 10 is another schematic structural diagram of the unique block storage module in Fig. 7;

Fig. 11 is a schematic structural diagram of the second check value generation module in Fig. 7;

Fig. 12 is a schematic structural diagram of another embodiment of a data deduplication device provided by the present invention;

Fig. 13 is a schematic structural diagram of a further embodiment of a data deduplication device provided by the present invention;

Fig. 14 is a schematic structural diagram of yet another embodiment of a data deduplication device provided by the present invention.

Detailed description

To enable a person skilled in the art to better understand the technical solutions in the embodiments of the present invention, and to make the foregoing objectives, features, and advantages of the embodiments more apparent, the technical solutions are described in further detail below with reference to the accompanying drawings.

Referring to Fig. 1, which shows the flow of an embodiment of a data deduplication method provided by the present invention, the method may specifically comprise:

Step 101, divide the received initial data into blocks.

In this step, after the initial data is received, it needs to be divided into blocks.

In a preferred embodiment of the present invention, the initial data may first be divided into a plurality of equal-capacity data processing units (Buffers). After the initial data delivered in the form of Buffers is received, the initial data in each Buffer may be divided into blocks. The blocking may be fixed-length or variable-length.

When the initial data is divided into Buffers of a uniform size for deduplication, space allocation is uniform and the number of data blocks in each Buffer is relatively fixed, so the number of data blocks does not vary greatly with the size of the original file data. In addition, compared with file-level deduplication, deduplication performed per Buffer has a finer granularity and can save storage space.
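As a non-limiting illustration, the division into Buffers and the fixed-length blocking described above might be sketched as follows. The function names `split_into_buffers` and `fixed_length_blocks`, and the sizes used, are hypothetical and are not specified by this document; variable-length blocking is not shown.

```python
from typing import List


def split_into_buffers(data: bytes, buffer_size: int) -> List[bytes]:
    """Divide the initial data into equal-capacity Buffers (the last may be shorter)."""
    return [data[i:i + buffer_size] for i in range(0, len(data), buffer_size)]


def fixed_length_blocks(buffer: bytes, block_size: int) -> List[bytes]:
    """Divide one Buffer into fixed-length data blocks."""
    return [buffer[i:i + block_size] for i in range(0, len(buffer), block_size)]


# Illustrative sizes only; real Buffers and blocks would be far larger.
buffers = split_into_buffers(b"abcdefghij", buffer_size=4)
blocks_per_buffer = [fixed_length_blocks(buf, block_size=2) for buf in buffers]
```

Because every Buffer except possibly the last has the same capacity, the number of blocks per Buffer stays roughly constant regardless of the total file size.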

Step 102, obtain the fingerprint of each data block.

In this step, after the data in each Buffer has been divided into blocks, the fingerprint of each block may be calculated by a hash algorithm, for use in subsequent comparison of data features.
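A minimal sketch of the fingerprint calculation is given below. The document only says that a hash algorithm is used; the choice of SHA-1 and the function name `fingerprint` are assumptions made for illustration.

```python
import hashlib


def fingerprint(block: bytes) -> bytes:
    """Return a fixed-size digest of a data block (SHA-1 is an assumed choice)."""
    return hashlib.sha1(block).digest()


fp_a = fingerprint(b"block A")
fp_b = fingerprint(b"block B")
```

The digest is much smaller than the block itself, which is what makes fingerprint-based comparison and check value calculation inexpensive.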

Step 103, generate the first check value of the initial data according to the fingerprints of all the data blocks.

In this step, a check value of the entire initial data is finally generated according to the fingerprints of all the data blocks. To distinguish it from the other check values that appear below, the check value generated before deduplication is performed is called the "first check value".

In the embodiments of the present invention, the fingerprint is used as the basis for identifying unique blocks. A fingerprint is equivalent to a digest of a block and is very small compared with the block itself. Calculating the check value from fingerprints therefore incurs little space overhead, occupies few resources such as CPU, and is efficient; it has little impact on the performance of the deduplication system as a whole and does not affect the running of services.

Step 104, search for the duplicate data blocks in the initial data, and delete the duplicate data blocks.

In this step, the deduplication operation is performed. Specifically, similarity analysis is used to determine whether duplicate data blocks exist in the initial data. When duplicate data blocks exist, they are deleted so that only one copy of each is retained, and the other, deleted duplicates are replaced with indicators.
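The deduplication step can be sketched as follows, under the assumption that two blocks are treated as duplicates exactly when their fingerprints are equal (the document refers more generally to similarity analysis). The names `deduplicate`, `store`, and `recipe` are hypothetical; the fingerprints kept in `recipe` play the role of the indicators.

```python
import hashlib
from typing import Dict, List, Tuple


def deduplicate(blocks: List[bytes]) -> Tuple[Dict[bytes, bytes], List[bytes]]:
    """Keep one copy of each distinct block and record an indicator for every block."""
    store: Dict[bytes, bytes] = {}   # fingerprint -> the single retained copy
    recipe: List[bytes] = []         # one indicator per original block, in order
    for block in blocks:
        fp = hashlib.sha1(block).digest()  # SHA-1 is an assumed hash choice
        if fp not in store:
            store[fp] = block              # first occurrence: keep as unique block
        recipe.append(fp)                  # duplicates are represented only here
    return store, recipe


store, recipe = deduplicate([b"a", b"b", b"a"])
restored = b"".join(store[fp] for fp in recipe)
```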

Step 105, store the unique blocks that remain after the duplicate data blocks are deleted.

In this step, all the unique blocks that remain after the duplicate data blocks are deleted are stored.

Step 106, generate, according to the fingerprints of all the unique blocks and in the same manner as the first check value, the second check value of the data after deduplication has been completed.

In this step, after the unique blocks have been stored, a check value of the data after deduplication is generated according to the fingerprints of the unique blocks; this is called the "second check value". The second check value is generated in the same manner as the first check value.

Step 107, compare the first check value with the second check value, and when the first check value is consistent with the second check value, determine that the initial data has undergone correct deduplication.

In this step, after deduplication is completed, the first check value and the second check value corresponding to the initial data may be compared. If the first check value is consistent with the second check value, the current deduplication flow is reliable; otherwise, the current deduplication flow is incorrect and the data is inconsistent.

In this embodiment of the present invention, during deduplication, once the initial data has been divided into blocks, the first check value of the initial data is generated according to the fingerprints of all the data blocks. After the duplicate data blocks are deleted, the current second check value of the data file can be generated according to the fingerprints of the current unique blocks. The second check value is then compared with the first check value; if the two check values are consistent, the deduplication process of the initial data is reliable. Comparing the check values of the data file before and after the duplicate data is deleted can greatly improve the efficiency of verifying the reliability of the deduplication flow, and safeguards the accuracy of data during deduplication.
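The flow of steps 101 to 107 can be illustrated with the following sketch. It rests on several assumptions that the document does not spell out: SHA-1 fingerprints, a running CRC-32 as the iterative check value, and a second check value computed by following the recorded references in original block order and reading the fingerprint stored next to each unique block. All function names are hypothetical.

```python
import hashlib
import zlib
from typing import Iterable, List, Tuple


def fingerprint(block: bytes) -> bytes:
    return hashlib.sha1(block).digest()


def check_value(fingerprints: Iterable[bytes]) -> int:
    """Fold fingerprints, in order, into one running CRC-32."""
    value = 0
    for fp in fingerprints:
        value = zlib.crc32(fp, value)
    return value


def deduplicate(blocks: List[bytes]) -> Tuple[List[Tuple[bytes, bytes]], List[int]]:
    """Return stored (fingerprint, unique block) entries and per-block references."""
    container: List[Tuple[bytes, bytes]] = []
    position = {}
    refs: List[int] = []
    for block in blocks:
        fp = fingerprint(block)
        if fp not in position:
            position[fp] = len(container)
            container.append((fp, block))
        refs.append(position[fp])
    return container, refs


def second_check_value(container, refs) -> int:
    """Recompute the check value from stored fingerprints via the references."""
    return check_value(container[i][0] for i in refs)


blocks = [b"A", b"B", b"A", b"C"]
first = check_value(fingerprint(b) for b in blocks)   # step 103
container, refs = deduplicate(blocks)                 # steps 104 and 105
second = second_check_value(container, refs)          # step 106
is_correct = first == second                          # step 107
```

Under these assumptions, a faulty reference (for example, the third block pointing at the stored copy of `b"B"` instead of `b"A"`) changes the second check value, so the mismatch is detected at deduplication time rather than by a later inspection.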

To facilitate understanding of the technical solutions of the present invention, they are described in detail below with reference to specific embodiments.

Fig. 2 shows a specific implementation of the foregoing step 103, which may comprise:

Step 201, iteratively generate, according to the fingerprints of all the data blocks in each data processing unit, a first sub-check value corresponding to that data processing unit;

Step 202, obtain a first check value set composed of all the first sub-check values, and determine the first check value set as the first check value of the initial data.

Fig. 3 is a schematic diagram of generating the first check value according to data block fingerprints. The file data to be processed is first divided into a plurality of Buffers, shown as Buffer1 to BufferN in the figure, and the initial data in each Buffer is divided into blocks. The blocking may be fixed-length or variable-length. Taking Buffer1 and BufferN as examples, Buffer1 is divided into i data blocks, block 1 to block i, and the data blocks of BufferN are block j to block k. After the data in each Buffer has been divided into blocks, the fingerprint of each block is calculated by a hash algorithm. As shown in Fig. 3, suppose that the fingerprints corresponding to block 1 to block i are fingerprint 1 to fingerprint i. From fingerprint 1 to fingerprint i, the sub-check value corresponding to Buffer1 can be generated iteratively, namely checksum 1. Similarly, from fingerprint j to fingerprint k, the sub-check value corresponding to BufferN can be generated iteratively, namely checksum N. After the sub-check values of the N Buffers (checksum 1 to checksum N) have been generated, the check value set formed by the sub-check values of the N Buffers can be obtained, namely checksum C. Checksum C is then the check value corresponding to the original file data, and is used subsequently to verify the reliability of the deduplication flow.
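The structure of Fig. 3 might be sketched as below, assuming SHA-1 fingerprints and a running CRC-32 for the per-Buffer iteration (neither is mandated by this document). Checksum C is represented as a tuple of the per-Buffer sub-check values, matching its description as a check value set; the function names are hypothetical.

```python
import hashlib
import zlib
from typing import List, Tuple


def buffer_check_value(blocks: List[bytes]) -> int:
    """Sub-check value of one Buffer: fold its block fingerprints in block order."""
    value = 0
    for block in blocks:
        value = zlib.crc32(hashlib.sha1(block).digest(), value)
    return value


def file_check_value(buffers: List[List[bytes]]) -> Tuple[int, ...]:
    """Checksum C: the ordered collection of checksum 1 .. checksum N."""
    return tuple(buffer_check_value(blocks) for blocks in buffers)


buffers = [[b"a", b"b"], [b"c"]]
checksum_c = file_check_value(buffers)
```

Keeping the sub-check values separate, rather than collapsing them immediately, is what later allows a mismatch to be attributed to a particular Buffer.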

Usually, the unique blocks that remain after the duplicate data is deleted are written to containers for storage. In the embodiments of the present invention, the unique blocks may be stored in either of the following two ways:

In the first way, the unique blocks of each data processing unit that remain after the duplicate data blocks are deleted are stored in a corresponding container, and the fingerprint of each unique block is stored in the container.

In the second way, the unique blocks of each data processing unit that remain after the duplicate data blocks are deleted are compressed; the compressed unique blocks are then stored in the container corresponding to each data processing unit, and the fingerprint of each unique block is stored in the container.

When storage of the unique blocks is completed, the second check value of the data after deduplication needs to be generated according to the fingerprints of all the unique blocks, in the same manner as the first check value. Specifically, the second check value is generated as shown in Fig. 4, comprising:

Step 401, iteratively generate, according to the fingerprints of all the unique blocks in each container, a second sub-check value corresponding to that container;

Step 402, obtain a second check value set composed of all the second sub-check values, and determine the second check value set as the second check value of the data after deduplication has been completed.

In the first implementation of storing unique blocks described above, as shown in Fig. 5, suppose that the unique blocks remaining from data block 1 to data block i of Buffer1 after deduplication are stored in container 1 and container 2. The fingerprint (fp) of each unique block is stored at the same time in the metadata space of the data block in the container, and the deleted duplicate data blocks are recorded by reference. According to the fingerprints of the unique blocks in container 1 and container 2, the second sub-check value corresponding to these two containers is iteratively generated, namely checksum 1'. Further, the set of second sub-check values corresponding to the containers that hold all the unique blocks of the original file data is obtained, namely checksum C'. Note that the second sub-check value is generated in the same manner as the first sub-check value. By comparing checksum C' with checksum C in Fig. 3, it can be determined whether the original file data has undergone a correct deduplication flow. When checksum C' = checksum C, the current deduplication flow is reliable.

In addition, because the unique blocks stored in the containers correspond to a single Buffer, separately comparing checksum N' with checksum N makes it possible to determine whether the corresponding Buffer has undergone a correct deduplication flow.
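The per-Buffer comparison can be illustrated with a short sketch. It assumes that the first sub-check values (checksum 1 to checksum N) and the second sub-check values (checksum 1' to checksum N') are available as equal-length sequences in Buffer order; the function name `mismatched_buffers` is hypothetical.

```python
from typing import List, Sequence


def mismatched_buffers(first_subs: Sequence[int], second_subs: Sequence[int]) -> List[int]:
    """Return the zero-based indices of Buffers whose sub-check values disagree."""
    if len(first_subs) != len(second_subs):
        raise ValueError("both sequences must cover the same Buffers")
    return [i for i, (a, b) in enumerate(zip(first_subs, second_subs)) if a != b]


faulty = mismatched_buffers([11, 22, 33], [11, 99, 33])
```

An empty result corresponds to checksum C' = checksum C; a non-empty result narrows the fault down to specific Buffers.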

The second implementation of storing unique blocks differs from the first in that, after the data blocks in a Buffer have been deduplicated and before the unique blocks are written to the corresponding container, the unique blocks may first be compressed; the compressed unique blocks are then stored in the container corresponding to each data processing unit. This approach is suitable for larger volumes of data, since compression can greatly reduce container occupancy. As in the first implementation, when the unique blocks are stored in a container, the fingerprints of those unique blocks also need to be stored in the container. Further, according to the fingerprints of the unique blocks in each container, the second sub-check value corresponding to the container is iteratively generated, and the set of second sub-check values corresponding to the containers that hold all the unique blocks of the original file data is then obtained, namely checksum C'. By comparing checksum C' with checksum C in Fig. 3, it can be determined whether the original file data has undergone a correct deduplication flow. When checksum C' = checksum C, the current deduplication flow is reliable.

It should be noted that, to preserve the ordering of the blocks of the file data, the check value may be synthesized step by step in the order of the data blocks.

Taking the iterative generation of checksum 1 for Buffer1 as an example, as shown in Fig. 6, the data in Buffer1 is divided into i data blocks in its original assembly order. A first sub-checksum is iteratively generated according to the first fingerprint of the first data block and the second fingerprint of the second data block; a second sub-checksum is iteratively generated according to the third fingerprint of the third data block and the first sub-checksum; and so on, until the first sub-check value, namely checksum 1 in Fig. 3, is iteratively generated according to the i-th fingerprint of the i-th data block and the (i-1)-th sub-checksum. Every other Buffer generates its own first sub-check value in the same iterative manner. In addition, after the first sub-check value corresponding to each Buffer has been generated, the first check value of the original file data needs to be iteratively generated from all the first sub-check values in the same iterative manner.
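The chained iteration of Fig. 6 might be sketched as follows. The combining function is an assumption: here each sub-checksum is the CRC-32 of the previous sub-checksum concatenated with the next fingerprint, whereas the document leaves the algorithm open. The names `combine` and `iterate_sub_checksums` are hypothetical.

```python
import zlib
from typing import List, Sequence


def combine(previous: bytes, fingerprint: bytes) -> bytes:
    """Fold one more fingerprint into the running value (CRC-32 is an assumed choice)."""
    return zlib.crc32(previous + fingerprint).to_bytes(4, "big")


def iterate_sub_checksums(fingerprints: Sequence[bytes]) -> List[bytes]:
    """Return every intermediate sub-checksum; the last one is the sub-check value."""
    if len(fingerprints) < 2:
        raise ValueError("at least two fingerprints are required")
    # First sub-checksum: from the first and second fingerprints.
    chain = [combine(fingerprints[0], fingerprints[1])]
    # Each later sub-checksum: from the previous sub-checksum and the next fingerprint.
    for fp in fingerprints[2:]:
        chain.append(combine(chain[-1], fp))
    return chain


fps = [bytes([n]) * 20 for n in (1, 2, 3, 4)]  # stand-in 20-byte fingerprints
chain = iterate_sub_checksums(fps)
sub_check_value = chain[-1]
```

Retaining the intermediate values in `chain` is one way to support step-by-step verification: the first position at which two chains diverge indicates the data block at which an error was introduced.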

Owing to have employed the method for substep iteration synthesis proof test value, therefore, can support that the substep of proof test value is verified, find the deblocking point of makeing mistakes, thus promote proof strength.

In this embodiment, the algorithm used to iteratively generate the check value may be chosen from, for example, 4-byte XOR, CRC, Adler32 and Rabin. These algorithms differ in computational complexity and in the scenarios they suit; they are well studied and are not described further here.

In short, if the verification complexity must be kept as low as possible and only the integrity of the block information in the de-duplication flow needs to be verified, so as to speed up checking, the check value need not be generated iteratively in the order of the data blocks of the file. That is, an order-independent computation, such as 1-byte XOR, may be used to generate the check value.
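An order-independent check value of the 1-byte XOR kind can be sketched as follows; the function name `xor_check_value` and the sample fingerprints are made up for the example.

```python
def xor_check_value(fingerprints) -> int:
    # XOR every byte of every fingerprint into a single byte.
    value = 0
    for fp in fingerprints:
        for byte in fp:
            value ^= byte
    return value


fps = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
forward = xor_check_value(fps)
shuffled = xor_check_value([fps[2], fps[0], fps[1]])
```

Because XOR is commutative and associative, `forward` and `shuffled` are equal, so the blocks can be processed in any order; the price is that the value carries no information about block order.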

Usually, after data de-duplication is completed, an inspection flow may be added to verify whether the unique block data stored in the containers has been damaged. In practice, the inspection flow may be initiated periodically: for example, a time interval threshold between two adjacent inspections is preset, and an inspection flow is initiated when that threshold is reached. Specifically, inspection can be divided into a single-file inspection mode and a container-level inspection mode.
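The periodic trigger can be expressed as a simple interval test. The sketch below assumes timestamps in seconds from a monotonic clock; `inspection_due` and its parameters are hypothetical names.

```python
import time


def inspection_due(last_inspection: float, interval_threshold: float, now=None) -> bool:
    # True once the preset interval since the previous inspection has elapsed.
    if now is None:
        now = time.monotonic()
    return now - last_inspection >= interval_threshold
```

A scheduler would call `inspection_due` regularly and start a single-file or container-level inspection whenever it returns `True`.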

In the single-file inspection mode, a second check value of the current file data in the containers is generated from the fingerprints of the unique blocks in all containers, using the same generation mode as the first check value of the original file data. The second check value is compared with the first check value; when the two are consistent, it is determined that the initial data has undergone correct data de-duplication, which ensures the consistency of the whole file data during data de-duplication.
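A single-file inspection might look like the sketch below. It assumes a running CRC32 and, importantly, a per-file index (called `recipe` here) that records for every data block of the original file which container slot holds its fingerprint; the source does not describe such an index in this passage, so it is an assumption of the example.

```python
import zlib


def fold_check_value(fingerprints) -> int:
    # Running CRC32 over the fingerprints, in the order given.
    value = 0
    for fp in fingerprints:
        value = zlib.crc32(fp, value)
    return value


def single_file_inspection(first_check_value, recipe, containers) -> bool:
    # recipe: hypothetical list of (container index, slot index) pairs,
    # one per data block of the original file, in original order.
    stored = (containers[c][slot] for c, slot in recipe)
    second_check_value = fold_check_value(stored)
    return second_check_value == first_check_value


fp_a, fp_b = b"\x01" * 20, b"\x02" * 20   # made-up fingerprints
original_order = [fp_a, fp_b, fp_a]       # the file repeats block A
first = fold_check_value(original_order)
containers = [[fp_a], [fp_b]]             # each unique fingerprint stored once
recipe = [(0, 0), (1, 0), (0, 0)]
ok = single_file_inspection(first, recipe, containers)
```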

In the container-level inspection mode, each container, when it is first filled with unique blocks, stores the fingerprint of each unique block in it and the second sub-check value corresponding to those unique blocks. At inspection time, the sub-check value currently corresponding to each container, namely the third sub-check value, can therefore be synthesized from the fingerprints of the unique blocks stored in the container at that moment. The third sub-check value identifies the unique blocks currently in the container. The third sub-check value is compared with the second sub-check value; when the two are inconsistent, it is determined that the data in the corresponding container is inconsistent with the data initially filled, and silent data corruption of the physical storage, caused for example by an external attack or media damage, may have occurred.
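The container-level comparison can be sketched as follows, assuming a running CRC32 and a hypothetical container record that holds the stored fingerprints together with the second sub-check value written at fill time.

```python
import zlib


def sub_check_value(fingerprints) -> int:
    # Running CRC32 over the fingerprints, in stored order.
    value = 0
    for fp in fingerprints:
        value = zlib.crc32(fp, value)
    return value


def new_container(fingerprints):
    # Record the second sub-check value when the container is first filled.
    fps = list(fingerprints)
    return {"fingerprints": fps, "second_sub_check": sub_check_value(fps)}


def inspect_containers(containers):
    # Return the ids of containers whose third sub-check value differs
    # from the second sub-check value stored at fill time.
    damaged = []
    for cid, c in containers.items():
        third = sub_check_value(c["fingerprints"])
        if third != c["second_sub_check"]:
            damaged.append(cid)
    return damaged


containers = {
    "c1": new_container([b"\x01" * 20, b"\x02" * 20]),
    "c2": new_container([b"\x03" * 20, b"\x04" * 20]),
}
# Simulate silent corruption: one byte of one fingerprint in c2 changes.
containers["c2"]["fingerprints"][0] = b"\x03" * 19 + b"\xff"
damaged = inspect_containers(containers)
```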

It should also be noted that, after a container has been filled with unique blocks, some data blocks may go unused for a long time. Containers therefore need to be managed by a space reclamation mechanism: containers whose storage space utilization is below a certain threshold are consolidated, and the container resources are released after the useful data blocks in them have been migrated. In practice, data deletion and data rewriting in a container both change the reference count of the data blocks in that container. When a data block has not been referenced for a long time, its data can be regarded as discardable; the data block is then removed, and it may be marked "to be reclaimed" or released to the resource pool. When a data block is reclaimed by mistake, the sub-check value generated from the fingerprints of the remaining data blocks in the container changes and no longer matches the initial sub-check value. A check value comparison can therefore verify whether a data block was wrongly reclaimed during space reclamation.

Specifically, during a space reclamation check, the check value currently corresponding to each container, called the "fourth sub-check value", is iteratively generated from the fingerprints of all unique blocks in the container at check time. The fourth sub-check value is compared with the second sub-check value recorded when the container was first filled with unique blocks; when the two are inconsistent, it is determined that a data block in the corresponding container space has been reclaimed by mistake.
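The space reclamation check can be sketched in the same style. This example assumes Adler-32, one of the algorithms the description lists, as the iterative check algorithm; the function names and the sample fingerprints are hypothetical.

```python
import zlib


def sub_check_value(fingerprints) -> int:
    # Running Adler-32 over the fingerprints; 1 is the Adler-32 start value.
    value = 1
    for fp in fingerprints:
        value = zlib.adler32(fp, value)
    return value


def reclamation_is_correct(second_sub_check, remaining_fingerprints) -> bool:
    # Compare the fourth sub-check value, computed now, with the second
    # sub-check value recorded when the container was first filled.
    fourth = sub_check_value(remaining_fingerprints)
    return fourth == second_sub_check


fps = [b"\x01" * 20, b"\x02" * 20, b"\x03" * 20]
second = sub_check_value(fps)
after_bad_reclaim = fps[:2]  # the third block was reclaimed by mistake
```

With these inputs, `reclamation_is_correct(second, after_bad_reclaim)` is `False`, because removing a fingerprint changes the byte sum that Adler-32 accumulates.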

Corresponding to the data de-duplication method embodiments provided by the present invention, the present invention also provides a data de-duplication device.

Fig. 7 shows an embodiment of a data de-duplication device provided by the present invention. The device may specifically comprise:

A data blocking module 701, configured to divide the received initial data into blocks;

A fingerprint acquisition module 702, configured to obtain the fingerprint of each data block;

A first check value generation module 703, configured to generate a first check value of the initial data according to the fingerprints of all data blocks;

A de-duplication module 704, configured to search for duplicate data blocks in the initial data and delete the duplicate data blocks;

A unique block storage module 705, configured to store the unique blocks remaining after the duplicate data blocks are deleted;

A second check value generation module 706, configured to generate, according to the fingerprints of all unique blocks and using the same generation mode as the first check value, a second check value of the data after the current data de-duplication;

A check value comparison module 707, configured to compare the first check value with the second check value and, when the first check value is consistent with the second check value, determine that the initial data has undergone correct data de-duplication.
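The cooperation of modules 701 to 707 can be illustrated with a compact end-to-end sketch. It is only a sketch under stated assumptions: fixed-length blocking, SHA-1 fingerprints, a running CRC32 as the shared generation mode, an in-memory dictionary as the unique block store, and a second check value obtained by walking the stored blocks in the original block order. None of these choices is prescribed by the description.

```python
import hashlib
import zlib


def split_blocks(data: bytes, size: int):
    # Module 701: fixed-length blocking.
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(block: bytes) -> bytes:
    # Module 702. Assumption: SHA-1 as the fingerprint function.
    return hashlib.sha1(block).digest()


def check_value(fingerprints) -> int:
    # Generation mode shared by modules 703 and 706. Assumption: running CRC32.
    value = 0
    for fp in fingerprints:
        value = zlib.crc32(fp, value)
    return value


def deduplicate(data: bytes, size: int = 4):
    blocks = split_blocks(data, size)
    fps = [fingerprint(b) for b in blocks]
    first = check_value(fps)                      # module 703

    store = {}                                    # modules 704 and 705
    for fp, block in zip(fps, blocks):
        store.setdefault(fp, block)               # keep one copy per fingerprint

    # Module 706. Assumption: stored blocks are re-fingerprinted in the
    # original block order, which is kept here as the list `fps`.
    second = check_value(fingerprint(store[fp]) for fp in fps)
    return store, first == second                 # module 707


store, verified = deduplicate(b"abcd" * 4 + b"efgh")
```

Here five input blocks reduce to two stored unique blocks, and `verified` reports whether the two check values agree.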

In a specific application scenario, the data blocking module specifically performs fixed-length or variable-length blocking on the received initial data within data processing units of equal capacity, wherein the initial data is first divided into multiple data processing units of equal capacity before blocking.
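The two-stage division can be sketched as below for the fixed-length case. The unit capacity of 256 bytes and block size of 64 bytes are arbitrary example values, and `split_units` and `split_blocks` are hypothetical names.

```python
def split_units(data: bytes, unit_capacity: int):
    # Stage 1: divide the initial data into data processing units of equal capacity.
    return [data[i:i + unit_capacity] for i in range(0, len(data), unit_capacity)]


def split_blocks(unit: bytes, block_size: int):
    # Stage 2: fixed-length blocking inside one data processing unit.
    return [unit[i:i + block_size] for i in range(0, len(unit), block_size)]


data = bytes(range(256)) * 4                     # 1024 bytes of sample data
units = split_units(data, 256)
blocks_per_unit = [split_blocks(u, 64) for u in units]
```

When the data length is not a multiple of the unit capacity, the last unit in this sketch is simply shorter; variable-length blocking would replace `split_blocks` with a content-defined chunker.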

In the embodiment shown in Fig. 8, the first check value generation module 703 may specifically comprise:

A first sub-check value generation unit 801, configured to iteratively generate, according to the fingerprints of all data blocks in each data processing unit, the first sub-check value corresponding to each data processing unit;

A first check value determination unit 802, configured to generate the first check value of the initial data from all the first sub-check values in the same iterative manner.
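The two-level structure of units 801 and 802 can be sketched as follows, assuming a running CRC32 at both levels and a four-byte big-endian encoding of each sub-check value before it is folded again; these choices and all names are assumptions for illustration.

```python
import zlib


def fold(items) -> int:
    # Running CRC32 over byte strings, in the order given.
    value = 0
    for item in items:
        value = zlib.crc32(item, value)
    return value


def first_sub_check_values(units_fingerprints):
    # Unit 801: one first sub-check value per data processing unit.
    return [fold(fps) for fps in units_fingerprints]


def first_check_value(sub_values) -> int:
    # Unit 802: the same iterative manner applied to the sub-check values.
    return fold(v.to_bytes(4, "big") for v in sub_values)


units = [[b"\x01" * 20, b"\x02" * 20], [b"\x03" * 20]]  # made-up fingerprints
subs = first_sub_check_values(units)
first = first_check_value(subs)
```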

In the embodiment shown in Fig. 9, the unique block storage module 705 may specifically comprise:

A first container storage unit 901, configured to store in a container the unique blocks, remaining after the duplicate data blocks are deleted, that correspond to each data processing unit;

A first fingerprint storage unit 902, configured to store in the container the fingerprint of each unique block held in it.

The embodiment shown in Fig. 10 is another implementation of the unique block storage module 705, in which the unique block storage module 705 may specifically comprise:

A compression unit 1001, configured to compress the unique blocks, remaining after the duplicate data blocks are deleted, that correspond to each data processing unit;

A second container storage unit 1002, configured to store the compressed unique blocks in the container corresponding to each data processing unit;

A second fingerprint storage unit 1003, configured to store in the container the fingerprint of each compressed unique block held in it.

In the embodiment shown in Fig. 11, the second check value generation module 706 may specifically comprise:

A second sub-check value generation unit 1101, configured to iteratively generate, according to the fingerprints of all unique blocks in each container, the second sub-check value corresponding to each container;

A second check value determination unit 1102, configured to generate, from all the second sub-check values in the same iterative manner, the second check value of the data after the current data de-duplication.

In the embodiment shown in Fig. 12, the data de-duplication device may further comprise:

An inspection module 708, configured to, during a container-level inspection, iteratively generate the third sub-check value currently corresponding to each container according to the fingerprints of all unique blocks in the container at inspection time; compare the second sub-check value and the third sub-check value of each container; and, when the second sub-check value is inconsistent with the third sub-check value, determine that the data in the corresponding container is damaged.

In addition, on the basis of Fig. 7 or Fig. 12, as shown in Fig. 13 and Fig. 14, the data de-duplication device may further comprise:

A space reclamation check module 709, configured to, during a space reclamation check, iteratively generate the fourth sub-check value currently corresponding to each container according to the fingerprints of all unique blocks in the container at check time; compare the second sub-check value and the fourth sub-check value of each container; and, when the second sub-check value is inconsistent with the fourth sub-check value, determine that a data block in the corresponding container space has been reclaimed by mistake.

Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled persons may implement the described functions in different ways for each particular application, but such an implementation should not be considered to go beyond the scope of the present invention.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the system, device and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed system, device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be realized through certain interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical or of another form.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) or a processor to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A data de-duplication method, characterized by comprising:

dividing the received initial data into blocks;

obtaining the fingerprint of each data block;

storing the unique blocks remaining after the duplicate data blocks are deleted;

2. The method according to claim 1, characterized in that dividing the received initial data into blocks comprises:

3. The method according to claim 2, characterized in that generating the first check value of the initial data according to the fingerprints of all the data blocks comprises:

4. The method according to claim 3, characterized in that storing the unique blocks remaining after the duplicate data blocks are deleted comprises:

5. The method according to claim 3, characterized in that storing the unique blocks remaining after the duplicate data blocks are deleted comprises:

6. The method according to claim 4 or 5, characterized in that generating, according to the fingerprints of all the unique blocks and using the same generation mode as the first check value, the second check value of the data after the current data de-duplication comprises:

7. The method according to claim 6, characterized by further comprising:

8. The method according to claim 6 or 7, characterized by further comprising:

9. The method according to claim 3, characterized in that iteratively generating, according to the fingerprints of all data blocks in each data processing unit, the first sub-check value corresponding to each data processing unit comprises:

10. The method according to claim 9, characterized in that iteratively generating, according to the fingerprints of all unique blocks in each container, the second sub-check value corresponding to each container comprises:

11. A data de-duplication device, characterized by comprising:

12. The device according to claim 11, characterized in that the data blocking module specifically performs fixed-length or variable-length blocking on the received initial data within data processing units of equal capacity, wherein the initial data is first divided into multiple data processing units of equal capacity before blocking.

13. The device according to claim 12, characterized in that the first check value generation module comprises:

14. The device according to claim 13, characterized in that the unique block storage module comprises:

15. The device according to claim 13, characterized in that the unique block storage module comprises:

16. The device according to claim 14 or 15, characterized in that the second check value generation module comprises:

17. The device according to claim 16, characterized by further comprising:

18. The device according to claim 16 or 17, characterized by further comprising: