CN102629247B

CN102629247B - Method, device and system for data processing

Info

Publication number: CN102629247B
Application number: CN201210034149.2A
Authority: CN
Inventors: 曹宇
Original assignee: Huawei Symantec Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2011-12-31
Filing date: 2012-02-15
Publication date: 2014-09-17
Anticipated expiration: 2032-02-15
Also published as: CN102629247A

Abstract

The invention relates to the technical field of data processing, and in particular to a method, a device and a system for data processing. The method includes: providing a first interface and receiving, through the first interface, a file write operation request from an application program; partitioning the data of the file into at least one sub-block, obtaining a unique identifier of each sub-block, and comparing each unique identifier with the saved unique identifiers; if a sub-block's unique identifier is identical to a saved unique identifier, establishing a link between that sub-block and the data corresponding to the saved unique identifier; otherwise, saving the unique identifier of the sub-block and sending, through a second interface, a data write operation request containing the data of that sub-block to a storage system, so that the storage system stores the data of the sub-block. The method, device and system reduce duplicate data and save storage space.

Description

Data processing method, device and system

Technical field

The present invention relates to the technical field of data processing, and in particular to a data processing method, device and system.

Background technology

Data deduplication technology, proposed at the beginning of this century, was quickly adopted by storage vendors and has become an important feature of storage technology. Data deduplication means that identical data is stored only once in order to save space. It can effectively reduce the amount of data and lower storage costs.

Existing data deduplication modules are generally integrated inside the storage system. The usual process is to slice the data, calculate the hash value of each slice, generate a logical address from the hash value, and map the logical address to a physical address for storage. This approach suits newly developed storage systems but not older ones, whose architecture would need extensive modification to support data deduplication.

The prior art also includes a scheme that adds data deduplication to a general-purpose file system. This scheme maintains records of duplicate data in an additional database. When a file is written to the file system, no processing is applied to it and all of its data is saved. After the data has been saved, a separate process scans the file system and, if it finds that a file has not been accessed for a period of time, performs data deduplication on that file. Because this operation runs after the file has been saved, it is called post-process deduplication. During post-process deduplication, the deduplication process first reads the file, cuts it into blocks of unequal length according to certain rules, saves each block in the same way as an ordinary file, and records the file's split-point information in the database. A flag bit is then added to the file metadata to indicate that the file has been deduplicated. When the file is read later, the disk is not read directly as for an ordinary file; instead, the locations of the file's slices are looked up in the database and the data is then read out.

In the course of implementing the present invention, the inventor found at least the following problems in the prior art: the prior-art scheme supports only post-process deduplication and cannot deduplicate in real time. On first storage, the file system must save all of the data, so duplicate data is saved in multiple copies. When the deduplication process runs, the file must be read back out for the deduplication operation. Duplicate data is therefore read and written repeatedly, which consumes a large amount of system resources and wastes a large amount of storage space.

Summary of the invention

To solve the above technical problems, embodiments of the present invention provide a data processing method and device that can conveniently add data deduplication to a general-purpose file system at low cost and that support real-time deduplication, effectively saving storage space.

In one aspect, an embodiment of the present invention provides a data processing method applied to a data processing device, the method comprising:

providing a first interface, and receiving a file write operation request from an application program through the first interface;

cutting the data of the file into at least one sub-block, obtaining a unique identifier of each sub-block, and comparing each unique identifier with the saved unique identifiers; if they are identical, establishing a link between the sub-block whose unique identifier matches a saved unique identifier and the data corresponding to that saved unique identifier;

if they are different, saving the unique identifier of the sub-block that differs from the saved unique identifiers, and sending a data write operation request to a storage system through a second interface, the data write operation request containing the data of that sub-block, so that the storage system saves the data of the sub-block.

In another aspect, an embodiment of the present invention provides a data processing device, the device comprising:

a first interface, configured to receive a file write operation request from an application program, wherein the first interface and a second interface are interfaces of the same type, and the second interface is connected to a storage system;

a metadata management module, configured to store the metadata and the file block information of files;

a data deduplication module, configured to cut the data of a file received through the first interface into at least one sub-block, obtain a unique identifier of each sub-block, and compare each unique identifier with the unique identifiers saved in the metadata management module; if they are identical, establish a link between the sub-block whose unique identifier matches a saved unique identifier and the data corresponding to the unique identifier saved in the metadata management module; if they are different, save the unique identifier of the sub-block that differs from the saved unique identifiers, and send a data write operation request to the storage system through the second interface, the data write operation request containing the data of that sub-block, so that the storage system saves the data of the sub-block.

In yet another aspect, an embodiment of the present invention provides a data processing system, the system comprising:

a data processing device and a storage system, wherein the data processing device has a first interface and receives data operation requests from an application program through the first interface; the storage system has a second interface, the second interface and the first interface being interfaces of the same type, and the storage system interacts with the data processing device through the second interface, wherein:

the data processing device is configured to receive a file write operation request from the application program; cut the data of the file into at least one sub-block, obtain a unique identifier of each sub-block, and compare each unique identifier with the unique identifiers saved in the metadata management module; if they are identical, establish a link between the sub-block whose unique identifier matches a saved unique identifier and the data corresponding to the unique identifier saved in the metadata management module; if they are different, save the unique identifier of the sub-block that differs from the saved unique identifiers, and send a data write operation request to the storage system through the second interface, the data write operation request containing the data of that sub-block, so that the storage system saves the data of the sub-block that differs from the saved unique identifiers;

the storage system is configured to receive the data write operation request from the data processing device and save the data.

The beneficial effects of the embodiments of the present invention are as follows. A general-purpose first interface is provided for application programs and receives their file write operation requests. If comparison of unique identifiers shows that a sub-block of the file is duplicate data, the data processing device performs data deduplication; if the comparison shows that the unique identifier of a sub-block differs from the existing unique identifiers, the data of the sub-block is sent to the storage system for storage. Because data deduplication is performed when the file is first written, duplicate data is effectively reduced and storage space is saved. In addition, the first interface provided for all application programs is identical to the second interface that the original storage system offers to application programs, so that an ordinary file system, that is, the storage system, can support deduplication without any change to its interface. An ordinary file system thus gains deduplication at low cost.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. The accompanying drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.

Fig. 1 is a flowchart of a first embodiment of the data processing method provided by an embodiment of the present invention;

Fig. 2 is a schematic diagram of the duplicate data processing flow according to an embodiment of the present invention;

Fig. 3 is a flowchart of a second embodiment of the data processing method provided by an embodiment of the present invention;

Fig. 4 is a schematic diagram of the data organization method according to an embodiment of the present invention;

Fig. 5a to Fig. 5d are schematic diagrams of the second embodiment of the data processing method provided by an embodiment of the present invention;

Fig. 6 is a flowchart of a third embodiment of the data processing method provided by the present invention;

Fig. 7a to Fig. 7f are schematic diagrams of the third embodiment of the data processing method provided by the present invention;

Fig. 8 is a flowchart of a fourth embodiment of the data processing method provided by the present invention;

Fig. 9 is a schematic diagram of the data processing system provided by an embodiment of the present invention;

Fig. 10 is a schematic diagram of the data processing device provided by an embodiment of the present invention.

Embodiment

Embodiments of the present invention provide a data processing method and device that can conveniently add data deduplication to a general-purpose file system at low cost and that support real-time deduplication, effectively saving storage space.

To help a person skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.

Fig. 1 is a flowchart of the first embodiment of the data processing method provided by an embodiment of the present invention. The method comprises:

S101: Provide a first interface, and receive a file write operation request from an application program through the first interface.

The method provided by the embodiment of the present invention is applied to a data processing device that is connected to an original storage system. The data processing device has a first interface and the storage system has a second interface, the first interface and the second interface being general-purpose interfaces of the same type.

In an original storage system that does not support data deduplication, the application program accesses the original storage system through the second interface (for example, a POSIX interface). The original storage system is an original storage device or file system.

The first interface of the data processing device receives operation requests from the application program and is of the same type as the interface that the original storage system offers to application programs. From the application program's point of view, it sees a device that supports data deduplication; from the original storage system's point of view, file I/O access still takes place through the second interface. Specifically, the first interface and the second interface may be general-purpose standard interfaces, for example POSIX interfaces.
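The interface arrangement described above can be illustrated with a minimal Python sketch. It assumes a hypothetical two-method `FileStore` interface in place of a real POSIX API; the class names `LegacyStore` and `DedupFrontEnd` are likewise illustrative and do not come from the patent. The point shown is only that the application code runs unchanged against either object.

```python
from typing import Protocol


class FileStore(Protocol):
    """Hypothetical stand-in for the shared interface type (e.g. a POSIX-like API)."""

    def write(self, path: str, data: bytes) -> None: ...

    def read(self, path: str) -> bytes: ...


class LegacyStore:
    """Plays the role of the original storage system (second interface)."""

    def __init__(self):
        self.files = {}

    def write(self, path: str, data: bytes) -> None:
        self.files[path] = data

    def read(self, path: str) -> bytes:
        return self.files[path]


class DedupFrontEnd:
    """Plays the role of the data processing device (first interface)."""

    def __init__(self, backend: FileStore):
        self.backend = backend

    def write(self, path: str, data: bytes) -> None:
        # Deduplication would take place here; this sketch only forwards the call.
        self.backend.write(path, data)

    def read(self, path: str) -> bytes:
        return self.backend.read(path)


def application(store: FileStore) -> bytes:
    """Application code that knows only the shared interface type."""
    store.write("/a.txt", b"hello")
    return store.read("/a.txt")
```

Calling `application(LegacyStore())` and `application(DedupFrontEnd(LegacyStore()))` exercises the same code path from the application's side, which is the property the embodiment relies on.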

S102: Cut the data of the file into at least one sub-block, obtain a unique identifier of each sub-block, and compare each unique identifier with the saved unique identifiers. If they are identical, establish a link between the sub-block whose unique identifier matches a saved unique identifier and the data corresponding to that saved unique identifier. If they are different, save the unique identifier of the sub-block that differs from the saved unique identifiers, and send a data write operation request to the storage system through the second interface, the data write operation request containing the data of that sub-block, so that the storage system saves the data of the sub-block.

In the embodiment of the present invention, the unique identifier of a sub-block may be the hash value of the sub-block or an identifier calculated with another algorithm.

Fig. 2 is a schematic diagram of the duplicate data processing flow according to an embodiment of the present invention.

Specifically, step S102 is implemented through the following steps:

S201: Cut the data of the file into at least one sub-block.

In the embodiment of the present invention, the data of the file may be cut into one or more data blocks using fixed-length or variable-length slicing. Note that when the file is empty, that is, contains no data, only the metadata of the file is saved and no data cutting is required.
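Fixed-length slicing, one of the two options named in S201, might look like the following Python sketch. The function name `split_fixed` and the default block size of 4096 bytes are assumptions made for illustration; the patent does not specify a block size.

```python
def split_fixed(data: bytes, block_size: int = 4096) -> list:
    """Cut data into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # An empty input yields an empty list, so no sub-block is produced for an empty file.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]
```

The last block may be shorter than `block_size`. Variable-length slicing would replace the fixed stride with boundaries chosen from the content, which this sketch does not attempt.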

S202: Obtain the hash value of each sub-block, compare each hash value with the saved hash values, and determine whether an identical hash value exists. If so, go to step S203; if not, go to step S204.

Specifically, in the embodiment of the present invention, when a new file is created, the address of the file metadata is first obtained through a hash calculation. For example, after a file named a.txt is created, a hash calculation with one possible algorithm yields the file metadata address c03b9d52f4052f6886f6d1a79701122d (the metadata address value depends on the specific hash algorithm, and the present invention does not limit the hash algorithm). What is recorded at this address is the addresses of the data blocks. The metadata of one file may be kept in one file, or the metadata of multiple files may be kept in one file or in a database; the present invention does not limit the storage format of the metadata. In the embodiment of the present invention, the metadata management module records the metadata information of the file, which includes the file name, file size, file type, creation time, modification time, and so on. After the data of the file is cut into one or more sub-blocks, the hash value of each data block is calculated; the hash value reflects the mapping between the data and the address at which the data is stored. The hash algorithm may be MD4, MD5 or SHA-1, and the present invention does not limit this. The hash value of each sub-block is recorded in the metadata information. For example, if the data contained in a.txt is split into two sub-blocks whose hash addresses are 659df9a166c3cca249ecfa481d074a9b and 50f74ff709ed8b9547d21f130f7eefd2, these two hash values are saved as content in the block indicated by c03b9d52f4052f6886f6d1a79701122d, and the data blocks identified by 659df9a166c3cca249ecfa481d074a9b and 50f74ff709ed8b9547d21f130f7eefd2 hold the actual data. In the embodiment of the present invention, the metadata management module stores the metadata information and the file block information of the file, the metadata information including the hash value of each sub-block.
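The relationship between a metadata block and its sub-blocks can be sketched in Python as follows. MD5 is used because the text lists it as one permissible algorithm. Several details are assumptions made for illustration: the metadata is serialized as JSON, the metadata address is taken to be the MD5 of that serialized content, and the helper names `md5_hex` and `build_blocks` are hypothetical. The sketch does not attempt to reproduce the specific hash strings quoted in the text.

```python
import hashlib
import json


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as 32 hexadecimal characters."""
    return hashlib.md5(data).hexdigest()


def build_blocks(name: str, chunks: list) -> tuple:
    """Return (metadata_address, blocks) for a file already cut into chunks.

    blocks maps each address to the bytes stored at that address.
    """
    blocks = {}
    chunk_addresses = []
    for chunk in chunks:
        address = md5_hex(chunk)          # address derived from the chunk content
        blocks[address] = chunk
        chunk_addresses.append(address)
    # Assumed layout: the metadata block records the file name, the size and
    # the ordered list of sub-block hash values.
    metadata = json.dumps(
        {"name": name, "size": sum(len(c) for c in chunks), "blocks": chunk_addresses},
        sort_keys=True,
    ).encode()
    metadata_address = md5_hex(metadata)
    blocks[metadata_address] = metadata
    return metadata_address, blocks
```

Reading the file back then amounts to loading the metadata block, following each recorded hash value, and concatenating the data found there.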

The hash value of each data block is compared with the hash values already held by the metadata management module. If an identical value exists, go to step S203; otherwise, go to step S204.

S203: Establish a link between the sub-block whose hash value matches a saved hash value and the data corresponding to that saved hash value.

If the hash value of a sub-block is identical to a hash value saved by the metadata management module, the data of the sub-block already exists and is duplicate data. The data of the sub-block is therefore not saved; instead, a link is established directly between the sub-block and the data corresponding to the saved hash value. In the embodiment of the present invention, the link may be established with a pointer; specifically, the pointer of the data block is pointed at the data corresponding to the saved hash value. Other ways of establishing the link may also be used, and the present invention does not limit this. At the same time, the reference count of the data is incremented by 1. In the embodiment of the present invention, the reference count indicates the number of times a sub-block is referenced by higher-level data blocks. When the reference count of the data reaches zero, the data is deleted.
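The reference-counting rule in S203 can be sketched in Python as below. The class `RefCountedBlocks` and its `link` and `unlink` methods are hypothetical names; the link itself is modelled simply as a shared dictionary key rather than a pointer, which is one of the "other ways" the text allows.

```python
class RefCountedBlocks:
    """Blocks keyed by hash value, each with a reference count."""

    def __init__(self):
        self.data = {}   # hash value -> block content
        self.refs = {}   # hash value -> number of referencing higher-level blocks

    def link(self, address, payload=None):
        """Reference the block at address, storing payload only if it is new."""
        if address in self.refs:
            self.refs[address] += 1          # duplicate: count the link, store nothing
        else:
            if payload is None:
                raise KeyError(address)      # nothing saved under this hash value yet
            self.data[address] = payload
            self.refs[address] = 1

    def unlink(self, address):
        """Drop one reference and delete the block when none remain."""
        self.refs[address] -= 1
        if self.refs[address] == 0:
            del self.refs[address]
            del self.data[address]
```

A second `link` call for the same address leaves the stored content untouched and raises the count to 2; the block disappears only after two matching `unlink` calls.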

S204: Save the unique identifier of the sub-block whose hash value differs from the saved hash values, and send a data write operation request to the storage system through the second interface, the request containing the data of that sub-block, so that the storage system saves the data of the sub-block.

If the hash value of a sub-block differs from the existing hash values, the data of that sub-block is not duplicate data. The hash value of the data block is saved to the metadata management module and used as the logical address of the data block. The logical address is then converted to a physical address, and the data of the data block is stored in the storage system at that physical address through the second interface.
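Steps S201 to S204 can be put together in a short Python sketch. It is a simplification under stated assumptions: `BackendStore` and `DedupDevice` are hypothetical names, SHA-1 is chosen from the algorithms the text permits, slicing is fixed-length, the hash value is used directly as the storage key (the logical-to-physical address conversion is omitted), and overwriting an existing file name is not handled.

```python
import hashlib


class BackendStore:
    """Hypothetical storage system reached through the second interface."""

    def __init__(self):
        self.blocks = {}
        self.write_requests = 0

    def write_block(self, address, data):
        self.write_requests += 1
        self.blocks[address] = data


class DedupDevice:
    """Hypothetical data processing device performing inline deduplication."""

    def __init__(self, backend, block_size=4):
        self.backend = backend
        self.block_size = block_size
        self.ref_count = {}   # saved hash value -> reference count
        self.files = {}       # file name -> ordered list of sub-block hash values

    def write_file(self, name, data):
        hashes = []
        for i in range(0, len(data), self.block_size):       # S201: cut into sub-blocks
            chunk = data[i:i + self.block_size]
            key = hashlib.sha1(chunk).hexdigest()             # S202: hash and compare
            if key in self.ref_count:
                self.ref_count[key] += 1                      # S203: link to saved data
            else:
                self.ref_count[key] = 1                       # S204: save hash value,
                self.backend.write_block(key, chunk)          #       then write the data
            hashes.append(key)
        self.files[name] = hashes

    def read_file(self, name):
        return b"".join(self.backend.blocks[key] for key in self.files[name])


backend = BackendStore()
device = DedupDevice(backend, block_size=4)
device.write_file("a", b"AAAABBBBAAAA")   # three sub-blocks, two of them identical
device.write_file("b", b"BBBB")           # content already known to the device
```

In this run the storage system receives a write request only for the first occurrence of each distinct sub-block, which is the real-time behaviour the first embodiment describes.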

In the first embodiment provided by the present invention, a general-purpose first interface is provided for application programs and receives their file write operation requests. If hash value comparison shows that a sub-block of the file is duplicate data, the data processing device performs data deduplication; if the comparison shows that the hash value of a sub-block differs from the hash values saved in the metadata management module, the data of the sub-block is sent to the storage system for storage. Because data deduplication is performed when the file is first written, duplicate data is effectively reduced and storage space is saved. In addition, the first interface provided for all application programs is identical to the interface that the original storage system offers to application programs, so that an ordinary file system, that is, the storage system, can support deduplication without any change to its interface. An ordinary file system thus gains deduplication at low cost.

Fig. 3 is a flowchart of the second embodiment of the data processing method provided by an embodiment of the present invention.

The method provided by the present invention is described in detail below through a concrete example.

S301: Define first-type data blocks and second-type data blocks.

Data blocks are first divided into first-type data blocks and second-type data blocks. A first-type data block may also be called a traditional address block: its address is independent of its content and is allocated directly in the traditional way. A first-type data block, that is, a traditional address block, stores the address information of the storage system (file system) and the index of the second-type data blocks, and the second-type data blocks can be found through that index. Because the first-type data block records the information of the storage system (file system), the indexes of the other metadata and data blocks can be found from it. A second-type data block may also be called a content address block: its address is determined by the hash value of its content, so data blocks with different content almost never share a hash value and therefore have different addresses. Second-type data blocks generally include metadata blocks and sub-blocks.

The device provided by the present invention can be regarded as a file system made up of these traditional address blocks and content address blocks. Suppose that in a file system the root directory contains sub-directory 1 and sub-directory 2. Sub-directory 1 holds two files, file 1 and file 2: file 1 is an empty file and file 2 contains 1 KB of data. Sub-directory 2 is an empty directory. Referring to Fig. 4, the data block representing the root directory is a first-type data block, from which the indexes of the other metadata blocks and sub-blocks can be obtained. Metadata block 1, metadata block 2, metadata block 3 and the data block (Data Block) are second-type data blocks. In this file system, the metadata of sub-directory 1 is stored in metadata block 1 and the metadata of sub-directory 2 is stored in metadata block 2. Because sub-directory 1 holds two files, the addresses of file 1 and file 2 are obtained by hash calculation and kept in metadata block 1. As shown in Fig. 4, metadata block 1 holds two pieces of address information: address 1 represents the address of file 2 and address 2 represents the address of file 1. This is only an example; address 1 could equally represent file 1 and address 2 file 2. These two addresses lead to file 1 and file 2 and to their metadata information. The address corresponding to the data in file 2 is then obtained by hash calculation and kept in metadata block 3, which corresponds to file 2. Referring to Fig. 4, because file 2 contains 1 KB of data, metadata block 3, to which address 1 points, holds address 3, the address corresponding to the data of file 2, and address 3 points to the data of file 2. Because file 1 has no data, address 2 points only to metadata block 3, and metadata block 3 is not linked to a concrete data block. As the figure shows, in the method provided by the embodiment of the present invention, metadata and data are generally treated as equals: both are attached to the file system tree at addresses calculated from hash values, and both are second-type data blocks. Each second-type data block carries a reference count that identifies how many higher-level data blocks reference it.
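A small Python sketch may help picture the two kinds of block in the directory example. It is only one reading of the description: the root is stored under a fixed key `"FR"` to stand for a traditional address, every other block is stored under the MD5 of its content, metadata is serialized as JSON, and the empty file 1 is given its own metadata block with no children. All function and variable names are hypothetical.

```python
import hashlib
import json

ROOT_ADDRESS = "FR"   # traditional address: fixed, unrelated to the block content
blocks = {}           # address -> raw bytes of the block


def put_content_block(payload: bytes) -> str:
    """Store a content address block under the hash value of its content."""
    address = hashlib.md5(payload).hexdigest()
    blocks[address] = payload
    return address


def put_metadata_block(name: str, children: dict) -> str:
    """Store a metadata block that records the addresses of its children."""
    payload = json.dumps({"name": name, "children": children}, sort_keys=True)
    return put_content_block(payload.encode())


# Build the tree from the leaves upward, since each address depends on content.
file2_data = put_content_block(b"x" * 1024)                    # 1 KB of data
file2_meta = put_metadata_block("file2", {"data": file2_data})
file1_meta = put_metadata_block("file1", {})                   # empty file
subdir1 = put_metadata_block("subdir1", {"file1": file1_meta, "file2": file2_meta})
subdir2 = put_metadata_block("subdir2", {})                    # empty directory
blocks[ROOT_ADDRESS] = json.dumps({"subdir1": subdir1, "subdir2": subdir2}).encode()


def resolve(*names: str) -> str:
    """Follow names from the root and return the address reached."""
    address = json.loads(blocks[ROOT_ADDRESS])[names[0]]
    for name in names[1:]:
        address = json.loads(blocks[address])["children"][name]
    return address
```

For example, `resolve("subdir1", "file2", "data")` walks from the root through two metadata blocks to the data block of file 2, while `resolve("subdir1", "file1")` ends at a metadata block with no children.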

S302: Receive a request from the application program to create a first file.

S303: Save the metadata information of the first file.

Fig. 5a to Fig. 5d are schematic diagrams of the data processing method according to the second embodiment of the present invention.

FR (Filesystem Root) represents the file system entry and is a first-type data block. After the request to create the first file is received, the hash value of the first file is calculated and used as the address of the first file's metadata. The metadata may include information such as the file name, file size, file type and file creation time. In Fig. 5a, FR represents the file system entry, MB1 represents the metadata block of the first file, and the number on each block is its reference count. The reference count of MB1 is 1, indicating that it is referenced once by a higher-level data block. A link is established between the file system entry FR and the metadata block MB1.

S304: Cut the first file into at least one sub-block, obtain the hash value of each sub-block, and compare each hash value with the saved hash values. If they are identical, establish a link between the sub-block and the data corresponding to the saved hash value; if they are different, save the data of the sub-block to the storage system.

Suppose the first file is cut into one sub-block and the comparison shows that its hash value differs from the saved hash values. The hash value of the sub-block and the data of the sub-block are then saved. As shown in Fig. 5b, a data block DB1 is created to hold the data of the sub-block, and a link is established between the metadata block and the sub-block, that is, between MB1 and DB1. The data of the first file's sub-block can then be found from the metadata block MB1.

S305: Receive a request to create a second file.

The file name of the second file differs from that of the first file, and the content of the second file is identical to that of the first file.

S306: Save the metadata information of the second file.

In Fig. 5c, MB2 represents the metadata block of the second file, and the number on each block is its reference count.

S307: Cut the second file into at least one sub-block, obtain the hash value of each sub-block, and compare each hash value with the saved hash values. If they are identical, establish a link between the sub-block and the data corresponding to the saved hash value; if they are different, save the data of the sub-block to the storage system.

Suppose the second file is cut into one sub-block. The hash value of the sub-block is obtained and looked up in the index under FR, where an identical hash value is found, so a link is established between the sub-block and the data corresponding to that hash value. As shown in Fig. 5d, MB2 is created to store the metadata information of the second file. Because the data of the second file is identical to the data of the first file, the hash calculation finds an existing identical hash value. The data of the second file is therefore not stored; only the link between MB2 and DB1 is established, and the reference count of DB1 is incremented by 1, so that the reference count of DB1 is now 2.
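The sequence of Fig. 5a to Fig. 5d can be traced with a minimal Python sketch. The class `TinyFS` is hypothetical, each file is treated as a single sub-block as in the example, SHA-1 stands in for the unspecified hash algorithm, and the metadata content is reduced to the file name plus the data address, which is an assumption made so that two files with different names get different metadata blocks.

```python
import hashlib


def h(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


class TinyFS:
    def __init__(self):
        self.root = {}     # FR: file name -> metadata block address
        self.blocks = {}   # address -> block content
        self.refs = {}     # address -> reference count

    def _link(self, address, content):
        if address in self.refs:
            self.refs[address] += 1       # existing block: only add a reference
        else:
            self.blocks[address] = content
            self.refs[address] = 1

    def create(self, name, data):
        """Create a file holding one sub-block; return (metadata, data) addresses."""
        data_address = h(data)
        self._link(data_address, data)                 # DB: stored once per content
        metadata = f"{name}:{data_address}".encode()   # assumed metadata layout
        metadata_address = h(metadata)
        self._link(metadata_address, metadata)         # MB: one per file
        self.root[name] = metadata_address             # link from FR to MB
        return metadata_address, data_address


fs = TinyFS()
mb1, db1 = fs.create("first", b"same content")
mb2, db2 = fs.create("second", b"same content")
```

After both calls there are two metadata blocks, each referenced once, and a single data block referenced twice, matching the counts described for Fig. 5d.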

Fig. 6 is a flowchart of the third embodiment of the data processing method provided by the present invention.

Snapshots are an increasingly important feature of storage systems. There are many ways to implement snapshots, but all of them generally have to solve the problems of speed and data consistency, and some even conflict with other functions; prior-art methods cannot support deduplication and snapshots at the same time. The method provided by the embodiment of the present invention allows snapshots and deduplication to be used together, and achieves fast, consistent snapshots in a very simple way. In the embodiment of the present invention, snapshots remain applicable when file data is modified, deleted, created or added. The following uses deleting and adding file data as examples to show that snapshots and deduplication can be used together.

S601: Define first-type data blocks and second-type data blocks.

S602: Copy the first-type data block and the links between the first-type data block and the second-type data blocks to create a snapshot of the first-type data block.

Fig. 7a to Fig. 7f are schematic diagrams of the third embodiment of the present invention.

The snapshot implementation is described below using the file system shown in Fig. 7a as an example. The first-type data block (FR in Fig. 7a) is copied, the copy is named copy and serves as the snapshot, and the links between the first-type data block and its next-level sub-blocks are copied as well. As shown in Fig. 7b, copy is the snapshot of FR, and the reference count of each data block referenced by copy is incremented by 1, so the reference counts of MB1 and MB2 both become 2.
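The snapshot step can be sketched in a few lines of Python. The root is modelled as a dictionary from file name to metadata block address, and the block labels `"MB1"`, `"MB2"` and `"DB1"` are used as stand-in addresses; the function name `snapshot` is hypothetical.

```python
def snapshot(root: dict, refs: dict) -> dict:
    """Copy the root and its links, counting one extra reference per linked block."""
    copy = dict(root)                 # shallow copy: no file data is duplicated
    for address in copy.values():
        refs[address] += 1            # each directly linked block gains a referrer
    return copy


fr = {"first": "MB1", "second": "MB2"}
refs = {"MB1": 1, "MB2": 1, "DB1": 2}
snap = snapshot(fr, refs)
```

Only the blocks linked directly from the root are touched, so the cost of a snapshot depends on the size of the root and not on the amount of file data, and the count of DB1 stays at 2.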

S603, receive a file deletion request from the application program.

A file deletion operation is used below to show that the above snapshot succeeded. This is only one concrete example of the method of the invention; the snapshot remains applicable to any modification or deletion of a file.

S604, obtain the block information of the file from the saved metadata information of the file, and disconnect the link between the block and the first-type data block according to the block information.

Specifically, suppose the file to be deleted is the second file. From the metadata information of the second file, its block data is found to be the data block indicated by MB2. The link between the first-type data block (FR) and the sub-block MB2, shown as a dotted line in Fig. 7c, is then deleted; the result is shown in Fig. 7d. The second file can still be accessed from the snapshot copy, but the reference count of its metadata block MB2 is decremented by 1. The metadata block MB2 and the data block DB1 corresponding to the second file are not deleted at this point, because they are still referenced by other data blocks. A data block is deleted when its reference count reaches zero.
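The deletion of Fig. 7c and Fig. 7d may be sketched as follows. The function names release and delete_file and the dictionary layout are hypothetical. The sketch also assumes that when a block's count reaches zero its own children are released in turn; the description states only that a data block is deleted when its reference count is zero.

```python
def release(blocks, block_id):
    """Drop one reference; delete the block once nothing refers to it."""
    block = blocks[block_id]
    block["refs"] -= 1
    if block["refs"] == 0:
        # Assumption: a deleted block also gives up its links to its children.
        for child in block["children"]:
            release(blocks, child)
        del blocks[block_id]


def delete_file(root_links, blocks, meta_id):
    # Disconnect the link between the first-type data block and the metadata block.
    root_links.remove(meta_id)
    release(blocks, meta_id)


# State after the snapshot (Fig. 7b).
blocks = {
    "MB1": {"refs": 2, "children": ["DB1"]},
    "MB2": {"refs": 2, "children": ["DB1"]},
    "DB1": {"refs": 2, "children": []},
}
fr_links = ["MB1", "MB2"]
copy_links = ["MB1", "MB2"]

delete_file(fr_links, blocks, "MB2")  # delete the second file from FR
```

After this call MB2 is no longer linked from FR but still exists with a count of 1, so the snapshot copy can still reach the second file.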

S605, receive a request to create a third file.

The metadata information of the third file is saved. As shown in Fig. 7e, a metadata block MB3 is added with a reference count of 1.

S606, split the third file into at least one sub-block, obtain the hash value of each sub-block, and compare it with the saved hash values; if they are identical, establish a link between the sub-block and the data corresponding to the saved hash value; if they differ, save the data of the sub-block to the storage system.

In this embodiment, suppose the third file is split into two sub-blocks and the hash value of each is obtained. Comparing the hash value of the first data block with the existing hash values shows that an identical hash value already exists, so a link is established between this sub-block and the data corresponding to that hash value. As shown in Fig. 7f, the content of the first data block is identical to the content of the two earlier data blocks, so a link between MB3 and DB1 is established and the reference count of DB1 is incremented by 1. The hash value of the second data block differs from the existing hash values, so a new data block DB2 is created and the data of that block is saved.

As the above embodiment shows, the method provided by the present invention supports snapshots and deduplication at the same time, and is simple to implement.

The prior-art method relies on a database to record metadata, so when a deduplicated file is modified again, the modified part can no longer be deduplicated. After one deduplication pass a file becomes a number of file fragments, and the specific addresses of these fragments are recorded in the database. At that point each fragment may fit the deduplication relationship well, but when one of the fragments is modified, that scheme does not assign the fragment a new address or re-fragment it, so the deduplication effect of the whole system declines once deduplicated files are modified. To solve this problem, the method provided by the embodiments of the present invention still splits the entire file after it is modified and performs deduplication on it, so its deduplication effect is better than that of the prior art.

Referring to Fig. 8, it is a schematic diagram of the fourth embodiment of the data processing method provided by the present invention.

S801, receive a file modification request from the application program.

S802, access the storage system according to the saved metadata information of the file to obtain the file.

The storage system is accessed according to the metadata information saved when the file was first created, and the file is obtained.

S803, receive a file write request from the application program.

After the application program has modified the file, a request to write the modified file is received.

S804, split the data of the modified file into at least one sub-block, obtain the hash value of each sub-block, and compare it with the saved hash values; if they are identical, establish a link between the sub-block and the data corresponding to the saved hash value; if they differ, save the data of the sub-block to the storage system.

In the embodiment provided by the present invention, the whole file is still split after it is modified. If a sub-block of the modified file is duplicate data, that sub-block is not saved; if the hash value of a sub-block of the modified file differs from the existing hash values, the sub-block is new data and its data is saved.

In this embodiment, a modified file is split again and its hash values are computed and compared to determine whether duplicate data exists, so a better deduplication effect is obtained.
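The effect of re-splitting a modified file may be illustrated with the following sketch. Fixed-size splitting, the 4-byte CHUNK_SIZE, SHA-256, and the function names split and write_file are all assumptions chosen to keep the example small; the description does not prescribe a splitting strategy or a hash function.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical, tiny on purpose


def split(data, size=CHUNK_SIZE):
    """Split data into fixed-size sub-blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def write_file(data, stored):
    """Deduplicate a whole file; return (hash list, number of new sub-blocks)."""
    hashes = []
    new_blocks = 0
    for chunk in split(data):
        h = hashlib.sha256(chunk).hexdigest()
        if h not in stored:
            # Hash not seen before: this sub-block is new data, so save it.
            stored[h] = chunk
            new_blocks += 1
        hashes.append(h)
    return hashes, new_blocks


stored = {}
_, first_new = write_file(b"AAAABBBBCCCC", stored)   # original file
_, second_new = write_file(b"AAAAXXXXCCCC", stored)  # middle sub-block modified
```

In this sketch the original file stores three sub-blocks, while writing the modified file stores only the one sub-block whose content changed; the two unchanged sub-blocks are recognized as duplicates.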

Corresponding to the data processing method embodiments provided by the present invention, the present invention also provides specific embodiments of a data processing system and a data processing device.

Referring to Fig. 9, it is a schematic diagram of the data processing system provided by an embodiment of the present invention.

A data processing system comprises a data processing device 902 and a storage system 903. The data processing device 902 has a first interface, through which it receives data operation requests from an application program 901. The storage system 903 has a second interface, which is a general-purpose interface of the same type as the first interface, and the storage system 903 interacts with the data processing device 902 through the second interface, wherein:

The data processing device 902 is configured to receive a file write request from the application program; split the data of the file into at least one sub-block, obtain the unique identifier of each sub-block, and compare each with the unique identifiers saved in the metadata management module; if they are identical, establish a link between the sub-block whose unique identifier matches the saved unique identifier and the data corresponding to the unique identifier saved in the metadata management module; if they differ, save the unique identifier of the sub-block that differs from the saved unique identifiers, and send a data write request to the storage system through the second interface, the data write request containing the data of that sub-block, so that the storage system saves the data of the sub-block.

In the embodiments of the present invention, the unique identifier may be a hash value or another identifier.

The storage system 903 is configured to receive the data write request from the data processing device and save the data.

In the embodiments provided by the present invention, the data processing device has a general-purpose first interface of the same type as the second interface of the storage system. The storage system may be a general-purpose storage device or a file system, and has a general-purpose second interface, which may specifically be a POSIX interface. The first interface of the data processing device may likewise be a POSIX interface. The data processing device can exist independently of the storage system.
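The idea that the data processing device and the storage system expose interfaces of the same type may be illustrated as follows. This is a sketch only: the classes DictStorage and DedupLayer and their write/read methods are hypothetical stand-ins for a POSIX-style interface, SHA-256 is an assumed hash function, and each file is treated as a single sub-block for brevity.

```python
import hashlib


class DictStorage:
    """Stand-in storage system exposing a simple write/read interface."""

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        self.files[path] = data

    def read(self, path):
        return self.files[path]


class DedupLayer:
    """Exposes the same write/read interface and forwards only new data."""

    def __init__(self, storage):
        self.storage = storage
        self.known = set()  # unique identifiers already saved
        self.index = {}     # file path -> unique identifier of its data

    def write(self, path, data):
        key = hashlib.sha256(data).hexdigest()
        if key not in self.known:
            # New data: pass it to the storage system through its own interface.
            self.storage.write(key, data)
            self.known.add(key)
        self.index[path] = key  # link the file to the stored data

    def read(self, path):
        return self.storage.read(self.index[path])


backend = DictStorage()
layer = DedupLayer(backend)
layer.write("/a", b"payload")
layer.write("/b", b"payload")  # duplicate content, not stored again
```

Because both classes offer the same calls, the application can use the deduplicating layer wherever it would have used the storage system directly, and the layer does not depend on the internals of any particular storage system.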

Referring to Figure 10, it is a schematic diagram of the data processing device of an embodiment of the present invention. The device comprises:

A first interface, configured to receive write requests from an application program; the first interface and a second interface are interfaces of the same type, and the second interface is associated with a storage system;

The deduplication module comprises:

A splitting module, configured to split the data of the file into at least one sub-block;

A logical address allocation module, configured to obtain the unique identifier of the sub-block, use the unique identifier of the sub-block as the logical address of the sub-block, and compare it with the unique identifiers saved in the metadata management module;

A data management module, configured to, when the unique identifier of the sub-block differs from the unique identifiers saved in the metadata management module, convert the logical address of the sub-block into a physical address and send a data write request to the storage system through the second interface, so that the storage system stores the data of the sub-block in the storage system according to the physical address.
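One possible reading of the logical-to-physical conversion is sketched below. The class DataManager, its address_map table, and the append-only bytearray standing in for the storage system are hypothetical; the description does not specify how the physical address is computed, and SHA-256 is an assumed choice for the unique identifier.

```python
import hashlib


class DataManager:
    """Hypothetical sketch: maps logical addresses to physical locations."""

    def __init__(self):
        self.address_map = {}       # logical address -> (offset, length)
        self.backing = bytearray()  # stands in for the storage system

    def store(self, data: bytes) -> str:
        # The unique identifier of the sub-block serves as its logical address.
        logical = hashlib.sha256(data).hexdigest()
        if logical not in self.address_map:
            # Convert the logical address to a physical one: next free offset.
            offset = len(self.backing)
            self.backing += data
            self.address_map[logical] = (offset, len(data))
        return logical

    def load(self, logical: str) -> bytes:
        offset, length = self.address_map[logical]
        return bytes(self.backing[offset:offset + length])


dm = DataManager()
a = dm.store(b"hello")
b = dm.store(b"world")
c = dm.store(b"hello")  # same identifier, nothing new is written
```

Using a content-derived logical address means that identical sub-blocks always resolve to the same physical location, which is what lets the comparison in the logical address allocation module double as duplicate detection.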

The metadata management module is further configured to, when the unique identifier of the sub-block is identical to a saved unique identifier, establish a link between the sub-block and the data corresponding to the saved unique identifier.

The device further comprises:

A data dictionary, configured to define a first-type data block and second-type data blocks; wherein the first-type data block is used to save address information of the storage system and an index of the second-type data blocks; the address of a second-type data block is associated with the hash value of its content, and the second-type data blocks comprise metadata blocks and sub-blocks.

The device further comprises:

A reference count module, configured to record the number of times a second-type data block is referenced and to delete the second-type data block when its reference count is zero.

The device further comprises:

A snapshot module, configured to copy the first-type data block and the links between the first-type data block and the second-type data blocks to create a snapshot of the first-type data block.

Specifically, the first interface is further configured to receive a file deletion request from the application program;

The device further comprises:

A deletion module, configured to obtain the block information of the file from the saved metadata information of the file, and to disconnect the link between the block and the first-type data block according to the block information;

The reference count module is further configured to, at the request of the deletion module, decrement by 1 the reference count of the data corresponding to the block.

Specifically, the first interface is further configured to receive a file read request from the application program;

The device further comprises:

A read module, configured to access the storage system according to the saved metadata information of the file, obtain the file, and perform the read operation on the file.

The device may further comprise a third interface, which is used for input and output access operations with the second interface.

It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that comprises a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or device that comprises the element.

The present invention may be described in the general context of computer-executable instructions, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present invention may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.

The above are only specific embodiments of the present invention. It should be pointed out that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A data processing method, characterized in that the method is applied to a data processing device, the data processing device is connected to a storage system, the data processing device comprises a first interface, the storage system comprises a second interface, the first interface is of the same type as the second interface, and the storage system performs file read and write operations through the second interface, the method comprising:

2. The method according to claim 1, characterized in that establishing the link between the sub-block identical to the saved unique identifier and the data corresponding to the saved unique identifier comprises:

Pointing the pointer of the sub-block identical to the saved unique identifier to the data corresponding to the unique identifier saved in the metadata management module;

Incrementing the reference count of the data by 1, wherein the reference count indicates the number of times the data is referenced; when a deletion request for the data is received, the reference count of the data is decremented by 1, and when the reference count of the data is zero, the data is deleted.

3. The method according to claim 1, characterized in that saving the unique identifier of the sub-block that differs from the saved unique identifier comprises: saving the unique identifier of the sub-block that differs from the saved unique identifier as the logical address of that sub-block;

Sending the data write request to the storage system through the second interface, the data write request comprising the data of the sub-block that differs from the saved unique identifier, so that the storage system saves the data of the sub-block, comprises:

Converting the logical address of the sub-block into a physical address; and sending the data write request to the storage system through the second interface, so that the storage system saves the data of the sub-block according to the physical address.

4. The method according to claim 1, characterized in that, before providing the first interface and receiving the file write request from the application program through the first interface, the method further comprises:

Defining a first-type data block and second-type data blocks, wherein the first-type data block stores address information of the storage system and an index of the second-type data blocks, and the second-type data blocks comprise metadata blocks and sub-blocks; links are established between the first-type data block and the second-type data blocks.

5. The method according to claim 4, characterized in that the method further comprises:

Copying the first-type data block and the links between the first-type data block and the second-type data blocks to create a snapshot of the first-type data block.

6. A data processing device, characterized in that the data processing device is connected to a storage system, the data processing device comprises a first interface, the storage system comprises a second interface, the first interface is of the same type as the second interface, and the storage system performs file read and write operations through the second interface, wherein:

The first interface is configured to receive a file write request from an application program;

The device further comprises:

7. The device according to claim 6, characterized in that the deduplication module comprises:

A logical address allocation module, configured to obtain the unique identifier of the sub-block and use the unique identifier of the sub-block as the logical address of the sub-block, and to compare the unique identifier of the sub-block with the unique identifiers saved in the metadata management module;

A data management module, configured to, when the unique identifier of the sub-block differs from the unique identifiers saved in the metadata management module, convert the logical address of the sub-block into a physical address and send a data write request to the storage system through the second interface, so that the storage system saves the data of the sub-block according to the physical address;

The metadata management module is further configured to, when the unique identifier of the sub-block is identical to a saved unique identifier, establish a link between the sub-block identical to the saved unique identifier and the data corresponding to the saved unique identifier.

8. The device according to claim 6, characterized in that the device further comprises:

A data dictionary, configured to define a first-type data block and second-type data blocks; wherein the first-type data block is used to save address information of the storage system and an index of the second-type data blocks; the address of a second-type data block is associated with the unique identifier of its data, and the second-type data blocks comprise metadata blocks and sub-blocks; links are established between the first-type data block and the second-type data blocks.

9. The device according to claim 8, characterized in that the device further comprises:

A reference count module, configured to record the number of times a second-type data block is referenced and to delete the second-type data block when its reference count is zero;

A snapshot module, configured to copy the first-type data block and the links between the first-type data block and the second-type data blocks to create a snapshot of the first-type data block.

10. A data processing system, characterized in that the system comprises a data processing device and a storage system, the data processing device has a first interface, and the data processing device receives data operation requests from an application program through the first interface; the storage system has a second interface, the second interface and the first interface are interfaces of the same type, and the storage system interacts with the data processing device through the second interface, wherein: