Embodiment
The embodiments of the present invention provide a data processing method and device that can easily add a data deduplication function to a general-purpose file system at low cost, support real-time deduplication, and effectively save storage space.
To help those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Referring to Fig. 1, which is a flowchart of a first embodiment of the data processing method provided by the embodiments of the present invention, the method comprises:
S101: Provide a first interface, and receive a file write operation request from an application program through the first interface.
The method provided by the embodiment of the present invention is applied to a data processing device. The data processing device is connected to an original storage system. The data processing device has a first interface and the storage system has a second interface, where the first interface and the second interface are general-purpose interfaces of the same type.
In an original storage system that does not support data deduplication, the application program accesses the original storage system through the second interface (for example, a POSIX interface). The original storage system is an original storage device or file system.
The first interface of the data processing device receives operation requests from the application program and is of the same type as the interface that the original storage system offers to the application program. From the point of view of the application program, it sees a device that supports data deduplication; from the point of view of the original storage system, file I/O access still takes place through the second interface. Specifically, the first interface and the second interface may be general-purpose standard interfaces, for example POSIX interfaces.
S102: Cut the data of the file into at least one sub-block, obtain a unique identifier of each sub-block, and compare it with the saved unique identifiers. If the unique identifier of a sub-block is identical to a saved unique identifier, establish a link between that sub-block and the data corresponding to the saved unique identifier. If it differs from the saved unique identifiers, save the unique identifier of that sub-block and send a data write operation request to the storage system through the second interface, where the data write operation request includes the data of that sub-block, so that the storage system saves the data of the sub-block.
In the embodiments of the present invention, the unique identifier of a sub-block may be the hash value of the sub-block, or an identifier calculated with another algorithm.
Referring to Fig. 3, which is a schematic flowchart of deduplication processing in an embodiment of the present invention.
Specifically, step S102 is implemented through the following steps:
S201: Cut the data of the file into at least one sub-block.
In the embodiment of the present invention, the data of the file may be cut into one or more data blocks using fixed-length or variable-length chunking. It should be noted that when the file is an empty file, that is, it contains no data, only the metadata of the file is saved and no cutting operation is needed.
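The fixed-length variant of step S201 can be illustrated with a minimal sketch. The function name `split_fixed` and the default block size of 4096 bytes are assumptions made for illustration only; the embodiment does not prescribe a block size.

```python
from typing import List


def split_fixed(data: bytes, block_size: int = 4096) -> List[bytes]:
    """Cut file data into fixed-length sub-blocks (hypothetical helper).

    The last sub-block may be shorter than block_size. An empty file
    yields no sub-blocks, so only its metadata would need to be saved.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]
```

For example, `split_fixed(b"abcdefghij", 4)` yields three sub-blocks of 4, 4, and 2 bytes, while an empty input yields an empty list.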
S202: Obtain the hash value of each sub-block, compare it with the saved hash values, and determine whether an identical hash value exists. If so, go to step S203; if not, go to step S204.
Specifically, in the embodiment of the present invention, when a new file is created, a hash calculation is first performed to obtain the address of the file metadata. As a concrete example, after a file named a.txt is created, a hash calculation with one possible algorithm yields the file metadata address c03b9d52f4052f6886f6d1a79701122d (the address value of the metadata depends on the specific hash algorithm, and the present invention does not limit the hash algorithm). What is recorded at this address is the addresses of the data blocks. The metadata of one file may be kept in one file, or the metadata of multiple files may be kept in one file or in a database; the present invention does not limit the storage format of the metadata. In the embodiment of the present invention, a metadata management module records the metadata information of the file, which includes the file name, file size, file type, creation time, modification time, and so on. After the data of the file is cut into one or more sub-blocks, the hash value of each sub-block is calculated; the hash value reflects the mapping between the data and the address where the data is stored. The hash algorithm may be MD4, MD5, or SHA-1, and the present invention does not limit this. The hash value of each sub-block is recorded in the metadata information. For example, if the data contained in a.txt is split into two sub-blocks whose hash addresses are 659df9a166c3cca249ecfa481d074a9b and 50f74ff709ed8b9547d21f130f7eefd2, these two hash values are saved as the content of the block indicated by c03b9d52f4052f6886f6d1a79701122d. The data blocks identified by 659df9a166c3cca249ecfa481d074a9b and 50f74ff709ed8b9547d21f130f7eefd2 hold the actual data. In the embodiment of the present invention, the metadata management module stores the metadata information and the block information of the file, where the metadata information includes the hash value of each sub-block.
The hash value of each data block is compared with the hash values already held by the metadata management module. If an identical hash value exists, go to step S203; otherwise, go to step S204.
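The comparison in step S202 can be sketched as follows. MD5 is used because it is one of the algorithms named above; the function names `block_id` and `compare_with_saved`, and the use of an in-memory set to stand in for the hash values held by the metadata management module, are assumptions for illustration.

```python
import hashlib
from typing import Iterable, List, Set, Tuple


def block_id(block: bytes) -> str:
    """Hash value of one sub-block, as a 32-character hexadecimal string."""
    return hashlib.md5(block).hexdigest()


def compare_with_saved(blocks: Iterable[bytes],
                       saved_ids: Set[str]) -> List[Tuple[str, bool]]:
    """Return (hash value, is_duplicate) for each sub-block, in order.

    A sub-block is a duplicate if its hash value is already saved, or if
    an earlier sub-block of the same file produced the same hash value.
    """
    seen = set(saved_ids)  # copy, so the caller's set is not modified
    result = []
    for block in blocks:
        h = block_id(block)
        result.append((h, h in seen))
        seen.add(h)
    return result
```

Sub-blocks flagged as duplicates would proceed to step S203, and the others to step S204.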
S203: Establish a link between the sub-block whose hash value is identical to a saved hash value and the data corresponding to that saved hash value.
If the hash value of a sub-block is identical to a hash value saved by the metadata management module, the data of the sub-block already exists and is duplicate data. The data of the sub-block is therefore not saved; instead, a link is established directly between the sub-block and the data corresponding to the saved hash value. Specifically, in the embodiment of the present invention, the link may be established with a pointer, that is, the pointer of the data block is pointed at the data corresponding to the saved hash value. Other ways of establishing the link may also be used, and the present invention does not limit this. At the same time, the reference count of the data is incremented by 1. In the embodiment of the present invention, the reference count indicates the number of times a sub-block is referenced by higher-level data blocks. When the reference count of the data reaches zero, the data is deleted.
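The link and reference-count handling of step S203 can be sketched with plain dictionaries. The names `link_duplicate` and `unlink`, and the representation of a link as a hash value appended to a file's block list, are illustrative assumptions; the embodiment only requires that some form of pointer be used.

```python
from typing import Dict, List


def link_duplicate(file_blocks: List[str], refcounts: Dict[str, int],
                   block_hash: str) -> None:
    """Point the file at already-stored data instead of saving it again."""
    file_blocks.append(block_hash)  # the link: metadata refers to existing data
    refcounts[block_hash] += 1      # one more higher-level block references it


def unlink(stored: Dict[str, bytes], refcounts: Dict[str, int],
           block_hash: str) -> None:
    """Drop one reference; delete the data once nothing references it."""
    refcounts[block_hash] -= 1
    if refcounts[block_hash] == 0:
        del refcounts[block_hash]
        del stored[block_hash]
```

Keeping the count next to the data means a block shared by several files is removed only after the last reference to it is dropped.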
S204: Save the unique identifier of the sub-block whose hash value differs from the saved hash values, and send a data write operation request to the storage system through the second interface, where the data write operation request includes the data of that sub-block, so that the storage system saves the data of the sub-block.
If the hash value of a sub-block differs from the existing hash values, the data of the sub-block is not duplicate data. The hash value of the data block is saved to the metadata management module and used as the logical address of the data block; the logical address of the data block is converted into a physical address, and the data of the data block is stored in the storage system through the second interface according to the physical address.
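Step S204 can be sketched as follows. The text does not specify how a logical address is converted into a physical address, so the layout used here (a file named after the hash value, inside a directory named after its first two characters) is purely an assumption, as are the names `physical_path` and `store_block`. Ordinary file I/O stands in for the second interface.

```python
import hashlib
import os


def physical_path(root: str, logical_addr: str) -> str:
    """Map a logical address (the hash value) to a path; assumed layout."""
    return os.path.join(root, logical_addr[:2], logical_addr)


def store_block(root: str, block: bytes) -> str:
    """Save a non-duplicate sub-block and return its logical address."""
    logical_addr = hashlib.md5(block).hexdigest()
    path = physical_path(root, logical_addr)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:  # plain file I/O toward the storage system
        f.write(block)
    return logical_addr
```

Because only standard file operations are used, the underlying file system needs no change, which matches the aim of keeping the second interface untouched.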
In the first embodiment provided by the present invention, a general-purpose first interface is provided to the application program and receives its file write operation requests. If the hash comparison shows that a sub-block of the file is duplicate data, the data processing device performs deduplication; if the hash comparison shows that the hash value of a sub-block differs from the hash values held in the metadata management module, the data of the sub-block is sent to the storage system for storage. Because deduplication is performed when the file is first written, duplicate data is effectively reduced and storage space is saved. In addition, the embodiment of the present invention provides one general-purpose first interface for all application programs, identical to the interface that the original storage system offers to application programs, so that an ordinary file system, that is, the storage system, can support deduplication without any change to its interface. An ordinary file system thus gains a deduplication function at low cost.
Referring to Fig. 3, which is a flowchart of a second embodiment of the data processing method provided by the embodiments of the present invention.
The method provided by the present invention is described in detail below through a concrete example.
S301: Define first-type data blocks and second-type data blocks.
First, data blocks are divided into first-type data blocks and second-type data blocks. A first-type data block may also be called a traditional address block: its address is independent of its content and is allocated directly in the traditional way. A first-type data block, that is, a traditional address block, stores the address information of the storage system (file system) and the index of the second-type data blocks; the second-type data blocks can be found through this index. Because the first-type data block records the information of the storage system (file system), the indexes of the other metadata and data blocks can be found from it. A second-type data block may also be called a content address block: its address is determined by the hash value of its content, so data blocks with different content will in general have different hash values and therefore different addresses. Second-type data blocks generally include metadata blocks and sub-blocks.
The device provided by the present invention can be regarded as a file system composed of these "traditional address blocks" and "content address blocks". Suppose that in a file system the root directory contains subdirectory 1 and subdirectory 2. Subdirectory 1 stores two files, file 1 and file 2, where file 1 is an empty file and file 2 holds 1 KB of data. Subdirectory 2 is an empty directory. Referring to Fig. 4, the data block representing the root directory is a first-type data block, from which the indexes of the other metadata blocks and sub-blocks can be obtained. Metadata block 1, metadata block 2, metadata block 3, and the data block (Data Block) are second-type data blocks. For the file system described above, the metadata of subdirectory 1 is stored in metadata block 1 and the metadata of subdirectory 2 in metadata block 2. Because subdirectory 1 stores two files, the addresses of file 1 and file 2 are obtained by hash calculation and saved in metadata block 1. As shown in Fig. 4, metadata block 1 holds two pieces of address information, where address 1 represents the address of file 2 and address 2 represents the address of file 1. This is only an example; address 1 could equally represent the address of file 1 and address 2 the address of file 2. These two addresses lead to the metadata information of file 1 and file 2, and to file 1 and file 2 themselves. The address corresponding to the data in file 2 is then obtained by hash calculation and saved in metadata block 3, which corresponds to file 2. Referring to Fig. 3, because file 2 holds 1 KB of data, metadata block 3, pointed to by address 1, stores address 3, the address corresponding to the data of file 2, and address 3 points to the data of file 2. Because file 1 has no data, address 2 points only to metadata block 3, and metadata block 3 is not linked to any concrete data block. As the figure shows, in the method provided by the embodiment of the present invention, metadata and data are in general treated as peers: both are attached to the file system tree at addresses generated from hash values, and both are second-type data blocks. Each second-type data block carries a reference count that indicates how many higher-level data blocks reference it.
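The distinction between the two block types can be expressed as a small data-structure sketch. The class names, the integer address of the traditional address block, and the `attach` helper are hypothetical; MD5 again stands in for whichever hash algorithm is chosen.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ContentAddressBlock:
    """Second-type block: its address is the hash value of its content."""
    content: bytes
    refcount: int = 0  # how many higher-level blocks reference this block

    @property
    def address(self) -> str:
        return hashlib.md5(self.content).hexdigest()


@dataclass
class TraditionalAddressBlock:
    """First-type block: its address is allocated independently of content."""
    address: int
    index: Dict[str, str] = field(default_factory=dict)  # name -> content address


def attach(root: TraditionalAddressBlock, name: str,
           block: ContentAddressBlock) -> None:
    """Record the block in the root index and count the new reference."""
    root.index[name] = block.address
    block.refcount += 1
```

Two content address blocks with identical content always resolve to the same address, which is the property the deduplication comparison relies on.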
S302: Receive a request from the application program to create a first file.
S303: Save the metadata information of the first file.
Referring to Figs. 5a-5d, which are schematic diagrams of the data processing method of the second embodiment of the present invention.
FR (Filesystem Root) denotes the file system entry and is a first-type data block. After the request to create the first file is received, the hash value of the first file is calculated and used as the address of the metadata of the first file. The metadata may include information such as the file name, file size, file type, and file creation time. In Fig. 5a, FR denotes the file system entry, MB1 denotes the metadata block of the first file, and the number on a block denotes its reference count; the reference count of MB1 is 1, indicating that it is referenced once by a higher-level data block. A link is established between the file system entry FR and the metadata block MB1.
S304: Cut the first file into at least one sub-block, obtain the hash value of each sub-block, and compare it with the saved hash values. If they are identical, establish a link between the sub-block and the data corresponding to the saved hash value; if they differ, save the data of the sub-block to the storage system.
Suppose the first file is cut into one sub-block and the comparison shows that the hash value of the sub-block differs from the saved hash values. The hash value of the sub-block of the first file and the data of the sub-block are then saved; as shown in Fig. 5b, a data block DB1 is created to hold the data of the sub-block. At the same time, a link is established between the metadata block and the sub-block, that is, between MB1 and DB1. The data of the sub-block of the first file can then be found from the metadata block MB1.
S305: Receive a request to create a second file.
The file name of the second file differs from that of the first file, and the content of the second file is identical to that of the first file.
S306: Save the metadata information of the second file.
The metadata information of the second file is saved. In Fig. 5c, MB2 denotes the metadata block of the second file, and the number on a block denotes its reference count.
S307: Cut the second file into at least one sub-block, obtain the hash value of each sub-block, and compare it with the saved hash values. If they are identical, establish a link between the sub-block and the data corresponding to the saved hash value; if they differ, save the data of the sub-block to the storage system.
Suppose the second file is cut into one sub-block. The hash value of the sub-block is obtained, and an index lookup under FR finds that an identical hash value already exists, so a link is established between the sub-block and the data corresponding to that hash value. As shown in Fig. 5d, MB2 is created to store the metadata information of the second file. Because the data of the second file is identical to the data of the first file, the hash calculation finds an existing identical hash value; the data of the second file is therefore not stored, and only the link between MB2 and DB1 is established. At the same time the reference count of DB1 is incremented by 1, so the reference count of DB1 is now 2.
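The sequence S302 to S307 can be replayed with a small in-memory model. The class `DedupStore`, its dictionaries, and the file names used in the usage note are assumptions for illustration; the `blocks` dictionary stands in for the storage system and `files` for the metadata blocks.

```python
import hashlib
from typing import Dict, List


class DedupStore:
    """Toy model: file metadata lists sub-block hash values; data is shared."""

    def __init__(self, block_size: int = 4096) -> None:
        self.block_size = block_size
        self.files: Dict[str, List[str]] = {}  # file name -> sub-block hashes
        self.blocks: Dict[str, bytes] = {}     # hash value -> stored data
        self.refcount: Dict[str, int] = {}     # hash value -> reference count

    def create_file(self, name: str, data: bytes) -> None:
        if name in self.files:
            raise FileExistsError(name)
        hashes = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            h = hashlib.md5(block).hexdigest()
            if h in self.blocks:
                self.refcount[h] += 1   # duplicate: link only
            else:
                self.blocks[h] = block  # new data: store it
                self.refcount[h] = 1
            hashes.append(h)
        self.files[name] = hashes
```

Creating `first.txt` and then `second.txt` with the same content leaves a single stored data block whose reference count is 2, mirroring DB1 in Fig. 5d.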
Referring to Fig. 6, which is a schematic flowchart of a third embodiment of the data processing method provided by the present invention.
Snapshots are a feature of storage systems whose importance keeps growing. There are many ways to implement snapshots, but they generally all have to be both fast and data-consistent, and may even conflict with other functions; the prior-art methods cannot support deduplication and snapshot functions at the same time. The method provided by the embodiment of the present invention allows the snapshot and deduplication functions to be used together, and achieves speed and consistency in a very simple way. In the embodiment of the present invention, the snapshot function remains applicable when file data is modified, deleted, created, or added. Deletion and addition of file data are used below as examples to show that the snapshot function and the deduplication function can be used at the same time.
S601: Define first-type data blocks and second-type data blocks.
S602: Copy the first-type data block and the links between the first-type data block and the second-type data blocks, creating a snapshot of the first-type data block.
Referring to Figs. 7a to 7f, which are schematic diagrams of the third embodiment of the present invention.
Taking the file system shown in Fig. 7a as an example, the snapshot implementation is described below. The first-type data block (FR in Fig. 7a) is copied once, the copy is named copy and serves as the snapshot, and the links between the first-type data block and its next-level sub-blocks are copied as well. As shown in Fig. 7b, copy is the snapshot of FR, and the reference count of every data block referenced by copy is incremented by 1, so the reference counts of MB1 and MB2 both become 2.
S603: Receive a file deletion operation request from the application program.
A file deletion operation is used below to show that the above snapshot works. This is only one concrete example of the method of the invention; the snapshot remains applicable to any modification or deletion operation on a file.
S604: According to the saved metadata information of the file, obtain the block information of the file, and disconnect the link between the block and the first-type data block according to the block information.
Specifically, suppose the file to be deleted is the second file. According to the metadata information of the second file, its block data is the data block indicated by MB2. The link between the first-type data block (FR) and the sub-block MB2, shown as a dotted line in Fig. 7c, is then deleted; the result is shown in Fig. 7d. The second file can still be accessed from the snapshot copy, but the reference count of the metadata block MB2 of the second file is decremented by 1. The metadata block MB2 and the data block DB1 corresponding to the second file are not deleted at this point, because they are still referenced by other data blocks. A data block is deleted when its reference count reaches zero.
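Steps S602 to S604 can be modelled with a few dictionaries. The class `BlockTree` and its method names are hypothetical, and block identifiers such as "MB1" and "DB1" are passed in directly rather than derived from hash values, to keep the sketch aligned with the labels in Figs. 7a to 7d.

```python
from typing import Dict, List


class BlockTree:
    """Toy model of roots (first-type) and reference-counted blocks."""

    def __init__(self) -> None:
        # root name -> {file name -> metadata block id}
        self.roots: Dict[str, Dict[str, str]] = {"FR": {}}
        self.meta: Dict[str, List[str]] = {}  # metadata block id -> data block ids
        self.refcount: Dict[str, int] = {}

    def add_file(self, name: str, meta_id: str, data_ids: List[str]) -> None:
        self.meta[meta_id] = list(data_ids)
        for d in data_ids:
            self.refcount[d] = self.refcount.get(d, 0) + 1
        self.roots["FR"][name] = meta_id
        self.refcount[meta_id] = 1

    def snapshot(self, copy_name: str) -> None:
        """Copy the root and its links; each linked block gains a reference."""
        self.roots[copy_name] = dict(self.roots["FR"])
        for meta_id in self.roots[copy_name].values():
            self.refcount[meta_id] += 1

    def _release(self, block_id: str) -> None:
        self.refcount[block_id] -= 1
        if self.refcount[block_id] == 0:
            del self.refcount[block_id]
            for child in self.meta.pop(block_id, []):
                self._release(child)

    def delete_file(self, name: str) -> None:
        """Disconnect the file from FR; blocks survive while still referenced."""
        self._release(self.roots["FR"].pop(name))
```

After adding two files that share DB1, taking a snapshot named copy, and deleting the second file from FR, the second file is still reachable through copy and MB2 keeps a reference count of 1.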
S605: Receive a request to create a third file.
The metadata information of the third file is saved. As shown in Fig. 7e, a metadata block MB3 is added, with a reference count of 1.
S606: Cut the third file into at least one sub-block, obtain the hash value of each sub-block, and compare it with the saved hash values. If they are identical, establish a link between the sub-block and the data corresponding to the saved hash value; if they differ, save the data of the sub-block to the storage system.
In the present embodiment, suppose the third file is cut into two sub-blocks, and the hash values of the two sub-blocks are obtained. The hash value of the first data block is compared with the existing hash values and an identical hash value is found, so a link is established between this sub-block and the data corresponding to the hash value. As shown in Fig. 7f, the content of the first data block is identical to the content of the data blocks of the two preceding files, so a link between MB3 and DB1 is established and the reference count of DB1 is incremented by 1. The hash value of the second data block differs from the existing hash values, so a new data block DB2 is created and the data of the data block is saved.
As the above embodiment shows, the method provided by the present invention can support the snapshot and deduplication functions at the same time, and is simple to implement.
In the methods provided by the prior art, metadata is recorded in a database, so when a deduplicated file is modified again, the modified part can no longer be deduplicated. After one round of deduplication a file becomes a number of file fragments, and the specific addresses of these fragments are recorded in the database. At that point every fragment may satisfy the deduplication relationship well, but when one of the fragments is modified, such a scheme does not allocate a new address to the fragment or re-fragment it, so the deduplication effect of the whole system declines once deduplicated files are modified. To solve this problem, the method provided by the embodiment of the present invention still cuts the entire file after it is modified and performs the deduplication operation again, so its deduplication effect is better than that of the prior-art scheme.
Referring to Fig. 8, which is a schematic diagram of a fourth embodiment of the data processing method provided by the present invention.
S801: Receive a file modification operation request from the application program.
S802: According to the saved metadata information of the file, access the storage system and obtain the file.
The storage system is accessed according to the metadata information saved when the file was first created, and the file is obtained.
S803: Receive a file write operation request from the application program.
After the application program has modified the file, a request to write the modified file is received.
S804: Cut the data of the modified file into at least one sub-block, obtain the hash value of each sub-block, and compare it with the saved hash values. If they are identical, establish a link between the sub-block and the data corresponding to the saved hash value; if they differ, save the data of the sub-block to the storage system.
In the embodiment provided by the present invention, the entire file is still cut after it is modified. If a sub-block of the modified file is duplicate data, the sub-block is not saved; if the hash value of a sub-block of the modified file differs from the existing hash values, the sub-block is new data and its data is saved.
In this embodiment, after a file is modified it is cut again and the hash values are recalculated and compared to determine whether duplicate data exists, so a better deduplication effect is obtained.
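The re-cutting of a modified file in step S804 can be sketched as follows. The function `rewrite_file`, the small block size, and the choice to release the references held by the previous version of the file are assumptions for illustration; the text above only states that the whole file is cut and compared again.

```python
import hashlib
from typing import Dict, List


def rewrite_file(old_hashes: List[str], data: bytes,
                 blocks: Dict[str, bytes], refcount: Dict[str, int],
                 block_size: int = 4) -> List[str]:
    """Re-cut the whole modified file and deduplicate every sub-block again."""
    new_hashes = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        h = hashlib.md5(block).hexdigest()
        if h not in blocks:
            blocks[h] = block  # new data is saved
        refcount[h] = refcount.get(h, 0) + 1
        new_hashes.append(h)
    # Release the references of the previous version only after the new
    # ones are taken, so unchanged sub-blocks are never deleted in between.
    for h in old_hashes:
        refcount[h] -= 1
        if refcount[h] == 0:
            del refcount[h]
            del blocks[h]
    return new_hashes
```

If a file containing `aaaabbbb` is rewritten as `aaaacccc`, the unchanged first sub-block is shared between the versions and only the changed second sub-block is stored as new data.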
Corresponding to the data processing method embodiments provided by the present invention, the present invention also provides specific embodiments of a data processing system and a data processing device.
Referring to Fig. 9, which is a schematic diagram of the data processing system provided by the embodiments of the present invention.
A data processing system comprises a data processing device 902 and a storage system 903. The data processing device 902 has a first interface and receives data operation requests from an application program 901 through the first interface. The storage system 903 has a second interface, the second interface and the first interface being general-purpose interfaces of the same type, and the storage system 903 interacts with the data processing device 902 through the second interface, wherein:
The data processing device 902 is configured to receive a file write operation request from the application program; cut the data of the file into at least one sub-block, obtain the unique identifier of each sub-block, and compare it with the unique identifiers saved in the metadata management module; if a sub-block's unique identifier is identical to a saved unique identifier, establish a link between that sub-block and the data corresponding to the unique identifier saved in the metadata management module; if it differs from the saved unique identifiers, save the unique identifier of that sub-block and send a data write operation request to the storage system through the second interface, where the data write operation request includes the data of that sub-block, so that the storage system saves the data of the sub-block.
In the embodiments of the present invention, the unique identifier may be a hash value or another identifier.
The storage system 903 is configured to receive the data write operation request from the data processing device and save the data.
In the embodiment provided by the present invention, the data processing device has a general-purpose first interface, which is of the same type as the second interface of the storage system. The storage system may be a file system or a storage device and has a general-purpose second interface, which may specifically be a POSIX interface. The first interface of the data processing device may likewise be a POSIX interface. The data processing device can exist independently of the storage system. The storage system may be a general-purpose storage device or file system.
Referring to Fig. 10, which is a schematic diagram of the data processing device of an embodiment of the present invention. The device comprises:
A first interface, configured to receive write operation requests from the application program; the first interface and the second interface are interfaces of the same type, and the second interface is associated with the storage system;
A metadata management module, configured to store the metadata and the block information of files;
A deduplication module, configured to cut the data of a file received through the first interface into at least one sub-block, obtain the unique identifier of each sub-block, and compare it with the unique identifiers saved in the metadata management module; if a sub-block's unique identifier is identical to a saved unique identifier, establish a link between that sub-block and the data corresponding to the unique identifier saved in the metadata management module; if it differs from the saved unique identifiers, save the unique identifier of that sub-block and send a data write operation request to the storage system through the second interface, where the data write operation request includes the data of that sub-block, so that the storage system saves the data of the sub-block.
In the embodiments of the present invention, the unique identifier may be a hash value or another identifier.
The deduplication module comprises:
A cutting module, configured to cut the data of the file into at least one sub-block;
A logical address allocation module, configured to obtain the unique identifier of each sub-block, use the unique identifier of the sub-block as the logical address of the sub-block, and compare it with the unique identifiers saved in the metadata management module;
A data management module, configured to, when the unique identifier of a sub-block differs from the unique identifiers saved in the metadata management module, convert the logical address of the sub-block into a physical address and send a data write operation request to the storage system through the second interface, so that the data of the sub-block is stored in the storage system through the second interface according to the physical address.
The metadata management module is further configured to, when the unique identifier of a sub-block is identical to a saved unique identifier, establish a link between the sub-block and the data corresponding to the saved unique identifier.
The device further comprises:
A data dictionary, configured to define first-type data blocks and second-type data blocks; the first-type data block is used to save the address information of the storage system and the index of the second-type data blocks; the address of a second-type data block is associated with the hash value of its content, and the second-type data blocks include metadata blocks and sub-blocks.
The device further comprises:
A reference count module, configured to record the number of times a second-type data block is referenced and, when the reference count reaches zero, delete the second-type data block.
The device further comprises:
A snapshot module, configured to copy the first-type data block and the links between the first-type data block and the second-type data blocks, creating a snapshot of the first-type data block.
Specifically, the first interface is further configured to receive a file deletion operation request from the application program;
The device further comprises:
A deletion module, configured to obtain the block information of the file according to the saved metadata information of the file, and disconnect the link between the block and the first-type data block according to the block information;
The reference count module is further configured to, at the request of the deletion module, decrement by 1 the reference count of the data corresponding to the block.
Specifically, the first interface is further configured to receive a file read operation request from the application program;
The device further comprises:
A read module, configured to access the storage system according to the saved metadata information of the file, obtain the file, and perform the read operation on the file.
The device may further comprise a third interface, which is used to perform input and output access operations with the second interface.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that comprises a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or device that comprises the element.
The present invention can be described in the general context of computer-executable instructions, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present invention can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The above are only specific embodiments of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.