CN102467572A

CN102467572A - Data block query method supporting a data deduplication procedure

Info

Publication number: CN102467572A
Application number: CN2010105761462A
Authority: CN
Inventors: 刘威; 王云松; 陈志丰
Original assignee: Inventec Corp
Current assignee: Shenzhen Excellent Clothing Co Ltd
Priority date: 2010-11-17
Filing date: 2010-11-17
Publication date: 2012-05-23
Anticipated expiration: 2030-11-17
Also published as: CN102467572B

Abstract

A data block query method supporting a data deduplication procedure, which increases the speed at which the deduplication procedure queries data blocks. The query method comprises the following steps: storing a hash index list in a server; generating, in a client, data blocks and hash values from an input file; sending a query request from the client to the server, the query request recording the hash value of the corresponding data block; when the hash value is not stored in the server, sending a storage request from the server to the client and adding the received hash value to the hash index list; establishing a corresponding associated data index list for the hash values in the hash index list, and recording in the associated data index list information on the data blocks related to each hash value; and when the hash value is stored in the server, returning the hash values in the corresponding associated data index list to the client according to the hash value.

Description

Data block query method supporting a data deduplication procedure

Technical field

The present invention relates to a data block query method, and in particular to a data block query method supporting a data deduplication procedure.

Background technology

Data deduplication is a data reduction technique generally used in disk-based backup systems; its main purpose is to reduce the storage capacity used in a storage system. It works by searching, within a given time period, for duplicate variable-size data blocks at different locations in different files. Duplicate data blocks are replaced with pointers. Storage systems are always flooded with large amounts of redundant data. To address this problem and save more space, deduplication has naturally become a focus of attention. Deduplication can reduce stored data to 1/20 of its original size, freeing more backup space. This not only allows backup data to be retained on the storage system for longer, but also saves the large amount of bandwidth required for offline storage.

To preserve data integrity, the input file is cut during deduplication. Cutting the input file produces multiple data blocks. To manage the data blocks effectively, an index file is used during cutting to record the storage information of all data blocks.

After cutting the entire input file (into fixed-length or variable-length blocks), the client immediately generates the hash value corresponding to each data block. The client then sends a query request to the server, using the hash value to ask whether the server holds an identical hash value. The server looks up each query request in the hash index table and returns the query result over the network. Please refer to Fig. 1, which is a schematic diagram of querying data blocks in the prior art.

When the volume of data queried by client 110 is very large, the hash index table grows sharply with it, and server 120 may not have enough memory to hold the hash index table. The hash index table then has to be queried from a slow, file-backed storage device, which greatly slows down the whole system.

Summary of the invention

In view of the above problem, the technical problem to be solved by the present invention is to provide a data block query method supporting a data deduplication procedure, which is applied to querying the multiple data blocks produced by the data deduplication procedure and thereby increases the query speed for data blocks.

To achieve the above object, the data block query method supporting a data deduplication procedure disclosed by the present invention comprises the following steps: storing a hash index list in a server, the hash index list recording multiple sets of hash values; loading an input file in a client, and generating the data blocks of the input file and the hash value of each data block; sending a query request from the client to the server, the query request recording the hash value of the corresponding data block, in order to ask the server whether it holds an identical hash value; if the hash value is not stored in the hash index list of the server, the server sends a storage request to the client so that the data block corresponding to the hash value is transmitted to the server for storage, and the server adds the received hash values to the hash index list in order; establishing a corresponding associated data index for each hash value in the hash index list, and recording in the associated data index the other hash values related to that hash value; if the hash value is stored in the server, the server returns the hash values in the corresponding associated data index to the client along with it; the next time the client queries the hash value of a data block, the client first checks whether the hash value already exists in the associated data index it received; if the hash value exists in the associated data index received by the client, the hash value information, or descriptive information on the data block related to the hash value, is obtained from the associated data index (for example, the number of times the data block has been referenced, which can be incremented as references require); if the hash value does not exist in the associated data index received by the client, the client queries the server for the hash value.

Because the associated data index reflects the association between data blocks (their sequential order), and because the server continually adjusts the associated data index according to statistical information during use, the hit rate of client queries in local memory can be guaranteed to a certain extent. The server can obtain a large number of related records for the cost of a single access to the slow storage device, which greatly reduces the problem of repeated client query requests forcing the server to read the slow storage device again and again. At the same time, sending the data index set in a single network transfer also reduces the time spent on request/acknowledgment round trips over the network.

The present invention is described below with reference to the accompanying drawings and specific embodiments, which are not intended to limit the invention.

Description of drawings

Fig. 1 is a schematic diagram of querying data blocks in the prior art;

Fig. 2 is an architecture diagram of the present invention;

Fig. 3 is a flowchart of the operation of the present invention;

Fig. 4 is a schematic diagram of recording the associated data index set of the present invention.

Reference numerals:

110 client

120 server

210 server

211 hash index list

212 associated data index

220 client

Embodiment

The structural and working principles of the present invention are described in detail below with reference to the accompanying drawings:

Please refer to Fig. 2, which is an architecture diagram of the present invention. The present invention includes server 210 and client 220. Client 220 can connect to server 210 over the Internet or a corporate intranet; client 220 and server 210 can also run on the same computer device. Server 210 further includes hash index list 211, which records multiple sets of hash values. When client 220 sends server 210 a query request for the hash value of a data block of an input file, server 210 performs the query according to the contents of hash index list 211 in the following manner. Please refer to Fig. 3, which is a flowchart of the operation of the present invention.

Step S310: a hash index list is stored in the server, the hash index list recording multiple sets of hash values;

Step S320: the client loads an input file and generates the data blocks of the input file and the hash value of each data block;

Step S330: the client sends a query request to the server, the query request recording the hash value of the corresponding data block, in order to ask the server whether it holds an identical hash value;

Step S340: if the hash value is not stored in the hash index list of the server, the server sends a storage request to the client so that the data block corresponding to the hash value is transmitted to the server for storage, and the server adds the received hash values to the hash index list in order;

Step S350: a corresponding associated data index is established for each hash value in the hash index list, and the other hash values related to that hash value are recorded in the associated data index; and

Step S360: if the hash value is stored in the server, the server returns the hash values in the corresponding associated data index to the client along with it.

The input file is loaded in client 220, which cuts the input file and generates the data blocks of the input file and the hash value of each data block. The hash algorithm can be, but is not limited to, SHA-1 or MD5. The data blocks are produced by fixed-length partitioning (fixed-size partition) or by content-based variable-length partitioning (content-defined chunking, CDC). The fixed-length cutting algorithm cuts the input file using a predefined block size; its advantages are simplicity and high performance. The content-defined cutting algorithm is a variable-length chunking algorithm that uses data fingerprints (such as Rabin fingerprints) to split the file into blocks of differing lengths. Unlike the fixed-length cutting algorithm, it cuts blocks based on the file content, so the block size is variable.
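The fixed-length variant of this client-side step can be sketched as follows. This is only an illustration under assumptions: the 4096-byte block size and the function names `split_fixed` and `hash_blocks` are hypothetical and do not come from the patent, and SHA-1 is chosen simply because the text names it as one possible algorithm.

```python
import hashlib
from typing import List, Tuple


def split_fixed(data: bytes, block_size: int = 4096) -> List[bytes]:
    """Cut the input into blocks of at most block_size bytes (fixed-length cutting).

    The default block size is an assumption made for this sketch.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def hash_blocks(blocks: List[bytes]) -> List[Tuple[str, bytes]]:
    """Pair every data block with its SHA-1 hex digest."""
    return [(hashlib.sha1(block).hexdigest(), block) for block in blocks]
```

Two blocks with identical content produce the same digest, which is what lets the server recognise a duplicate from the hash value alone. A content-defined variant would replace `split_fixed` with a routine that picks cut points from a rolling fingerprint; that is not shown here.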

Client 220 then sends a query request to server 210, the query request recording the hash value of the corresponding data block, in order to ask server 210 whether it holds an identical hash value. If the hash value is not stored in hash index list 211 of server 210, server 210 sends a storage request to client 220 so that the data block corresponding to the hash value is transmitted to server 210 for storage, and server 210 adds the received hash values to hash index list 211 in order. A corresponding associated data index 212 is established for each hash value in hash index list 211, and information on the data blocks related to that hash value is recorded in associated data index 212. For example, associated data index 212 can store the hash value of a data block, the number of the data block, or index information on the storage location of the data block.
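The server-side handling of a query can be sketched roughly as below. The class `DedupServer`, the `QueryReply` structure and the in-memory dictionaries are assumptions made for illustration; the patent does not prescribe any particular data structure or message format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class QueryReply:
    found: bool             # True if the hash value was already in the hash index list
    storage_request: bool   # True if the client is asked to send the data block
    related: List[str] = field(default_factory=list)  # contents of the associated data index


class DedupServer:
    """Hypothetical in-memory stand-in for server 210."""

    def __init__(self) -> None:
        # Hash index list: hash value -> block number, kept in arrival order.
        self.hash_index: Dict[str, int] = {}
        # Associated data index: hash value -> related hash values.
        self.related: Dict[str, List[str]] = {}
        # Stored data blocks, keyed by hash value.
        self.blocks: Dict[str, bytes] = {}

    def query(self, hash_value: str) -> QueryReply:
        if hash_value in self.hash_index:
            # Hit: return the associated data index along with the answer.
            return QueryReply(True, False, list(self.related.get(hash_value, [])))
        # Miss: record the hash value and ask the client for the block.
        self.hash_index[hash_value] = len(self.hash_index)
        self.related[hash_value] = []
        return QueryReply(False, True)

    def store_block(self, hash_value: str, block: bytes) -> None:
        """Store the block the client sends in response to a storage request."""
        self.blocks[hash_value] = block
```

In this sketch a miss adds the hash value immediately and relies on the client to follow up with `store_block`; a real implementation would need to decide what happens if the block never arrives.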

Suppose the query processing starts from the first data block of the input file and that server 210 has not recorded any data block of the input file. Client 220 first converts the first data block of the input file into the first hash value hash1 and sends a query request for hash1 to server 210. Because server 210 does not store the hash value of any data block of the input file, server 210 writes the received first hash value hash1 (and the first data block) to server 210. Likewise, the second data block (corresponding to the second hash value hash2) is written to server 210 by the same process. Server 210 determines from the context of the two data blocks that hash1 and hash2 are associated, and places hash2 in the associated data index 212 of hash1. Please refer to Fig. 4, which is a schematic diagram of recording the associated data index set of the present invention.

The hash values of the other data blocks are likewise written, in order, to the associated data index 212 of the first hash value hash1. In the present invention the capacity of associated data index 212 is limited. When the number of hash values in associated data index 212 reaches the threshold, server 210 can continue storing hash values in the next associated data index 212; it can also delete from associated data index 212 the hash value that was queried longest ago and record the most recently queried hash value in that associated data index 212.
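The replacement behaviour, in which the hash value queried longest ago is dropped to make room for the most recently queried one, resembles a least-recently-used policy. A minimal sketch, assuming a hypothetical class `BoundedAssociatedIndex` and an arbitrary capacity (neither is specified by the patent):

```python
from collections import OrderedDict
from typing import List


class BoundedAssociatedIndex:
    """Capacity-limited associated data index ordered by last query time."""

    def __init__(self, capacity: int = 10) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # Oldest-queried hash value first, most recently queried last.
        self._entries: "OrderedDict[str, None]" = OrderedDict()

    def touch(self, hash_value: str) -> None:
        """Record that hash_value was just queried."""
        if hash_value in self._entries:
            # Already present: mark it as the most recently queried entry.
            self._entries.move_to_end(hash_value)
            return
        if len(self._entries) >= self.capacity:
            # Threshold reached: delete the hash value queried longest ago.
            self._entries.popitem(last=False)
        self._entries[hash_value] = None

    def hashes(self) -> List[str]:
        return list(self._entries)
```

With a capacity of 3, touching `a`, `b`, `c`, then `a` again, then `d` leaves the index holding `c`, `a`, `d`: `b` was the entry queried longest ago when the threshold was reached.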

For example, if the maximum capacity of associated data index 212 is ten hash values, the associated index recorded for the first hash value hash1 holds the second hash value hash2 through the eleventh hash value hash11 (in other words, the ten consecutive data blocks after the first data block).

After the twelfth hash value hash12 is generated, server 210 stores hash12 in the associated data index 212 of the eleventh hash value hash11. In addition, if a hash value is associated with several other hash values at the same time, the associated data index 212 in which it is stored can be chosen according to the characteristics of the association, or a copy can be kept in every relevant associated data index 212.
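The example with a capacity of ten can be reproduced with a short sketch. The function name `build_associated_indexes` is hypothetical, and the sketch assumes that all hash values in the sequence are distinct and that association is purely sequential; the statistical adjustment mentioned elsewhere in the text is not modelled.

```python
from typing import Dict, List


def build_associated_indexes(hashes: List[str], capacity: int = 10) -> Dict[str, List[str]]:
    """Group consecutive hash values into capacity-limited associated data indexes.

    The first hash value owns the first index. Once an index is full, the last
    hash value placed in it becomes the owner of the next index.
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    indexes: Dict[str, List[str]] = {}
    if not hashes:
        return indexes
    owner = hashes[0]
    indexes[owner] = []
    for hash_value in hashes[1:]:
        if len(indexes[owner]) >= capacity:
            # Threshold reached: continue in the next associated data index.
            owner = indexes[owner][-1]
            indexes[owner] = []
        indexes[owner].append(hash_value)
    return indexes
```

Called with the twelve values `hash1` to `hash12`, the function puts `hash2` through `hash11` in the index of `hash1` and `hash12` in the index of `hash11`, matching the example in the text.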

The situation described above is one in which server 210 does not store the queried hash value. If the hash value is stored in server 210, server 210 returns the hash values in the corresponding associated data index 212 to client 220 along with it. Continuing the example above: when client 220 wants to query the fifth data block (that is, the fifth hash value hash5), hash5 is held in the associated data index 212 corresponding to hash1 on server 210. Therefore, in addition to returning the queried hash5 to client 220, server 210 also sends the associated data index 212 of hash1 to client 220.

After receiving the associated data index, client 220 stores it in memory. The next time client 220 queries the hash value of a data block, it first checks whether the hash value to be queried already exists in the received associated data index 212. If it does, the hash value is obtained from associated data index 212. Because the queried data blocks are likely to be consecutive, associated data index 212 effectively reduces the access time between client 220 and server 210 and thereby improves access efficiency. Otherwise, if the hash value does not exist in the associated data index 212 received by client 220, client 220 again performs the hash value query of steps S330 to S360 against server 210.
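The client-side lookup order (local associated data index first, server only on a local miss) can be sketched as follows. The class `CachingClient`, the `query_server` callable and its `(found, related)` return shape are assumptions for illustration; the patent does not define a wire format.

```python
from typing import Callable, List, Set, Tuple


class CachingClient:
    """Hypothetical client that keeps received associated data indexes in memory."""

    def __init__(self, query_server: Callable[[str], Tuple[bool, List[str]]]) -> None:
        # query_server(hash_value) -> (found, related hash values); assumed interface.
        self._query_server = query_server
        self._known: Set[str] = set()
        self.round_trips = 0  # number of requests actually sent to the server

    def exists(self, hash_value: str) -> bool:
        # First look in the associated data indexes already held in local memory.
        if hash_value in self._known:
            return True
        # Local miss: fall back to a query against the server.
        self.round_trips += 1
        found, related = self._query_server(hash_value)
        if found:
            self._known.add(hash_value)
            self._known.update(related)
        return found
```

If the server answers a query for `hash5` with the index that also contains `hash6`, the later lookup of `hash6` is served from memory and `round_trips` stays at 1, which is the saving the paragraph above describes.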

Because associated data index 212 reflects the association between data blocks (that is, their sequential order), and because server 210 continually adjusts associated data index 212 according to statistical information during use, the hit rate of queries by client 220 in local memory can be guaranteed to a certain extent. Server 210 can obtain a large number of related records for the cost of a single access to the slow storage device, which greatly reduces the problem of repeated query requests from client 220 forcing server 210 to read the slow storage device again and again. At the same time, sending the data index set in a single network transfer also reduces the time spent on request/acknowledgment round trips over the network.

Of course, the present invention may have various other embodiments. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art can make various corresponding changes and modifications according to the present invention, but all such changes and modifications shall fall within the protection scope of the appended claims.

Claims

1. A data block query method supporting a deduplication program, applied to querying a plurality of data blocks generated by a deduplication program, characterized in that the data block query method supporting the deduplication program includes the following steps:

storing a hash index list in a server, and recording multiple sets of hash values in the hash index list;

a client loads an input file, and generates the data blocks corresponding to the input file and the hash value corresponding to each of the data blocks;

the client sends a query request to the server, records the hash values of the corresponding data blocks in the query request, and asks the server whether it holds the same hash value;

when the hash value is not stored in the hash index list of the server, the server sends a storage request to the client so that the data block corresponding to the hash value is transmitted to the server for storage, and the server sequentially adds the received hash value to the hash index list;

establishing a corresponding associated data index list for the hash value in the hash index list, and recording other hash values related to the hash value in the associated data index list;

when the hash value is stored in the server, the server returns the hash values in the corresponding associated data index list to the client according to the hash value;

when the client next queries the hash value of a data block, the client checks whether the hash value already exists in the received associated data index list;

when the hash value already exists in the associated data index list received by the client, the hash value is obtained from the associated data index list; and

when the hash value does not exist in the associated data index list received by the client, the client queries the server for the hash value.

2. The data block query method supporting deduplication program according to claim 1, wherein the data blocks are generated according to a fixed-length method or a content-based variable-length segmentation method.

3. The data block query method supporting deduplication program according to claim 1, characterized in that, when the number of the hash values in the associated data index list meets a threshold value, the hash value that was queried longest ago is deleted from the associated data index list, and the most recently queried hash value is recorded in the associated data index list.

4. The data block query method supporting deduplication program according to claim 1, characterized in that, when the number of the hash values in the associated data index list meets a threshold value, the server continues to store the hash value in the next associated data index list.