CN102467572A - Data block inquiring method for supporting data de-duplication program - Google Patents
- Publication number
- CN102467572A
- Authority
- CN
- China
- Prior art keywords
- cryptographic hash
- service end
- block
- hash
- client
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a data block query method supporting a data de-duplication program, which improves the speed at which the de-duplication program queries data blocks. The method comprises the following steps: storing a hash index list on a server; generating, by a client, data blocks and their hash values from an input file; sending, by the client, a query request to the server, the query request recording the hash value of the corresponding data block; when the server does not hold the hash value, sending, by the server, a storage request to the client and adding the received hash value to the hash index list; building an associated data index corresponding to the hash index list and recording in it the data block information associated with the hash value; and, when the server holds the hash value, returning to the client, according to that hash value, the hash values in the corresponding associated data index.
Description
Technical field
The present invention relates to a method for querying data blocks, and in particular to a data block query method that supports a data de-duplication program.
Background
Data de-duplication is a data reduction technique, generally used in disk-based backup systems, whose main purpose is to reduce the storage capacity used in a storage system. It works by searching, within a certain period of time, for duplicate variable-size data blocks at different locations in different files; duplicate blocks are replaced with pointers. Storage systems are always full of redundant data. To address this and save more space, de-duplication has naturally become a focus of attention. De-duplication can reduce stored data to 1/20 of its original size, freeing up more backup space. This not only allows backup data to be kept on the storage system for longer, but also saves much of the bandwidth required for offline storage.
To preserve data integrity, the input file is cut into pieces during de-duplication. Cutting an input file produces multiple data blocks. To manage the data blocks effectively, an index file is used during cutting to record the storage information of every data block.
After the client has cut the whole input file (into fixed-length or variable-length blocks), it immediately generates the hash value of each data block. The client then sends a query request to the server, using a hash value to ask whether the server already holds an identical hash value. The server looks up each query request in its hash index table and returns the query result over the network. Refer to Fig. 1, which is a schematic diagram of data block querying in the prior art.
When the amount of data queried by the client 110 is very large, the hash index table grows sharply with it, and the server 120 may not have enough memory to hold the hash index table. Queries against the hash index table then have to access slow, file-based storage devices, which greatly slows down the whole system.
Summary of the invention
In view of the above problem, the technical problem to be solved by the present invention is to provide a data block query method supporting a data de-duplication program, which is applied to the multiple data blocks produced by the de-duplication program and to the processing of queries on those data blocks, thereby improving the query speed for data blocks.
To achieve the above object, the data block query method supporting a data de-duplication program disclosed by the present invention comprises the following steps: storing a hash index list on a server, the hash index list recording multiple hash values; loading an input file on a client and generating the data blocks of the input file and the hash value of each data block; sending, by the client, a query request to the server, the query request recording the hash value of the corresponding data block in order to ask the server whether it holds an identical hash value; if the hash value is not stored in the hash index list of the server, sending, by the server, a storage request to the client so that the data block corresponding to the hash value is sent to and stored on the server, the server adding the received hash values to the hash index list in order; building a corresponding associated data index for the hash value in the hash index list and recording in the associated data index the other hash values related to that hash value; if the hash value is stored on the server, returning, by the server according to the hash value, the hash values in the corresponding associated data index to the client together; the next time the client queries the hash value of a data block, checking, by the client, whether the hash value already exists in the associated data index it received; if the hash value exists in the associated data index received by the client, obtaining from the associated data index the hash value information or the descriptive information of the data block related to the hash value (for example, the number of times the data block has been referenced, which can be incremented as references require); and if the hash value does not exist in the associated data index received by the client, querying, by the client, the server for the hash value.
Because the associated data index reflects the relationship between data blocks (their order of occurrence), and because the server can continually adjust the associated data index according to statistical information during use, the hit rate of client queries in local memory can be guaranteed to a certain extent. The server can obtain a large number of related records at the cost of a single access to the slow storage device, which greatly reduces the problem of repeated client query requests forcing the server to read the slow storage device again and again. Sending the associated data index set in a single network transfer also reduces the time spent on request/acknowledgement round trips over the network.
The present invention is described below with reference to the accompanying drawings and specific embodiments, which are not intended to limit the invention.
Description of drawings
Fig. 1 is a schematic diagram of data block querying in the prior art;
Fig. 2 is an architecture diagram of the present invention;
Fig. 3 is a flow chart of the operation of the present invention;
Fig. 4 is a schematic diagram of recording the associated data index set of the present invention.
Reference numerals:
110 client
120 server
210 server
211 hash index list
212 associated data index
220 client
Embodiment
The structural and working principles of the present invention are described in detail below with reference to the accompanying drawings:
Refer to Fig. 2, which is an architecture diagram of the present invention. The present invention includes a server 210 and a client 220. The client 220 can connect to the server 210 over the Internet or a corporate intranet; the client 220 and the server 210 may also run on the same computer. The server 210 further includes a hash index list 211, which records multiple hash values. When the client 220 sends the server 210 a query request for the hash value of a data block of an input file, the server 210 performs the query according to the contents of the hash index list 211 in the following manner. Refer to Fig. 3, which is a flow chart of the operation of the present invention.
Step S310: a hash index list is stored on the server, the hash index list recording multiple hash values;
Step S320: the client loads an input file and generates the data blocks of the input file and the hash value of each data block;
Step S330: the client sends a query request to the server, the query request recording the hash value of the corresponding data block in order to ask the server whether it holds an identical hash value;
Step S340: if the hash value is not stored in the hash index list of the server, the server sends a storage request to the client so that the data block corresponding to the hash value is sent to and stored on the server, and the server adds the received hash values to the hash index list in order;
Step S350: a corresponding associated data index is built for the hash value in the hash index list, and the other hash values related to that hash value are recorded in the associated data index; and
Step S360: if the hash value is stored on the server, the server returns, according to the hash value, the hash values in the corresponding associated data index to the client together.
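The server side of steps S310 to S360 can be sketched roughly as follows. This is a minimal in-memory illustration, not the patented implementation: the class `Server`, its fields, and the rule that hashes stored one after another count as "related" are all assumptions made for the example, and the capacity limit on the associated data index is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Server:
    """Toy in-memory server. Every name here is hypothetical."""
    hash_index: Dict[str, bytes] = field(default_factory=dict)      # S310: hash -> block
    associated: Dict[str, List[str]] = field(default_factory=dict)  # S350: owner -> related hashes
    owner_of: Dict[str, str] = field(default_factory=dict)          # member hash -> owner hash
    open_owner: Optional[str] = None                                # owner of the index being filled

    def query(self, h: str) -> dict:
        """S330/S360: report whether h is known; on a hit, attach the related hashes."""
        if h not in self.hash_index:
            # S340: a miss doubles as the storage request to the client.
            return {"found": False, "associated": []}
        owner = self.owner_of.get(h, h)
        return {"found": True, "associated": list(self.associated.get(owner, []))}

    def store(self, h: str, block: bytes) -> None:
        """S340/S350: keep the block, add its hash to the list, link it to its neighbours."""
        if h in self.hash_index:
            return  # already stored, nothing to do
        self.hash_index[h] = block
        if self.open_owner is None:
            self.open_owner = h
            self.associated[h] = []
        else:
            # Assumption: hashes stored one after another are "related".
            self.associated[self.open_owner].append(h)
            self.owner_of[h] = self.open_owner
```

A client would call `query` first and call `store` only when the reply says the hash was not found.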
Once the input file is loaded on the client 220, the client 220 cuts the input file, producing the data blocks of the input file and the hash value of each data block. The hash value may be computed with, but is not limited to, SHA-1 or MD5. The data blocks are produced either by fixed-size partitioning or by content-defined chunking (CDC), which yields variable-length blocks. The fixed-length cutting algorithm cuts the input file using a predefined block size; its advantages are simplicity and high performance. The content-defined cutting algorithm is a variable-length chunking algorithm that uses data fingerprints (such as Rabin fingerprints) to split a file into blocks of differing lengths. Unlike the fixed-length algorithm, it cuts blocks based on file content, so the block size is variable.
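The two cutting strategies and the per-block hashing can be illustrated with a short sketch. The function names and all parameter values below are assumptions chosen for the example. The content-defined variant uses a plain rolling byte sum as a stand-in for a Rabin fingerprint, so it shows the idea of content-based boundaries rather than a production fingerprint.

```python
import hashlib


def fixed_size_chunks(data: bytes, size: int = 4096):
    """Split data into blocks of `size` bytes (the last block may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, window: int = 16, mask: int = 0x3F,
                           min_size: int = 32, max_size: int = 1024):
    """Cut where a rolling sum over the last `window` bytes matches a bit pattern.

    The rolling sum is a simplification; the text names Rabin fingerprints.
    """
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i - start >= window:
            rolling -= data[i - window]      # slide the window forward
        length = i - start + 1
        at_boundary = length >= min_size and (rolling & mask) == mask
        if at_boundary or length >= max_size:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])          # trailing partial block
    return chunks


def block_hashes(blocks):
    """SHA-1 hex digest of each block (MD5 would work the same way)."""
    return [hashlib.sha1(block).hexdigest() for block in blocks]
```

Because the content-defined boundaries depend only on nearby bytes, inserting data near the start of a file tends to shift only the blocks around the insertion, whereas fixed-size cutting shifts every later block.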
Next, the client 220 sends a query request to the server 210, the query request recording the hash value of the corresponding data block in order to ask the server 210 whether it holds an identical hash value. If the hash value is not stored in the hash index list 211 of the server 210, the server 210 sends a storage request to the client 220 so that the data block corresponding to the hash value is sent to and stored on the server 210, and the server 210 adds the received hash values to the hash index list 211 in order. A corresponding associated data index 212 is built for the hash value in the hash index list 211, and information about the data blocks related to the hash value is recorded in the associated data index 212. For example, the associated data index 212 may store the hash values of data blocks, the numbers of the data blocks, or index information for the storage locations of the data blocks.
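One possible shape for an entry in the associated data index is sketched below. The field names, the location string format, and the `add_reference` helper are hypothetical; the text only says that an entry may carry a hash value, a block number, or storage-location index information.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AssociatedEntry:
    hash_value: str                      # hash value of the related data block
    block_number: Optional[int] = None   # block number, if recorded
    location: Optional[str] = None       # index information for where the block is stored
    ref_count: int = 0                   # hypothetical counter of references to the block

    def add_reference(self) -> int:
        """Increment and return the reference counter."""
        self.ref_count += 1
        return self.ref_count


# hash1's associated data index holding one entry for hash2 (made-up values).
associated_index = {
    "hash1": [AssociatedEntry("hash2", block_number=2, location="store-0:4096")],
}
```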
Suppose the query process starts from the first data block of the input file, and the server 210 has not recorded any data block of the input file. The client 220 first converts the first data block of the input file into a first hash value hash1 and sends the server 210 a query request for hash1. Because the server 210 holds no hash value of any data block of the input file, the server 210 writes the received first hash value hash1 (and the first data block) to the server 210. Likewise, the second data block (corresponding to a second hash value hash2) is written to the server 210 by the same process. The server 210 determines from the context of the two data blocks that hash1 and hash2 are related, and places hash2 in the associated data index 212 of hash1. Refer to Fig. 4, which is a schematic diagram of recording the associated data index set of the present invention.
The hash values of the other data blocks are likewise written, in order, to the associated data index 212 of the first hash value hash1. In the present invention the capacity of the associated data index 212 is limited. When the number of hash values in the associated data index 212 reaches a threshold, the server 210 may either continue storing hash values in the next associated data index 212, or delete from the associated data index 212 the hash value that has gone longest without being queried and record the most recently queried hash value in its place.
For example, if the maximum capacity of the associated data index 212 is 10 hash values, the associated index recorded for the first hash value hash1 holds the second hash value hash2 through the eleventh hash value hash11 (in other words, the ten consecutive data blocks after the first data block).
After the twelfth hash value hash12 is produced, the server 210 stores hash12 in the associated data index 212 of the eleventh hash value hash11. In addition, if a hash value is related to several other hash values at the same time, the correlation characteristics can be used to decide in which hash value's associated data index 212 it is stored, or a copy can be kept in every related associated data index 212.
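The two ways of handling a full associated data index can be sketched as follows, under stated assumptions. `build_indexes` models the spill-over case using the capacity-10 example from the text; `EvictingIndex` models the replacement case as a least-recently-queried cache. Both names and the choice of `OrderedDict` are illustrative, not taken from the patent.

```python
from collections import OrderedDict


def build_indexes(hashes, capacity):
    """Spill-over: when the open index is full, its last member owns the next index."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    indexes = OrderedDict()   # owner hash -> list of associated hashes
    owner = None
    for h in hashes:
        if owner is None:
            owner = h
            indexes[owner] = []
        elif len(indexes[owner]) < capacity:
            indexes[owner].append(h)
        else:
            owner = indexes[owner][-1]   # e.g. hash11 once hash1's index is full
            indexes[owner] = [h]
    return indexes


class EvictingIndex:
    """Replacement: drop the entry that has gone longest without being queried."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()     # hash -> None, least recently queried first

    def touch(self, h):
        """Record that h was just queried (or newly added)."""
        if h in self.entries:
            self.entries.move_to_end(h)
            return
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the least recently queried hash
        self.entries[h] = None


indexes = build_indexes(["hash%d" % i for i in range(1, 13)], capacity=10)
# indexes["hash1"] holds hash2..hash11 and indexes["hash11"] holds hash12.
```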
The situation described above is one in which the server 210 does not hold the queried hash value. If the hash value is stored on the server 210, the server 210 returns, according to the hash value, the hash values in the corresponding associated data index 212 to the client 220 together. Continuing the example: when the client 220 wants to query the fifth data block (that is, the fifth hash value hash5), hash5 is held in the associated data index 212 of the first hash value hash1 on the server 210. So in addition to returning the queried hash5 to the client 220, the server 210 also sends the associated data index 212 of hash1 to the client 220.
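The client side of this exchange, where later queries are answered from the associated data index already received, might look like the sketch below. `Client`, `is_stored`, and `fake_server_query` are hypothetical names, and the stand-in server simply hard-codes the hash1 example from the text.

```python
class Client:
    """Checks the locally held associated hashes before asking the server."""

    def __init__(self, server_query):
        self.server_query = server_query   # callable: hash -> (found, associated hashes)
        self.local_index = set()           # hashes learned from earlier replies
        self.server_queries = 0            # number of requests actually sent

    def is_stored(self, h):
        if h in self.local_index:
            return True                    # answered locally, no network request
        self.server_queries += 1
        found, associated = self.server_query(h)
        if found:
            self.local_index.add(h)
            self.local_index.update(associated)
        return found


def fake_server_query(h):
    """Stand-in server: hash1's associated index holds hash2..hash11."""
    index = ["hash%d" % i for i in range(2, 12)]
    if h == "hash1" or h in index:
        return True, index
    return False, []


client = Client(fake_server_query)
client.is_stored("hash5")   # goes to the server and brings back hash2..hash11
client.is_stored("hash6")   # answered from the local copy
```

In this run only the first call reaches the server, which is the effect the associated data index is meant to produce.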
Because the associated data index 212 reflects the relationship between data blocks (that is, their order of occurrence), and because the server 210 can continually adjust the associated data index 212 according to statistical information during use, the hit rate of queries by the client 220 in local memory can be guaranteed to a certain extent. The server 210 can obtain a large number of related records at the cost of a single access to the slow storage device, which greatly reduces the problem of repeated query requests from the client 220 forcing the server 210 to read the slow storage device again and again. Sending the associated data index set in a single network transfer also reduces the time spent on request/acknowledgement round trips over the network.
Of course, the present invention may have various other embodiments. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art can make various corresponding changes and modifications according to the present invention, but all such changes and modifications shall fall within the protection scope of the appended claims.
Claims (4)
1. A data block query method supporting a data de-duplication program, applied to multiple data blocks produced by a data de-duplication program and to the processing of queries on the data blocks, characterized in that the method comprises the following steps:
storing a hash index list on a server, the hash index list recording multiple hash values;
loading, by a client, an input file and generating the data blocks of the input file and the hash value of each data block;
sending, by the client, a query request to the server, the query request recording the hash value of the corresponding data block in order to ask the server whether it holds an identical hash value;
if the hash value is not stored in the hash index list of the server, sending, by the server, a storage request to the client so that the data block corresponding to the hash value is sent to and stored on the server, the server adding the received hash values to the hash index list in order;
building a corresponding associated data index for the hash value in the hash index list, and recording in the associated data index the other hash values related to that hash value;
if the hash value is stored on the server, returning, by the server according to the hash value, the hash values in the corresponding associated data index to the client together;
the next time the client queries the hash value of a data block, checking, by the client, whether the hash value already exists in the associated data index it received;
if the hash value exists in the associated data index received by the client, obtaining the hash value from the associated data index; and
if the hash value does not exist in the associated data index received by the client, querying, by the client, the server for the hash value.
2. The data block query method supporting a data de-duplication program according to claim 1, characterized in that the data blocks are generated by fixed-size partitioning or by content-defined variable-length chunking.
3. The data block query method supporting a data de-duplication program according to claim 1, characterized in that, when the number of hash values in the associated data index reaches a threshold, the hash value that has gone longest without being queried is deleted from the associated data index, and the most recently queried hash value is recorded in the associated data index.
4. The data block query method supporting a data de-duplication program according to claim 1, characterized in that, when the number of hash values in the associated data index reaches a threshold, the server continues storing hash values in the next associated data index.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201010576146 CN102467572B (en) | 2010-11-17 | 2010-11-17 | Data block inquiring method for supporting data de-duplication program |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102467572A (en) | 2012-05-23 |
CN102467572B (en) | 2013-10-02 |
Family
ID=46071215
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201010576146 Expired - Fee Related CN102467572B (en) | 2010-11-17 | 2010-11-17 | Data block inquiring method for supporting data de-duplication program |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102467572B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2006215735A (en) * | 2005-02-02 | 2006-08-17 | Mitsubishi Electric Corp | Duplicate website detection device |
CN101706825A (en) * | 2009-12-10 | 2010-05-12 | 华中科技大学 | Replicated data deleting method based on file content types |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102915278A (en) * | 2012-09-19 | 2013-02-06 | 浪潮(北京)电子信息产业有限公司 | Data deduplication method |
CN102930004A (en) * | 2012-10-29 | 2013-02-13 | 华为技术有限公司 | Hash value storage method, device and chip |
CN102930004B (en) * | 2012-10-29 | 2015-07-08 | 华为技术有限公司 | Hash value storage method, device and chip |
WO2014067063A1 (en) * | 2012-10-30 | 2014-05-08 | 华为技术有限公司 | Duplicate data retrieval method and device |
CN102968507A (en) * | 2012-12-14 | 2013-03-13 | 中国银行股份有限公司 | Cache table based data query method |
CN103473298B (en) * | 2013-09-04 | 2017-01-11 | 华为技术有限公司 | Data archiving method and device and storage system |
CN103473298A (en) * | 2013-09-04 | 2013-12-25 | 华为技术有限公司 | Data archiving method and device and storage system |
CN105706041A (en) * | 2013-10-16 | 2016-06-22 | 网络装置公司 | Technique for global deduplication across datacenters with minimal coordination |
US11775503B2 (en) | 2013-10-16 | 2023-10-03 | Netapp, Inc. | Technique for global deduplication across datacenters with minimal coordination |
US11301455B2 (en) | 2013-10-16 | 2022-04-12 | Netapp, Inc. | Technique for global deduplication across datacenters with minimal coordination |
CN105706041B (en) * | 2013-10-16 | 2019-07-19 | Netapp股份有限公司 | For carrying out the technology of global duplicate removal between the data center with minimum cooperation |
US10685013B2 (en) | 2013-10-16 | 2020-06-16 | Netapp Inc. | Technique for global deduplication across datacenters with minimal coordination |
CN105917304A (en) * | 2014-12-09 | 2016-08-31 | 华为技术有限公司 | Apparatus and method for de-duplication of data |
CN106815260B (en) * | 2015-12-01 | 2021-05-04 | 阿里巴巴集团控股有限公司 | Index establishing method and equipment |
CN106815260A (en) * | 2015-12-01 | 2017-06-09 | 阿里巴巴集团控股有限公司 | A kind of index establishing method and equipment |
WO2018165963A1 (en) * | 2017-03-17 | 2018-09-20 | 深圳市秀趣品牌文化传播有限公司 | E-commerce data redundancy processing system and method |
CN110008249A (en) * | 2019-01-31 | 2019-07-12 | 阿里巴巴集团控股有限公司 | A kind of time-based data query method, device and equipment |
CN109902086A (en) * | 2019-01-31 | 2019-06-18 | 阿里巴巴集团控股有限公司 | A kind of index creation method, device and equipment |
CN109902086B (en) * | 2019-01-31 | 2022-12-20 | 创新先进技术有限公司 | Index creation method, device and equipment |
CN110008249B (en) * | 2019-01-31 | 2023-08-08 | 创新先进技术有限公司 | Time-based data query method, device and equipment |
CN112817920A (en) * | 2021-03-03 | 2021-05-18 | 深圳市知小兵科技有限公司 | Distributed big data cleaning method |
CN114647658A (en) * | 2022-03-30 | 2022-06-21 | 新华三信息技术有限公司 | Data retrieval method, device, equipment and machine-readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN102467572B (en) | 2013-10-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
C41 | Transfer of patent application or patent right or utility model | ||
TR01 | Transfer of patent right | Effective date of registration: 20160919. Address after: 518000, JINGWAH Road, road, Futian District, Guangdong, Shenzhen Province, room 605. Patentee after: Shenzhen excellent Clothing Co., Ltd. Address before: Taipei City, Taiwan, China. Patentee before: Inventec Corporation |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20131002. Termination date: 20201117 |