CN103970852A

CN103970852A - Data deduplication method of backup server

Info

Publication number: CN103970852A
Application number: CN201410186755.5A
Authority: CN
Inventors: 付丽莉; 于建彬
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: Inspur Electronic Information Industry Co Ltd
Priority date: 2014-05-06
Filing date: 2014-05-06
Publication date: 2014-08-06

Abstract

The invention provides a data deduplication method for a backup server, comprising the following operations: a partition is created on a logical volume formed from the backup server's hard disks to temporarily store the data to be backed up on the server, so that the partition acts as a data cache; the cached data in the partition is divided into blocks, each block is treated as a unit of calculation, and its hash value is calculated and stored in a database; the hash values are then compared, that is, the hash value of the new data in each backup is compared with the hash values stored in the database. If the hash values are the same, the data is duplicate data; if not, the data block is non-duplicate data and its hash value is added to the hash library. Compared with the prior art, the method effectively removes duplicate data and greatly saves storage space, thereby reducing the operating costs of enterprises.

Description

A data deduplication method for a backup server

Technical field

The present invention relates to the field of computer technology, and specifically to a data deduplication method for a backup server that effectively improves system running speed and quality.

Background technology

The core idea of data deduplication is this: when data is stored, it is checked against the data that already exists; if the two are identical, the duplicate copy is filtered out of the backup and the existing data is referenced through a pointer. Data deduplication is currently a popular research topic in the storage field, because it brings many clear benefits to the whole storage system and even the whole enterprise. It can fundamentally reduce the storage space occupied and the number of disk drives a user needs, lowering expenditure on manpower, energy, and electrical power, and thereby significantly reducing storage costs. In addition, data deduplication reduces the amount of data transmitted over the network, which lowers energy consumption and network costs and saves a large amount of network bandwidth for data replication.

When a backup program repeatedly backs up the same file from the same directory over the network, or backs up the same file from multiple addresses, duplicate data is backed up into the temporary area. The amount of duplicate data on most networks is striking. Take, for example, a backup server that backs up the mail of a company's 100 employees. A considerable number of these 100 employees receive identical mail every day; if the same message is saved by 80 people, the backup server holds 80 copies of it, one from each of the 80 users. As users' daily mail volume grows, the number of duplicate messages grows as well. A single backup server often receives backup requests for duplicate data, and removing the duplicates naturally reduces the storage space occupied considerably.

To deduplicate data, an enterprise sometimes has to assign dedicated staff to the deletion work. The deletion process is tedious and error-prone, and can easily cause backup data to be lost. How to complete deduplication on a backup server automatically, without losing data, has therefore become the direction of future development. For small enterprises in particular, saving storage space can greatly reduce operating costs. On this basis, a method that can effectively deduplicate data on a backup server is now provided.

Summary of the invention

The technical task of the present invention is to address the deficiencies of the prior art by providing a data deduplication method for a backup server that reduces enterprise operating costs and saves storage space.

The technical solution of the present invention is implemented as follows. The operation process of the data deduplication method for a backup server is:

Step 1: on the logical volume formed from the hard disks of the backup server, a partition is created to temporarily store the data to be backed up on the server; this partition acts as a data cache;

Step 2: the data cached in the above partition is split into blocks, each block is treated as a unit of calculation, and its hash value is calculated and stored in a database;

Step 3: the hash values are compared, that is, the hash value of the new data in each backup is compared with the hash values already stored in the library. If they are identical, the data is duplicate data; if not, the data is non-duplicate data, and the hash value of the data block is added to the hash library.
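The three steps above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `deduplicate`, the 4-byte block size, the in-memory `set` used as the hash library, and the choice of MD5 for the block hash are all assumptions made for illustration.

```python
import hashlib

def deduplicate(data, block_size, hash_library):
    """Split data into fixed-size blocks and classify each block.

    Returns (new_blocks, duplicate_count). hash_library is a set of hex
    digests that is updated in place (a stand-in for the hash database).
    """
    new_blocks = []
    duplicates = 0
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.md5(block).hexdigest()
        if digest in hash_library:
            # Same hash already recorded: treat the block as duplicate data
            duplicates += 1
        else:
            # Unseen hash: non-duplicate data, so record its hash
            hash_library.add(digest)
            new_blocks.append(block)
    return new_blocks, duplicates

library = set()
first_new, first_dup = deduplicate(b"AAAABBBBAAAA", 4, library)
second_new, second_dup = deduplicate(b"BBBBCCCC", 4, library)
```

In this toy run the second backup contributes only the block `CCCC`, because `BBBB` was already recorded during the first backup.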

The logical volume in step 1 is a RAID 5 logical volume built from three or more hard disks. In use, this logical volume first stores all of the data to be backed up; then, according to the judgment in step 3, the non-duplicate data in the cache is saved on the server and the duplicate data is replaced with pointers. After the duplicate data has been deleted, the logical volume is quick-formatted and waits for the user's next backup.
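The cache life cycle described here (stage everything, keep only non-duplicate blocks, then clear the cache) might be sketched as follows. This is only an illustration under stated assumptions: a temporary directory stands in for the cache partition, deleting the staged files stands in for quick-formatting the logical volume, the name `drain_cache` is hypothetical, and pointer bookkeeping for duplicates is omitted.

```python
import hashlib
import tempfile
from pathlib import Path

def drain_cache(cache_dir, hash_library, unique_blocks, block_size=4):
    """Process every staged file, keep unseen blocks, then empty the cache."""
    for path in sorted(cache_dir.iterdir()):
        data = path.read_bytes()
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            digest = hashlib.md5(block).hexdigest()
            if digest not in hash_library:
                hash_library.add(digest)
                unique_blocks.append(block)
        # Removing the staged file stands in for quick-formatting the cache
        path.unlink()

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    # Hypothetical staged backup files
    (cache / "a.bin").write_bytes(b"AAAABBBB")
    (cache / "b.bin").write_bytes(b"BBBBCCCC")
    library, unique = set(), []
    drain_cache(cache, library, unique)
    remaining = list(cache.iterdir())
```

After the call the cache directory is empty and ready for the next backup, while `unique` holds each distinct block once.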

The data splitting in step 2 proceeds as follows: a block size is defined first, the file is then split according to the defined block size, and the hash function values are calculated. The hash function values are a weak checksum and an MD5 strong checksum. The weak checksum is calculated first and used for a hash lookup; if a match is found, the MD5 strong checksum is calculated and used for a further hash lookup.
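A two-level lookup of this kind could look like the sketch below. The source does not name the weak checksum, so Adler-32 (`zlib.adler32`) is an assumption, as are the class name `TwoTierIndex` and the choice to defer the MD5 of a stored block until a weak match first occurs; a real system would more likely persist both values alongside the block.

```python
import hashlib
import zlib

class TwoTierIndex:
    """Weak-checksum lookup first, MD5 only when the weak checksum matches."""

    def __init__(self):
        # weak checksum -> list of [block, MD5 digest or None if not yet computed]
        self.by_weak = {}
        self.md5_calls = 0

    def _md5(self, block):
        self.md5_calls += 1
        return hashlib.md5(block).digest()

    def seen_before(self, block):
        weak = zlib.adler32(block)
        entries = self.by_weak.setdefault(weak, [])
        if not entries:
            # Weak miss: the block is new, so no MD5 is computed for it yet
            entries.append([block, None])
            return False
        # Weak hit: confirm or reject the match with the strong checksum
        strong = self._md5(block)
        for entry in entries:
            if entry[1] is None:
                entry[1] = self._md5(entry[0])
            if entry[1] == strong:
                return True
        entries.append([block, strong])
        return False

index = TwoTierIndex()
results = [index.seen_before(b) for b in (b"AAAA", b"BBBB", b"AAAA")]
```

In this run MD5 is computed only for the repeated block and the stored block it matches; the distinct block `BBBB` is dismissed by the cheaper weak checksum alone.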

Step 3 proceeds as follows: the hash function values calculated in step 2 form a hash function value library, which is stored separately at a fixed location in the partition. The hash value of the new data in each backup is compared with the hash values in this library. If a data block's hash value is identical to one in the library, a pointer is saved that points to the storage location of the duplicate data; if the data block's hash value differs, the block is non-duplicate data, the data block is saved in a unique data area of the backup server, and its hash value is added to the hash library.

The beneficial effects of the present invention compared with the prior art are:

The data deduplication method for a backup server of the present invention replaces duplicate data with pointers while keeping the data logically complete. It effectively removes duplicate data and greatly saves storage space, thereby reducing the operating costs of enterprises. It is widely applicable and is particularly suited to backup applications in small and medium-sized enterprises, such as mail backup systems and general-purpose data backup systems, where multi-user backup tasks contain some duplicate data. It effectively deletes the duplicate data and keeps only the unique data, which saves disk space and thus reduces costs. The method is practical and easy to adopt.

Brief description of the drawings

Figure 1 is a schematic diagram of the backup server modules of the present invention;

Figure 2 is a schematic diagram of the data splitting step of the present invention;

Figure 3 is a schematic diagram of the hash comparison step of the present invention;

Figure 4 is a schematic diagram of an embodiment using a prior-art backup;

Figure 5 is a schematic diagram of an embodiment using the backup of the present invention.

Embodiment

The data deduplication method for a backup server of the present invention is described in detail below with reference to the accompanying drawings.

A data deduplication method for a backup server is now provided. The basis for implementing it, the backup server, is built first. As shown in Figure 1, the server has four modules: a data cache module, a data splitting module, a hash comparison module, and a unique data storage module.

The data cache module is built by creating a partition on the RAID 5 logical volume to temporarily store the data to be backed up on the server. The size of this partition can be determined from the user's backup habits (the amount of data in a single backup).

The data splitting module splits the data in the data cache module into blocks and treats each block as a unit of calculation; each block's hash value is compared with the data in the hash set of the comparison module described below.

The hash comparison module works on the blocks split from the data in the data cache partition. Taking each block as a unit of calculation, it calculates the block's hash function value (usually with MD5 or SHA-1) and organizes these values into a hash function value library, which is stored separately at a location in the partition. The hash of the new data in each backup is compared with the hashes in the library. If the values are the same, the data is duplicate data; otherwise it is non-duplicate data, the data block must be saved in the unique data area of the server, and its hash value is also added to the hash library.

The unique data module works as follows: after a backup request is made, the data is first stored in the cache module, then passes to the data splitting module, and, once the hash values have been calculated, passes to the hash comparison module. If the hash values are the same, the data is duplicate data; otherwise it is non-duplicate data, that is, unique data, and must be saved on the designated partition of the server.
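The hash function value library kept by the hash comparison module could, for example, be held in a small database table. The following sketch uses Python's built-in `sqlite3` with an in-memory database and SHA-1 digests; the table name `hash_library` and the function names are hypothetical, and the source does not say which database is used.

```python
import hashlib
import sqlite3

def open_hash_library(path=":memory:"):
    """Open (or create) a database holding one row per known block hash."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hash_library (digest TEXT PRIMARY KEY)"
    )
    return conn

def is_duplicate(conn, block):
    """Return True if the block's hash is already stored; otherwise store it."""
    digest = hashlib.sha1(block).hexdigest()
    row = conn.execute(
        "SELECT 1 FROM hash_library WHERE digest = ?", (digest,)
    ).fetchone()
    if row is not None:
        return True
    conn.execute("INSERT INTO hash_library (digest) VALUES (?)", (digest,))
    conn.commit()
    return False

conn = open_hash_library()
results = [is_duplicate(conn, b) for b in (b"alpha", b"beta", b"alpha")]
stored = conn.execute("SELECT COUNT(*) FROM hash_library").fetchone()[0]
```

Passing a file path instead of `":memory:"` would keep the library across backup runs, which corresponds to storing it at a fixed location.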

Further, the data deduplication method is carried out among the four modules of the backup server described above. As shown in Figures 2 and 3, it is implemented as follows:

Step 1: a RAID 5 logical volume is set up on the backup server from three or more hard disks. In use, this logical volume first stores all of the data to be backed up; then, according to the judgment in step 3, the non-duplicate data in the cache is saved on the server and the duplicate data is replaced with pointers. After the duplicate data has been deleted, the logical volume is quick-formatted and waits for the user's next backup;

Step 2: the data held in the data cache is split into blocks. The file is split using the predefined block size, and the weak checksum and the MD5 strong checksum are calculated. The weak checksum mainly serves to improve the performance of differential encoding: the weak checksum is calculated first and used for a hash lookup, and only if a match is found is the MD5 strong checksum calculated and used for a further hash lookup. Because the weak checksum is far cheaper to compute than MD5, this effectively improves encoding performance;

Step 3: the hash function values calculated in step 2 form a hash function value library, which is stored separately at a fixed location in the partition. The hash value of the new data in each backup is compared with the hash values in this library. If a data block's hash value is identical to one in the library, a pointer is saved that points to the storage location of the duplicate data. If the data block's hash value differs, the block is non-duplicate data: it is new content that has not been stored before, so the data block must be saved in the data partition of the backup server and its hash value added to the hash library.
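How pointers can stand in for duplicate blocks while the data stays logically complete is sketched below. The class `UniqueDataStore`, its in-memory list of blocks, the use of list indexes as pointers, and the `restore` method are illustrative assumptions; the source does not specify the pointer format or a restore procedure.

```python
import hashlib

class UniqueDataStore:
    """Hypothetical in-memory stand-in for the unique data area and hash library."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.blocks = []        # unique data area: each distinct block stored once
        self.hash_library = {}  # digest -> index of the block in self.blocks

    def backup(self, data):
        """Return a list of pointers (block indexes) that represents data."""
        pointers = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.md5(block).hexdigest()
            index = self.hash_library.get(digest)
            if index is None:
                # Non-duplicate block: store it and record its hash
                index = len(self.blocks)
                self.blocks.append(block)
                self.hash_library[digest] = index
            # Duplicate or not, the backup only keeps a pointer to the block
            pointers.append(index)
        return pointers

    def restore(self, pointers):
        """Rebuild the original bytes by following the saved pointers."""
        return b"".join(self.blocks[i] for i in pointers)

store = UniqueDataStore(block_size=4)
mail_a = store.backup(b"HEADbodyHEAD")
mail_b = store.backup(b"bodyTAIL")
restored = store.restore(mail_a)
```

Here five logical blocks across the two backups are held as three stored blocks, and `restore` still reproduces the first backup byte for byte.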

In the embodiment shown in Figures 4 and 5, the growth of an enterprise's backup data volume over the five days of a week is taken as an example. A large amount of duplicate data is produced every day, which occupies storage resources and increases production costs. The method of the present invention is then used to delete the duplicate data, that is, the data marked in black in the figures. With this repeated every day, the total amount of backed-up data is visibly reduced.
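The effect over a five-day week can be imitated with a toy calculation. The data below is entirely hypothetical and is not taken from Figures 4 and 5: each day's backup is assumed to contain everything from the previous day plus one new 4-byte block, and the function name `cumulative_sizes` is made up for this sketch.

```python
import hashlib

BLOCK = 4
# Hypothetical daily backups: each day repeats earlier data and adds one block
daily_backups = [
    b"MON1",
    b"MON1TUE2",
    b"MON1TUE2WED3",
    b"MON1TUE2WED3THU4",
    b"MON1TUE2WED3THU4FRI5",
]

def cumulative_sizes(backups, block_size):
    """Return cumulative stored bytes per day, without and with deduplication."""
    library = set()
    plain_total = 0
    dedup_total = 0
    plain, dedup = [], []
    for data in backups:
        plain_total += len(data)  # every byte is stored again
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            digest = hashlib.md5(block).hexdigest()
            if digest not in library:
                library.add(digest)
                dedup_total += len(block)  # only unseen blocks are stored
        plain.append(plain_total)
        dedup.append(dedup_total)
    return plain, dedup

plain, dedup = cumulative_sizes(daily_backups, BLOCK)
```

Under these assumptions the stored volume without deduplication grows as 4, 12, 24, 40, 60 bytes, while with deduplication it grows as 4, 8, 12, 16, 20 bytes.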

The above are only embodiments of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A data deduplication method for a backup server, characterized in that its operation process is as follows:

Step 1: on the logical volume formed from the hard disks of the backup server, a partition is created to temporarily store the data to be backed up on the server; this partition acts as a data cache;

Step 2: the data cached in the above partition is split into blocks, each block is treated as a unit of calculation, and its hash value is calculated and stored in a database;

Step 3: the hash values are compared, that is, the hash value of the new data in each backup is compared with the hash values already stored in the library; if they are identical, the data is duplicate data; if not, the data is non-duplicate data, and the hash value of the data block is added to the hash library.

2. The data deduplication method for a backup server according to claim 1, characterized in that: the logical volume in step 1 is a RAID 5 logical volume built from three or more hard disks; in use, this logical volume first stores all of the data to be backed up; then, according to the judgment in step 3, the non-duplicate data in the cache is saved on the server and the duplicate data is replaced with pointers; after the duplicate data has been deleted, the logical volume is quick-formatted and waits for the user's next backup.

3. The data deduplication method for a backup server according to claim 1 or 2, characterized in that: the data splitting in step 2 proceeds as follows: a block size is defined first, the file is then split according to the defined block size, and the hash function values are calculated; the hash function values are a weak checksum and an MD5 strong checksum; the weak checksum is calculated first and used for a hash lookup, and if a match is found, the MD5 strong checksum is calculated and used for a further hash lookup.

4. The data deduplication method for a backup server according to claim 3, characterized in that: step 3 proceeds as follows: the hash function values calculated in step 2 form a hash function value library, which is stored separately at a fixed location in the partition; the hash value of the new data in each backup is compared with the hash values in this library; if a data block's hash value is identical to one in the library, a pointer is saved that points to the storage location of the duplicate data; if the data block's hash value differs, the block is non-duplicate data, the data block is saved in a unique data area of the backup server, and its hash value is added to the hash library.