CN103309975B - Duplicated data deleting method and apparatus - Google Patents


Info

Publication number
CN103309975B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310230732.5A
Other languages
Chinese (zh)
Other versions
CN103309975A (en)
Inventor
周景才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201310230732.5A
Publication of CN103309975A
Application granted
Publication of CN103309975B

Abstract

The invention discloses a duplicated data deleting method and apparatus. The method comprises the following steps: identifying the classification of a document to be stored; determining, according to the document classification, the duplicated data deleting rule to be used for the document to be stored; and performing duplicated data deletion on the document to be stored according to the determined rule. Because the duplicated data deleting rule is determined according to the document classification, duplicated data is deleted in a targeted manner and the duplicated data deleting ratio is improved.

Description

Data de-duplication method and equipment
Technical field
The present invention relates to the field of data storage, and in particular to a method and equipment for performing data de-duplication based on file classification.
Background
With the popularization of cloud computing technology, applications of the cloud-based virtual desktop infrastructure (VDI) have developed rapidly. At present, both in China and abroad, many large enterprises and governments are switching their conventional personal computers (PCs) to VDI desktop clouds, so that PCs that used to be mutually isolated information islands are organically linked together.
According to research data, 60% of the data stored by different users is duplicate data, and for different users in the same department the proportion of duplicate data is as high as 80%. In the field of data storage, how to effectively delete duplicate data between users has therefore become a matter of concern.
At present, the key point of data de-duplication technology is to use the SHA-1 digest algorithm to calculate fingerprint information that identifies the distinct content of a file. One way of calculating this fingerprint information is coarse-grained, per-file calculation, for example calculating the fingerprint information of each file from the summary information of the file. After the fingerprint information has been calculated in this way, the de-duplication technology compares it with the fingerprint information stored in a fingerprint database. When the calculated fingerprint information is identical to fingerprint information stored in the fingerprint database, the file or data block from which the fingerprint information was calculated is duplicate data and needs to be de-duplicated; otherwise, the file or data block is non-duplicate data and does not need to be de-duplicated.
However, the following problem exists in practical applications:
Assume that file A is already stored and that fingerprint information 1 of file A, calculated from the summary information of file A, is held in the fingerprint database. For file B to be stored, fingerprint information 2 is calculated from the summary information of file B. File A and file B belong to the same file type.
Compared with file A, file B has different summary information, but the part of file B other than the summary is identical to the part of file A other than the summary. In this case, fingerprint information 1 differs from fingerprint information 2, so file B is treated as non-duplicate data relative to file A and is stored. File B nevertheless contains a large amount of data identical to file A, so the data de-duplication rate of the files (the ratio of the total amount of original files to the total amount of files output after de-duplication) is relatively low.
That is, for files of the same file type, when the data used to calculate the fingerprint information of a file changes, the data de-duplication rate of the files becomes relatively low.
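The problem described above can be reproduced with a minimal sketch. The dictionary layout of the files, the name `summary_fingerprint`, and the sample contents are hypothetical and chosen only for illustration; SHA-1 is used because the background names it.

```python
import hashlib


def summary_fingerprint(summary: bytes) -> str:
    """Coarse-grained fingerprint: SHA-1 over the summary information only."""
    return hashlib.sha1(summary).hexdigest()


# Hypothetical files A and B: identical bodies, different summaries.
shared_body = b"shared body content " * 1000
file_a = {"summary": b"report, revision 1", "body": shared_body}
file_b = {"summary": b"report, revision 2", "body": shared_body}

# The fingerprint database already holds fingerprint information 1 (file A).
fingerprint_db = {summary_fingerprint(file_a["summary"])}

# Fingerprint information 2 (file B) does not match, so B is stored in full
# even though its body is byte-for-byte identical to that of A.
b_is_duplicate = summary_fingerprint(file_b["summary"]) in fingerprint_db
```

Here `b_is_duplicate` is false, so the shared body ends up stored twice, which is the low de-duplication rate the text refers to.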
Summary of the invention
Embodiments of the present invention provide a data de-duplication method and equipment.
According to a first aspect of the present invention, a method for de-duplicating a file is provided, including:
identifying the classification of a file to be stored;
determining, according to the classification of the file, the data de-duplication rule used for the file to be stored;
performing data de-duplication on the file to be stored according to the determined data de-duplication rule.
With reference to the first aspect, in a first possible implementation, the classification of the file includes active files and non-active files;
identifying the classification of the file to be stored specifically includes:
obtaining the number of occurrences of the file type of the file to be stored, and judging whether the number of occurrences of the file type is greater than a threshold; when the number of occurrences of the file type is greater than the threshold, determining that the file to be stored is an active file; when the number of occurrences of the file type is not greater than the threshold, determining that the file to be stored is a non-active file;
or, searching an active file database for the file type of the file to be stored; when the file type of the file to be stored is found in the active file database, determining that the file to be stored is an active file; when the file type of the file to be stored is not found in the active file database, determining that the file to be stored is a non-active file.
With reference to the first possible implementation of the first aspect, in a second possible implementation, determining, according to the classification of the file, the data de-duplication rule used for the file to be stored specifically includes:
when the file to be stored is an active file, the data de-duplication rule used for the file to be stored is block-level data de-duplication;
performing data de-duplication on the file to be stored according to the determined data de-duplication rule specifically includes:
dividing the file to be stored into multiple data blocks according to the block-level data de-duplication rule, and calculating the fingerprint information of each data block;
comparing the fingerprint information of each data block with the stored fingerprint information;
when the fingerprint information of a data block is identical to stored fingerprint information, storing reference information between the data block and the stored fingerprint information that is identical to the fingerprint information of the data block, and discarding the data block; when the fingerprint information of a data block differs from the stored fingerprint information, storing the data block and the calculated fingerprint information of the data block.
With reference to the first possible implementation of the first aspect, in a third possible implementation, determining, according to the classification of the file, the data de-duplication rule used for the file to be stored specifically includes:
when the file to be stored is a non-active file, the data de-duplication rule used for the file to be stored is file-level data de-duplication;
performing data de-duplication on the file to be stored according to the determined data de-duplication rule specifically includes:
selecting at least part of the file data from the file to be stored according to the file-level data de-duplication rule, and calculating the fingerprint information of the at least part of the file data;
comparing the calculated fingerprint information of the at least part of the file data with the stored fingerprint information;
when the calculated fingerprint information of the at least part of the file data is identical to stored fingerprint information, storing reference information between the file to be stored and the stored fingerprint information that is identical to the fingerprint information of the at least part of the file data, and discarding the file to be stored; when the calculated fingerprint information differs from the stored fingerprint information, storing the file to be stored and the calculated fingerprint information of the at least part of the file data.
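The file-level procedure above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `FileLevelStore`, the rule of selecting the first 64 bytes as the "at least part of the file data", and the use of SHA-1 are all assumptions made for the example.

```python
import hashlib


def select_partial_data(data: bytes, length: int = 64) -> bytes:
    """Hypothetical selection rule: take the first `length` bytes of the file."""
    return data[:length]


class FileLevelStore:
    def __init__(self):
        self.files_by_fingerprint = {}  # fingerprint -> stored file content
        self.references = {}            # file name -> fingerprint it refers to

    def put(self, name: str, data: bytes) -> bool:
        """Return True if the file was stored, False if it was discarded."""
        fingerprint = hashlib.sha1(select_partial_data(data)).hexdigest()
        is_new = fingerprint not in self.files_by_fingerprint
        if is_new:
            # No matching fingerprint: store the file and its fingerprint.
            self.files_by_fingerprint[fingerprint] = data
        # For simplicity a reference is recorded for every file name;
        # for a duplicate, only this reference is kept and the data is dropped.
        self.references[name] = fingerprint
        return is_new


store = FileLevelStore()
first_stored = store.put("a.txt", b"hello world")
second_stored = store.put("copy_of_a.txt", b"hello world")
```

Because only part of the file is fingerprinted, two files that agree on the selected part are treated as duplicates, which is how the text describes the file-level rule.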
According to a second aspect of the present invention, a de-duplication engine apparatus is provided, including:
an identification module, configured to identify the classification of a file to be stored;
a deletion rule determining module, configured to determine, according to the classification of the file, the data de-duplication rule used for the file to be stored;
a deletion module, configured to perform data de-duplication on the file to be stored according to the determined data de-duplication rule.
With reference to the second aspect, in a first possible implementation, the classification of the file includes active files and non-active files;
the identification module is specifically configured to obtain the number of occurrences of the file type of the file to be stored and judge whether the number of occurrences of the file type is greater than a threshold; when the number of occurrences of the file type is greater than the threshold, determine that the file to be stored is an active file; when the number of occurrences of the file type is not greater than the threshold, determine that the file to be stored is a non-active file;
or, search an active file database for the file type of the file to be stored; when the file type of the file to be stored is found in the active file database, determine that the file to be stored is an active file; when the file type of the file to be stored is not found in the active file database, determine that the file to be stored is a non-active file.
With reference to the first possible implementation of the second aspect, in a second possible implementation, the deletion rule determining module is specifically configured to determine, when the file to be stored is an active file, that the data de-duplication rule used for the file to be stored is block-level data de-duplication;
the deletion module is specifically configured to: divide the file to be stored into multiple data blocks according to the block-level data de-duplication rule, and calculate the fingerprint information of each data block; compare the fingerprint information of each data block with the stored fingerprint information; when the fingerprint information of a data block is identical to stored fingerprint information, store reference information between the data block and the stored fingerprint information that is identical to the fingerprint information of the data block, and discard the data block; when the fingerprint information of a data block differs from the stored fingerprint information, store the data block and the calculated fingerprint information of the data block.
With reference to the first possible implementation of the second aspect, in a third possible implementation, the deletion rule determining module is specifically configured to determine, when the file to be stored is a non-active file, that the data de-duplication rule used for the file to be stored is file-level data de-duplication;
the deletion module is specifically configured to: select at least part of the file data from the file to be stored according to the file-level data de-duplication rule, and calculate the fingerprint information of the at least part of the file data; compare the calculated fingerprint information of the at least part of the file data with the stored fingerprint information; when the calculated fingerprint information of the at least part of the file data is identical to stored fingerprint information, store reference information between the file to be stored and the stored fingerprint information that is identical to the fingerprint information of the at least part of the file data, and discard the file to be stored; when the calculated fingerprint information differs from the stored fingerprint information, store the file to be stored and the calculated fingerprint information of the at least part of the file data.
According to a third aspect of the present invention, data de-duplication equipment is provided, including:
an input monitoring device, configured to identify the classification of a file to be stored;
a processor, configured to determine, according to the classification of the file, the data de-duplication rule used for the file to be stored, and to perform data de-duplication on the file to be stored according to the determined data de-duplication rule.
With reference to the third aspect, in a first possible implementation, the classification of the file includes active files and non-active files;
the input monitoring device is specifically configured to obtain the number of occurrences of the file type of the file to be stored and judge whether the number of occurrences of the file type is greater than a threshold; when the number of occurrences of the file type is greater than the threshold, determine that the file to be stored is an active file; when the number of occurrences of the file type is not greater than the threshold, determine that the file to be stored is a non-active file;
or, search an active file database for the file type of the file to be stored; when the file type of the file to be stored is found in the active file database, determine that the file to be stored is an active file; when the file type of the file to be stored is not found in the active file database, determine that the file to be stored is a non-active file.
With reference to the first possible implementation of the third aspect, in a second possible implementation, the processor is specifically configured to: determine, when the file to be stored is an active file, that the data de-duplication rule used for the file to be stored is block-level data de-duplication; divide the file to be stored into multiple data blocks according to the block-level data de-duplication rule, and calculate the fingerprint information of each data block; compare the fingerprint information of each data block with the stored fingerprint information; when the fingerprint information of a data block is identical to stored fingerprint information, store reference information between the data block and the stored fingerprint information that is identical to the fingerprint information of the data block, and discard the data block; when the fingerprint information of a data block differs from the stored fingerprint information, store the data block and the calculated fingerprint information of the data block.
With reference to the first possible implementation of the third aspect, in a third possible implementation, the processor is specifically configured to: determine, when the file to be stored is a non-active file, that the data de-duplication rule used for the file to be stored is file-level data de-duplication; select at least part of the file data from the file to be stored according to the file-level data de-duplication rule, and calculate the fingerprint information of the at least part of the file data; compare the calculated fingerprint information of the at least part of the file data with the stored fingerprint information; when the calculated fingerprint information of the at least part of the file data is identical to stored fingerprint information, store reference information between the file to be stored and the stored fingerprint information that is identical to the fingerprint information of the at least part of the file data, and discard the file to be stored; when the calculated fingerprint information differs from the stored fingerprint information, store the file to be stored and the calculated fingerprint information of the at least part of the file data.
In the embodiments of the present invention, the classification of a file to be stored is identified, the data de-duplication rule used for the file to be stored is determined according to the file classification, and data de-duplication is performed on the file to be stored according to the determined rule. Because the data de-duplication rule is determined from the classification of the file, the file to be stored is de-duplicated in a targeted manner and the file data de-duplication rate is improved.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a data de-duplication method according to Embodiment one of the present invention;
Fig. 2 is a schematic flowchart of a data de-duplication method according to Embodiment two of the present invention;
Fig. 3 is a schematic flowchart of the method for obtaining active files in the active file database;
Fig. 4 is a schematic flowchart of a data de-duplication method according to Embodiment three of the present invention;
Fig. 5 is a schematic structural diagram of data de-duplication equipment according to Embodiment four of the present invention;
Fig. 6 is a schematic structural diagram of data de-duplication equipment according to Embodiment five of the present invention;
Fig. 7 is a logical architecture diagram of the data de-duplication equipment;
Fig. 8 is a system architecture diagram of the data de-duplication equipment.
Detailed description of the embodiments
To achieve the object of the present invention, embodiments of the present invention provide a data de-duplication method and equipment. The classification of a file to be stored is identified, the data de-duplication rule used for the file to be stored is determined according to the file classification, and data de-duplication is performed on the file to be stored according to the determined rule. Because the rule is determined from the classification of the file, the file to be stored is de-duplicated in a targeted manner and the file data de-duplication rate is improved.
It should be noted that the set values and thresholds involved in the embodiments of the present invention may be determined according to actual needs or according to experimental data, and are not limited here.
The embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Embodiment one:
Fig. 1 is a schematic flowchart of a data de-duplication method according to Embodiment one of the present invention. The method includes:
Step 101: Identify the classification of a file to be stored.
The classification of the file includes active files and non-active files.
Specifically, in step 101, the file format of the obtained file to be stored is identified, the file type of the file to be stored is determined, and the classification category to which the determined file type belongs is determined according to a file classification rule.
The file type includes, but is not limited to, one or more of the doc file type, the txt file type, the pdf file type, the ppt file type, and the like.
The file classification rule includes file size (large files and small files), file generation time (expired files and new files), number of occurrences (active files and non-active files), and the like.
Preferably, first, the file format of the file to be stored is obtained, and the file type corresponding to the file format is determined.
For example, if the file format of the file to be stored obtained through a read/write operation is XXX.doc, the file type corresponding to the file format is determined to be the doc file type.
Second, the determined file type is compared with the file types stored in the active file database.
Specifically, it is judged whether the determined file type is identical to a file type stored in the active file database;
or the active file database is searched for a file type identical to the determined file type.
The active file database is built as follows: a file type identification device identifies the file type of each received file and records the number of times each file type occurs; when a set time period elapses, the file types that have occurred are classified, which specifically includes:
comparing the number of occurrences of each file type with a set threshold; when the number of occurrences of a file type is greater than the set threshold, determining that the file type is an active file; when the number of occurrences of a file type is not greater than the set threshold, determining that the file type is a non-active file.
Preferably, only the file types determined to be active files are stored in the active file database, and the file types judged to belong to non-active files are deleted.
It should be noted that the active file database may store not only the file types of active files but also the file types of non-active files, which is not limited here.
In this way, by adjusting the file types of active files in the active file database in real time, the file types that are used most frequently or relatively frequently are determined. In other words, the file types with a larger data de-duplication workload are screened out, and a suitable data de-duplication rule is determined for them, which improves the file data de-duplication rate.
Third, when a file type identical to the file type of the file to be stored is found in the active file database, the file to be stored is determined to be an active file; when no file type identical to the file type of the file to be stored is found in the active file database, the file to be stored is determined to be a non-active file.
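The way the active file database is populated, as described above, can be sketched as a counting pass over the file types observed in one time period. The function name `build_active_type_db`, the use of a plain set as the "database", and the sample values are assumptions for illustration only.

```python
from collections import Counter


def build_active_type_db(observed_types, threshold):
    """Keep only file types whose occurrence count in the period exceeds the threshold."""
    counts = Counter(observed_types)
    return {file_type for file_type, n in counts.items() if n > threshold}


# Hypothetical file types seen during one set time period.
observed = ["doc", "doc", "doc", "pdf", "txt", "doc", "pdf"]
active_type_db = build_active_type_db(observed, threshold=2)
```

Rebuilding the set at the end of each period corresponds to the real-time adjustment of the database that the text mentions; a type that stops appearing simply drops out on the next rebuild.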
Specifically, one method of identifying the classification of the file to be stored includes:
First, obtain the file format of the file to be stored, determine the file type corresponding to the file format of the file to be stored, and then obtain the number of occurrences of the determined file type.
Second, judge whether the number of occurrences of the file type is greater than a threshold.
Third, when the number of occurrences of the determined file type is greater than the threshold, determine that a file of this file type is an active file; when the number of occurrences of the file type is not greater than the threshold, determine that a file of this file type is a non-active file.
Alternatively, the method of identifying the classification of the file to be stored includes:
First, obtain the file format of the file to be stored, and determine the file type corresponding to the file format of the file to be stored.
Second, search the active file database for the file type of the file to be stored.
Third, when the file type of the file to be stored is found in the active file database, determine that the file to be stored is an active file; when the file type of the file to be stored is not found in the active file database, determine that the file to be stored is a non-active file.
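The two identification methods above can be sketched side by side. Deriving the file type from the file name extension follows the XXX.doc example in the text; the function names, the string labels, and the dictionary and set used for the counts and the database are hypothetical.

```python
import os

ACTIVE = "active"
NON_ACTIVE = "non-active"


def file_type_of(file_name: str) -> str:
    """Derive the file type from the extension, e.g. 'XXX.doc' -> 'doc'."""
    return os.path.splitext(file_name)[1].lstrip(".").lower()


def classify_by_count(file_name: str, type_counts: dict, threshold: int) -> str:
    """Method 1: compare the occurrence count of the file type with a threshold."""
    count = type_counts.get(file_type_of(file_name), 0)
    return ACTIVE if count > threshold else NON_ACTIVE


def classify_by_lookup(file_name: str, active_type_db: set) -> str:
    """Method 2: look the file type up in the active file database."""
    return ACTIVE if file_type_of(file_name) in active_type_db else NON_ACTIVE
```

Note that a count equal to the threshold yields a non-active file, matching the "not greater than" wording of the text.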
It should be noted that the embodiments below use the scheme based on the active file database.
Certainly, in another embodiment, the database may store both the file types determined to be active files and the file types determined to be non-active files.
Step 102: Determine, according to the classification of the file, the data de-duplication rule used for the file to be stored.
Specifically, in step 102, when the file to be stored is an active file, the data de-duplication rule used for the file to be stored is block-level data de-duplication;
when the file to be stored is a non-active file, the data de-duplication rule used for the file to be stored is file-level data de-duplication.
That is, a correspondence is established between the classification categories of files and the data de-duplication rules: active files correspond to the block-level data de-duplication rule, and non-active files correspond to the file-level data de-duplication rule.
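The correspondence between classification categories and rules can be sketched as a simple lookup table. The enum and function names here are hypothetical; only the two pairings themselves come from the text.

```python
from enum import Enum


class FileClass(Enum):
    ACTIVE = "active"
    NON_ACTIVE = "non-active"


class DedupRule(Enum):
    BLOCK_LEVEL = "block-level"
    FILE_LEVEL = "file-level"


# Correspondence described in the text: active -> block level, non-active -> file level.
RULE_FOR_CLASS = {
    FileClass.ACTIVE: DedupRule.BLOCK_LEVEL,
    FileClass.NON_ACTIVE: DedupRule.FILE_LEVEL,
}


def select_rule(file_class: FileClass) -> DedupRule:
    """Return the de-duplication rule that corresponds to a file classification."""
    return RULE_FOR_CLASS[file_class]
```

Keeping the mapping in one table means that a further classification category, if one were added, would need only a new entry rather than new branching logic.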
The block-level data de-duplication rule is a rule under which a file is divided into multiple data blocks according to a set data block division rule, the fingerprint information of each data block is calculated, and data de-duplication is performed according to the calculated fingerprint information of each data block.
The file-level data de-duplication rule is a rule under which at least part of the file data is selected from a file, the fingerprint information of the selected file data is calculated, and data de-duplication is performed according to the calculated fingerprint information of that file data.
Specifically, in step 102, determining, according to the classification of the file, the data de-duplication rule used for the file to be stored includes:
First, obtain the number of occurrences of the file type of the file to be stored.
For example, if the file type of the file to be stored is the doc file type, counting shows that the doc file type occurs 100 times in the file database.
Second, judge whether the number of occurrences of the file type of the file to be stored is greater than a threshold.
Specifically, the number of occurrences of the file type of the file to be stored is compared with the threshold.
Third, when the number of occurrences of the file type of the file to be stored is greater than the threshold, a file of this file type is determined to be an active file, and according to the correspondence between active files and the block-level data de-duplication rule, the data de-duplication rule used for the file to be stored is determined to be the block-level data de-duplication rule; when the number of occurrences of the file type of the file to be stored is not greater than the threshold, a file of this file type is determined to be a non-active file, and according to the correspondence between non-active files and the file-level data de-duplication rule, the data de-duplication rule used for the file to be stored is determined to be the file-level data de-duplication rule.
In this way, because the number of occurrences of a file type differs between time periods, the data de-duplication rule applied to the same file type is adjusted in real time, and after long-term training and learning the data de-duplication rate can be improved.
The block-level data de-duplication rule is a rule under which a file of the corresponding file type is divided into multiple data blocks according to a set data block division rule, the fingerprint information of each data block is calculated, and data de-duplication is performed according to the calculated fingerprint information of each data block.
The division rule may be the size of the data blocks, the division duration, or the like, and is not limited here.
Specifically, assuming that the set data block size is 1M, the file to be stored is divided into multiple data blocks (each data block being 1M in size), and the fingerprint information of each data block is obtained using a hash algorithm.
Thus, for the same file, the smaller the data block size, the finer the division granularity and the more fingerprint information is calculated, so the data de-duplication rate is higher when the file is de-duplicated. The block-level data de-duplication rule is especially suitable for file types that occur many times within a short period: it helps to quickly determine the repeated data blocks in files of such a type and improves the file data de-duplication rate.
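The fixed-size division and per-block fingerprinting described above can be sketched as follows. The text only says "a hash algorithm"; SHA-1 is assumed here because the background names it, and reading 1M as 1 MiB (1024 * 1024 bytes) is likewise an assumption. The function names are hypothetical.

```python
import hashlib

ONE_MIB = 1024 * 1024  # assumed reading of the "1M" block size in the text


def split_blocks(data: bytes, block_size: int = ONE_MIB):
    """Divide the data into consecutive blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def block_fingerprints(data: bytes, block_size: int = ONE_MIB):
    """Compute one SHA-1 fingerprint per block."""
    return [hashlib.sha1(block).hexdigest() for block in split_blocks(data, block_size)]


# A hypothetical file of 3 MiB plus one byte.
data = bytes(3 * ONE_MIB + 1)
coarse = block_fingerprints(data)               # 1 MiB blocks
fine = block_fingerprints(data, ONE_MIB // 2)   # 512 KiB blocks
```

Halving the block size yields more fingerprints for the same file (`len(fine)` exceeds `len(coarse)`), which illustrates the granularity trade-off: finer blocks expose more repeated data at the cost of more fingerprints to compute, store, and compare.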
The file-level data de-duplication rule is a rule under which at least part of the file data is selected from a file, the fingerprint information of the selected file data is calculated, and data de-duplication is performed according to the calculated fingerprint information of that file data.
Specifically, assuming that the at least part of the file data selected from the file to be stored is the summary portion of the file to be stored, the fingerprint information of the selected summary portion is calculated using a hash algorithm, and the calculated fingerprint information is used as the fingerprint information of the file to be stored.
The file-level data de-duplication rule is suitable for file types that occur few times within a short period, that is, file types with few repeated files, and improves the file data de-duplication rate.
It can be seen that, relative to the file-level data de-duplication rule, the block-level data de-duplication rule is a fine-grained data de-duplication rule, and it avoids the situation in which a large amount of duplicate data remains after a file is de-duplicated using the file-level data de-duplication rule.
Step 103: According to the determined data de-duplication rule, perform data de-duplication on the file to be stored.
Specifically, in step 103, when the file to be stored is an active file, the data de-duplication rule used for the file to be stored is the block-level data de-duplication rule, and performing data de-duplication on the file to be stored according to the block-level data de-duplication rule includes:
First, according to the block-level data de-duplication rule, the file to be stored is divided into multiple data blocks, and the fingerprint information of each data block is calculated.
Second, the fingerprint information of each data block is compared with the stored fingerprint information.
Specifically, the fingerprint information of each data block is compared with the fingerprint information stored in the file fingerprint library to determine whether the fingerprint information of each data block has already been stored in the file fingerprint library.
Third, when the fingerprint information of a data block is identical to stored fingerprint information, reference information between the data block and the stored fingerprint information that is identical to the fingerprint information of the data block is stored, and the data block is discarded; when the fingerprint information of a data block differs from the stored fingerprint information, the data block and the calculated fingerprint information of the data block are stored.
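The three block-level steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: SHA-256 as the hash algorithm, the in-memory dictionary standing in for the file fingerprint library, the tiny block size, and the names `fingerprint`, `dedup_blocks` and `rebuild` are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical tiny block size so the example is easy to follow


def fingerprint(data: bytes) -> str:
    """Fingerprint of a byte string; SHA-256 is an assumed choice of hash."""
    return hashlib.sha256(data).hexdigest()


def dedup_blocks(data: bytes, fingerprint_store: dict, block_size: int = BLOCK_SIZE) -> list:
    """Split data into fixed-size blocks and store only blocks not seen before.

    fingerprint_store maps fingerprint -> block bytes and stands in for the
    file fingerprint library plus block storage. The returned list of
    fingerprints plays the role of the reference information.
    """
    references = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fp = fingerprint(block)
        if fp not in fingerprint_store:
            # Unknown fingerprint: store the block together with its fingerprint.
            fingerprint_store[fp] = block
        # Known fingerprint: the block itself is discarded, only a reference is kept.
        references.append(fp)
    return references


def rebuild(references: list, fingerprint_store: dict) -> bytes:
    """Reassemble the original data from its references."""
    return b"".join(fingerprint_store[fp] for fp in references)


store = {}
refs = dedup_blocks(b"AAAABBBBAAAACCCC", store)
print(len(refs), "references,", len(store), "stored blocks")
```

In this example the input has four blocks but only three distinct ones, so the second `AAAA` block is replaced by a reference to the first.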
When the file to be stored is a non-active file, the data de-duplication rule used for the file to be stored is the file-level data de-duplication rule, and performing data de-duplication on the file to be stored according to the file-level data de-duplication rule specifically includes:
First, according to the file-level data de-duplication rule, at least part of the file data is selected from the file to be stored, and the fingerprint information of the at least part of the file data is calculated.
Second, the calculated fingerprint information of the at least part of the file data is compared with the stored fingerprint information.
Specifically, the calculated fingerprint information of the at least part of the file data is compared with the fingerprint information stored in the file fingerprint library to determine whether the calculated fingerprint information of the at least part of the file data has already been stored in the file fingerprint library.
Third, when the calculated fingerprint information of the at least part of the file data is identical to stored fingerprint information, reference information between the file to be stored and the stored fingerprint information that is identical to the fingerprint information of the at least part of the file data is stored, and the file to be stored is discarded; when the calculated fingerprint information differs from the stored fingerprint information, the file to be stored and the calculated fingerprint information of the at least part of the file data are stored.
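The file-level procedure can be sketched in the same way. The code below is only an illustration under stated assumptions: the "at least part of the file data" is taken to be the first few bytes (the hypothetical `select_part`), SHA-256 is an assumed hash algorithm, and the two dictionaries stand in for the file fingerprint library and the reference information.

```python
import hashlib


def select_part(data: bytes, length: int = 8) -> bytes:
    """Hypothetical selection rule: the first `length` bytes stand in for
    the 'at least part of the file data' named in the text."""
    return data[:length]


def dedup_file(name: str, data: bytes, fingerprint_store: dict, references: dict) -> bool:
    """Return True if the file was stored, False if it was discarded as a duplicate.

    fingerprint_store maps fingerprint -> file bytes; references maps
    file name -> fingerprint. Recording a reference for newly stored files
    as well is a bookkeeping assumption of this sketch.
    """
    fp = hashlib.sha256(select_part(data)).hexdigest()
    references[name] = fp
    if fp in fingerprint_store:
        # Fingerprint already known: keep only the reference, discard the file.
        return False
    # Fingerprint unknown: store the file together with its fingerprint.
    fingerprint_store[fp] = data
    return True


store, catalog = {}, {}
first_stored = dedup_file("a.txt", b"same content", store, catalog)
second_stored = dedup_file("b.txt", b"same content", store, catalog)
print(first_stored, second_stored, len(store))
```

Note that fingerprinting only part of a file treats any two files that agree on that part as duplicates, so the choice of the selected portion matters in practice.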
With the scheme of Embodiment one of the present invention, the classification of the file to be stored is identified, the data de-duplication rule used for the file to be stored is determined according to the classification of the file, and data de-duplication is performed on the file to be stored according to the determined data de-duplication rule. In this way, the classification of the file is used to determine the data de-duplication rule, so that data de-duplication is performed on the file to be stored in a targeted manner, which improves the file data de-duplication rate.
Embodiment two:
Fig. 2 is a schematic flowchart of a method for performing data de-duplication on a file according to Embodiment two of the present invention. Embodiment two of the present invention is a method under the same inventive concept as Embodiment one of the present invention, and the method includes:
Step 201: Determine whether the received file to be stored belongs to the active files stored in the active file database; if it does, perform step 202; if it does not, perform step 206.
Specifically, in step 201, the ways of obtaining the active files stored in the active file database include but are not limited to the following:
Fig. 3 is a schematic flowchart of a method for obtaining the active files in the active file database.
Step 21: Scan all files in the current active file database and determine the file type of each file.
Step 22: For each file type, obtain the number of occurrences of the file type from the file type basic information library, count the file-level number of repetitions and the block-level number of repetitions of the file type, and generate a file type repetition statistics table.
Table 1 shows the file type repetition statistics table:
File type | Frequency of occurrence | File-level number of repetitions | Block-level number of repetitions
doc file type | 150 | 56 | 94
txt file type | 120 | 45 | 75
pdf file type | 125 | 46 | 79
Table 1
The file type basic information library is a database that stores file type information and the number of occurrences of each file type.
Step 23: Read the data of any file type in the file type repetition statistics table, and determine the whole-file repetition rate of the file type according to the file-level number of repetitions and the block-level number of repetitions of the file type.
Specifically, the whole-file repetition rate of a file type equals the ratio of the file-level number of repetitions of the file type to the block-level number of repetitions of the file type.
For example: if the data read from the file type repetition statistics table are for the doc file type, whose file-level number of repetitions is 56 and whose block-level number of repetitions is 94, the whole-file repetition rate of the doc file type is 56/94; if the data read are for the txt file type, whose file-level number of repetitions is 45 and whose block-level number of repetitions is 75, the whole-file repetition rate of the txt file type is 45/75; if the data read are for the pdf file type, whose file-level number of repetitions is 46 and whose block-level number of repetitions is 79, the whole-file repetition rate of the pdf file type is 46/79.
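The ratios in this example can be checked with a few lines of code. The table values are those of Table 1; the dictionary layout and the function name `whole_file_repetition_rate` are illustrative assumptions.

```python
from fractions import Fraction

# Values copied from Table 1; the dictionary layout is an assumption.
table_1 = {
    "doc": {"occurrences": 150, "file_level": 56, "block_level": 94},
    "txt": {"occurrences": 120, "file_level": 45, "block_level": 75},
    "pdf": {"occurrences": 125, "file_level": 46, "block_level": 79},
}


def whole_file_repetition_rate(row: dict) -> Fraction:
    """File-level number of repetitions divided by block-level number of repetitions."""
    return Fraction(row["file_level"], row["block_level"])


rates = {file_type: whole_file_repetition_rate(row) for file_type, row in table_1.items()}
for file_type, rate in rates.items():
    print(file_type, rate, f"{float(rate):.1%}")
```

Expressed as percentages, the three rates are approximately 59.6% (doc), 60.0% (txt) and 58.2% (pdf), which is the form in which they would be compared with a percentage threshold.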
Step 24: Compare the calculated whole-file repetition rate of each file type with a threshold.
Specifically, determine whether the calculated whole-file repetition rate of each file type is greater than the threshold.
It should be noted that the threshold may be a percentage between 1% and 100%, and may be determined according to actual needs.
Step 25: According to the comparison result, determine whether the files corresponding to each file type are active files or non-active files.
Specifically, for a selected file type, when the calculated whole-file repetition rate of the file type is greater than the threshold, the files corresponding to the file type are determined to be active files; when the calculated whole-file repetition rate of the file type is not greater than the threshold, the files corresponding to the file type are determined to be non-active files.
Preferably, the file types determined to correspond to active files are refreshed into the active file database, and the file types determined to correspond to non-active files are deleted from the active file database.
Specifically, the file type of the received file to be stored is compared with the file types stored in the active file database. When a file type identical to the file type of the file to be stored is found in the active file database, the file to be stored is determined to be an active file; when no file type identical to the file type of the file to be stored is found in the active file database, the file to be stored is determined to be a non-active file.
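Steps 24 and 25 and the subsequent lookup can be sketched as follows. The threshold of 59% is a hypothetical value chosen only so that the Table 1 rates fall on both sides of it, a Python set stands in for the active file database, and deriving the file type from the file name extension is a simplifying assumption (the text does not specify how the file type is determined at this point).

```python
from fractions import Fraction


def refresh_active_types(rates: dict, threshold: Fraction, active_types: set) -> set:
    """Add file types whose whole-file repetition rate exceeds the threshold
    and remove those whose rate does not."""
    for file_type, rate in rates.items():
        if rate > threshold:
            active_types.add(file_type)
        else:
            active_types.discard(file_type)
    return active_types


def is_active_file(filename: str, active_types: set) -> bool:
    """Look up the file type (assumed here to be the extension) in the database."""
    file_type = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return file_type in active_types


rates = {"doc": Fraction(56, 94), "txt": Fraction(45, 75), "pdf": Fraction(46, 79)}
threshold = Fraction(59, 100)  # hypothetical threshold of 59%
active = refresh_active_types(rates, threshold, {"pdf"})

print(sorted(active))
print(is_active_file("report.doc", active), is_active_file("paper.pdf", active))
```

With this threshold the doc and txt types are kept as active types, while pdf, which was previously in the database, is removed because its rate does not exceed the threshold.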
Step 202: When it is determined that the received file to be stored is an active file, determine, according to the correspondence between active files and the block-level data de-duplication rule, that the data de-duplication rule used for the file to be stored is the block-level data de-duplication rule.
The block-level data de-duplication rule refers to a rule that divides the file corresponding to a file type into multiple data blocks according to a set data block division rule, calculates the fingerprint information of each data block, and performs data de-duplication according to the calculated fingerprint information of each data block.
Step 203: According to the block-level data de-duplication rule, divide the file to be stored into multiple data blocks, and calculate the fingerprint information of each data block.
Specifically, in step 203, assuming that the set data block size is 1M, the file to be stored is divided into multiple data blocks (each data block being 1M in size), and the fingerprint information of each data block is obtained using a hash algorithm.
Thus, for the same file, the smaller the configured data block size, the finer the division granularity and the more fingerprint information is calculated, so the data de-duplication rate achieved when the file is de-duplicated is higher. The block-level data de-duplication rule is particularly suitable for file types that occur frequently within a short period, because it helps to quickly identify the repeated data blocks in files of that type and improves the file data de-duplication rate.
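The effect of block size on granularity can be illustrated with a small experiment. This is a sketch under assumptions: SHA-256 is an assumed hash algorithm, the byte strings and the block sizes of 8 and 4 bytes are toy values, and `block_fingerprints` and `duplicate_blocks` are hypothetical helper names.

```python
import hashlib


def block_fingerprints(data: bytes, block_size: int) -> list:
    """Fingerprints of consecutive fixed-size blocks of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def duplicate_blocks(first: bytes, second: bytes, block_size: int) -> int:
    """Number of blocks of `second` whose fingerprint already occurs in `first`."""
    known = set(block_fingerprints(first, block_size))
    return sum(fp in known for fp in block_fingerprints(second, block_size))


first, second = b"AAAABBBB", b"AAAACCCC"
coarse = duplicate_blocks(first, second, 8)  # one 8-byte block per file
fine = duplicate_blocks(first, second, 4)    # two 4-byte blocks per file
print(coarse, fine)
```

With 8-byte blocks each file yields a single fingerprint and no duplicate is found; with 4-byte blocks each file yields two fingerprints and the shared `AAAA` block is detected. The finer granularity finds more duplicates at the cost of computing and storing more fingerprints.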
Step 204: Determine whether the fingerprint information of each data block is identical to stored fingerprint information.
Specifically, the fingerprint information of each data block is compared with the fingerprint information stored in the file fingerprint library to determine whether the fingerprint information of each data block has already been stored in the file fingerprint library.
Step 205: According to the determination result, perform data de-duplication on the file to be stored.
Specifically, in step 205, when the fingerprint information of a data block is identical to stored fingerprint information, reference information between the data block and the stored fingerprint information that is identical to the fingerprint information of the data block is stored, and the data block is discarded; when the fingerprint information of a data block differs from the stored fingerprint information, the data block and the calculated fingerprint information of the data block are stored.
Step 206: When it is determined that the received file to be stored is a non-active file, determine, according to the correspondence between non-active files and the file-level data de-duplication rule, that the data de-duplication rule corresponding to the file to be stored is the file-level data de-duplication rule.
The file-level data de-duplication rule refers to a rule that selects at least part of the file data from a file, calculates the fingerprint information of the selected file data, and performs data de-duplication according to the calculated fingerprint information of the file data.
Specifically, assuming that the at least part of the file data selected from the file to be stored is the summary portion of the file to be stored, the fingerprint information of the selected summary portion is calculated using a hash algorithm, and the calculated fingerprint information is used as the fingerprint information of the file to be stored.
The file-level data de-duplication rule is suitable for file types that occur less frequently within a short period, that is, file types whose files are rarely repeated, and it improves the file data de-duplication rate.
Step 207: According to the file-level data de-duplication rule, select at least part of the file data from the file to be stored, and calculate the fingerprint information of the at least part of the file data.
Step 208: Compare the calculated fingerprint information of the at least part of the file data with the stored fingerprint information.
Specifically, the calculated fingerprint information of the at least part of the file data is compared with the fingerprint information stored in the file fingerprint library to determine whether the calculated fingerprint information of the at least part of the file data has already been stored in the file fingerprint library.
Step 209: According to the comparison result, perform data de-duplication on the file to be stored.
Specifically, in step 209, when the calculated fingerprint information of the at least part of the file data is identical to stored fingerprint information, reference information between the file to be stored and the stored fingerprint information that is identical to the fingerprint information of the at least part of the file data is stored, and the file to be stored is discarded; when the calculated fingerprint information differs from the stored fingerprint information, the file to be stored and the calculated fingerprint information of the at least part of the file data are stored.
With the scheme of Embodiment two of the present invention, a hybrid data de-duplication technique is used, which can reduce the number of file splitting operations and the amount of fingerprint information in the system. For different files, the block-level data de-duplication rule and the file-level data de-duplication rule are applied in a targeted manner, which improves the file data de-duplication rate.
Embodiment three:
Fig. 4 is a schematic flowchart of a data de-duplication method according to Embodiment three of the present invention. Embodiment three of the present invention is a method under the same inventive concept as Embodiment one and Embodiment two of the present invention, and the method includes:
Step 301: Monitor the file to be stored that is input through the I/O port, and determine the file type of the monitored file to be stored using a file type identifier.
Specifically, in step 301, the file to be stored that is input through the I/O port is monitored in real time, and the file type of the monitored file to be stored is identified using the file type identifier.
Preferably, after the file type of the file to be stored is determined, the determined file type is looked up in the file type basic information library, the number of occurrences of the determined file type is increased by a set value, and the number of occurrences of the file type in the file type basic information library is refreshed.
The file type basic information library is a database that stores file type information and the number of occurrences of each file type.
Step 302: Obtain the number of occurrences of the file type of the file to be stored.
Step 303: Determine whether the number of occurrences of the file type of the file to be stored is greater than a threshold.
Specifically, the number of occurrences of the file type of the file to be stored is compared with the threshold.
When the number of occurrences of the file type of the file to be stored is greater than the threshold, perform steps 304, 305, 306 and 307; when the number of occurrences of the file type of the file to be stored is not greater than the threshold, perform steps 308, 309, 310 and 311.
Preferably, in step 303, when the number of occurrences of the file type of the file to be stored is greater than the threshold, the file to be stored is determined to be an active file, and the file type of the file to be stored is refreshed into the active file database.
Step 304: When the number of occurrences of the file type of the file to be stored is greater than the threshold, determine, according to the correspondence between the file classification determined from the number of occurrences of the file type and the data de-duplication rule, that the data de-duplication rule of the file to be stored is the block-level data de-duplication rule.
The correspondence between the file classification determined from the file type of the file to be stored and the data de-duplication rule is as follows: if the number of occurrences of the file type of the file to be stored is greater than the threshold, the file to be stored is an active file and corresponds to the block-level data de-duplication rule; if the number of occurrences of the file type of the file to be stored is not greater than the threshold, the file to be stored is a non-active file and corresponds to the file-level data de-duplication rule.
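The counting and rule selection of steps 301 to 304 can be sketched as follows. The counter stands in for the file type basic information library; the increment of 1, the threshold of 100, the initial counts and the names `record_occurrence` and `choose_rule` are hypothetical values chosen for illustration.

```python
from collections import Counter

BLOCK_LEVEL = "block-level"
FILE_LEVEL = "file-level"


def record_occurrence(type_counts: Counter, file_type: str, increment: int = 1) -> int:
    """Increase the occurrence count of a file type by a set value and return it."""
    type_counts[file_type] += increment
    return type_counts[file_type]


def choose_rule(occurrences: int, threshold: int) -> str:
    """Greater than the threshold: active file, block-level rule.
    Not greater than the threshold: non-active file, file-level rule."""
    return BLOCK_LEVEL if occurrences > threshold else FILE_LEVEL


counts = Counter({"doc": 150, "log": 2})  # hypothetical starting counts
threshold = 100                            # hypothetical threshold

rule_doc = choose_rule(record_occurrence(counts, "doc"), threshold)
rule_log = choose_rule(record_occurrence(counts, "log"), threshold)
print(rule_doc, rule_log)
```

Because the comparison is "greater than", a file type whose count exactly equals the threshold is still treated as non-active and handled with the file-level rule.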
The block-level data de-duplication rule refers to a rule that divides the file corresponding to a file type into multiple data blocks according to a set data block division rule, calculates the fingerprint information of each data block, and performs data de-duplication according to the calculated fingerprint information of each data block.
Step 305: According to the block-level data de-duplication rule, divide the file to be stored into multiple data blocks, and calculate the fingerprint information of each data block.
Specifically, in step 305, assuming that the set data block size is 1M, the file to be stored is divided into multiple data blocks (each data block being 1M in size), and the fingerprint information of each data block is obtained using a hash algorithm.
Thus, for the same file, the smaller the configured data block size, the finer the division granularity and the more fingerprint information is calculated, so the data de-duplication rate achieved when the file is de-duplicated is higher. The block-level data de-duplication rule is particularly suitable for file types that occur frequently within a short period, because it helps to quickly identify the repeated data blocks in files of that type and improves the file data de-duplication rate.
Step 306: Compare the fingerprint information of each data block with the stored fingerprint information.
Specifically, in step 306, the fingerprint information of each data block is compared with the fingerprint information stored in the file fingerprint library to determine whether the fingerprint information of each data block has already been stored in the file fingerprint library.
Step 307: According to the comparison result, perform data de-duplication on the file to be stored.
Specifically, in step 307, when the fingerprint information of a data block is identical to stored fingerprint information, reference information between the data block and the stored fingerprint information that is identical to the fingerprint information of the data block is stored, and the data block is discarded; when the fingerprint information of a data block differs from the stored fingerprint information, the data block and the calculated fingerprint information of the data block are stored.
Step 308: When the number of occurrences of the file type of the file to be stored is not greater than the threshold, determine, according to the correspondence between the file classification determined from the number of occurrences of the file type and the data de-duplication rule, that the data de-duplication rule corresponding to the file to be stored is the file-level data de-duplication rule.
The file-level data de-duplication rule refers to a rule that selects at least part of the file data from a file, calculates the fingerprint information of the selected file data, and performs data de-duplication according to the calculated fingerprint information of the file data.
Step 309: According to the file-level data de-duplication rule, select at least part of the file data from the file to be stored, and calculate the fingerprint information of the at least part of the file data.
Specifically, in step 309, assuming that the at least part of the file data selected from the file to be stored is the summary portion of the file to be stored, the fingerprint information of the selected summary portion is calculated using a hash algorithm, and the calculated fingerprint information is used as the fingerprint information of the file to be stored.
The file-level data de-duplication rule is suitable for file types that occur less frequently within a short period, that is, file types whose files are rarely repeated, and it improves the file data de-duplication rate.
Step 310: Compare the calculated fingerprint information of the at least part of the file data with the stored fingerprint information.
Specifically, the calculated fingerprint information of the at least part of the file data is compared with the fingerprint information stored in the file fingerprint library to determine whether the calculated fingerprint information of the at least part of the file data has already been stored in the file fingerprint library.
Step 311: According to the comparison result, perform data de-duplication on the file to be stored.
Specifically, in step 311, when the calculated fingerprint information of the at least part of the file data is identical to stored fingerprint information, reference information between the file to be stored and the stored fingerprint information that is identical to the fingerprint information of the at least part of the file data is stored, and the file to be stored is discarded; when the calculated fingerprint information differs from the stored fingerprint information, the file to be stored and the calculated fingerprint information of the at least part of the file data are stored.
Embodiment four:
Fig. 5 is a schematic structural diagram of data de-duplication equipment according to Embodiment four of the present invention. Embodiment four of the present invention is equipment under the same inventive concept as Embodiments one to three of the present invention, and the equipment includes: an identification module 11, a deletion rule determining module 12 and a removing module 13, wherein:
The identification module 11 is configured to identify the classification of the file to be stored;
The deletion rule determining module 12 is configured to determine, according to the classification of the file, the data de-duplication rule used for the file to be stored;
The removing module 13 is configured to perform data de-duplication on the file to be stored according to the determined data de-duplication rule.
Specifically, the classification of the file includes active files and non-active files.
The identification module 11 is configured to obtain the number of occurrences of the file type of the file to be stored and determine whether the number of occurrences of the file type is greater than a threshold; when the number of occurrences of the file type is greater than the threshold, the file to be stored is determined to be an active file; when the obtained number of occurrences of the file type of the file to be stored is not greater than the threshold, the file to be stored is determined to be a non-active file;
Or, the identification module 11 is configured to search the active file database for the file type of the file to be stored; when the file type of the file to be stored is found in the active file database, the file to be stored is determined to be an active file; when the file type of the file to be stored is not found in the active file database, the file to be stored is determined to be a non-active file.
Specifically, the deletion rule determining module 12 is specifically configured to determine, when the file to be stored is an active file, that the data de-duplication rule used for the file to be stored is the block-level data de-duplication rule.
The removing module 13 is specifically configured to: divide the file to be stored into multiple data blocks according to the block-level data de-duplication rule, and calculate the fingerprint information of each data block; compare the fingerprint information of each data block with the stored fingerprint information; when the fingerprint information of a data block is identical to stored fingerprint information, store reference information between the data block and the stored fingerprint information that is identical to the fingerprint information of the data block, and discard the data block; and when the fingerprint information of a data block differs from the stored fingerprint information, store the data block and the calculated fingerprint information of the data block.
Specifically, the deletion rule determining module 12 is specifically configured to determine, when the file to be stored is a non-active file, that the data de-duplication rule used for the file to be stored is the file-level data de-duplication rule.
The removing module 13 is specifically configured to: select at least part of the file data from the file to be stored according to the file-level data de-duplication rule, and calculate the fingerprint information of the at least part of the file data; compare the calculated fingerprint information of the at least part of the file data with the stored fingerprint information; when the calculated fingerprint information of the at least part of the file data is identical to stored fingerprint information, store reference information between the file to be stored and the stored fingerprint information that is identical to the fingerprint information of the at least part of the file data, and discard the file to be stored; and when the calculated fingerprint information differs from the stored fingerprint information, store the file to be stored and the calculated fingerprint information of the at least part of the file data.
It should be noted that the de-duplication engine apparatus according to the present invention may be a hardware device applied in a file storage server, or may be a logical unit applied in a VDI system and integrated in the VDI system, which is not specifically limited here.
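The division of labour among the three modules of Embodiment four can be sketched as three small classes. This is only an illustration under assumptions: the class and method names are hypothetical, SHA-256 is an assumed hash algorithm, the identification module uses the database-lookup variant, and for the file-level rule the whole file is fingerprinted as the selected "at least part of the file data".

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class IdentificationModule:
    """Module 11: classify a file by looking up its type in the active file database."""
    active_types: set

    def classify(self, file_type: str) -> str:
        return "active" if file_type in self.active_types else "non-active"


class DeletionRuleDeterminingModule:
    """Module 12: map the classification to a data de-duplication rule."""
    RULES = {"active": "block-level", "non-active": "file-level"}

    def rule_for(self, classification: str) -> str:
        return self.RULES[classification]


@dataclass
class RemovingModule:
    """Module 13: de-duplicate according to the chosen rule."""
    block_size: int = 4                       # hypothetical tiny block size
    store: dict = field(default_factory=dict)  # fingerprint -> stored bytes

    def deduplicate(self, data: bytes, rule: str) -> list:
        # Block-level: fixed-size blocks. File-level: the whole file as one unit.
        size = self.block_size if rule == "block-level" else max(len(data), 1)
        references = []
        for offset in range(0, len(data), size):
            unit = data[offset:offset + size]
            fp = hashlib.sha256(unit).hexdigest()
            self.store.setdefault(fp, unit)  # store only if the fingerprint is new
            references.append(fp)
        return references


identify = IdentificationModule({"doc"})
rules = DeletionRuleDeterminingModule()
remove = RemovingModule()

doc_rule = rules.rule_for(identify.classify("doc"))
doc_refs = remove.deduplicate(b"AAAABBBBAAAA", doc_rule)
log_rule = rules.rule_for(identify.classify("log"))
log_refs = remove.deduplicate(b"AAAABBBBAAAA", log_rule)
print(doc_rule, len(doc_refs), log_rule, len(log_refs), len(remove.store))
```

Keeping classification, rule selection and de-duplication in separate units mirrors the module split in the text and allows either identification variant to be swapped in without touching the removing module.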
Embodiment five:
Fig. 6 is a schematic structural diagram of data de-duplication equipment according to Embodiment five of the present invention. Embodiment five of the present invention is equipment under the same inventive concept as Embodiment four of the present invention, and the equipment includes: an input monitoring device 21, a processor 22, a memory 23 and a file database 24, wherein the input monitoring device 21, the processor 22, the memory 23 and the file database 24 are connected through a bus 25, and wherein:
The input monitoring device 21 is configured to identify the classification of the file to be stored;
The processor 22 is configured to determine, according to the classification of the file, the data de-duplication rule used for the file to be stored, and to perform data de-duplication on the file to be stored according to the determined data de-duplication rule.
Specifically, the classification of the file includes active files and non-active files.
The input monitoring device 21 is configured to obtain the number of occurrences of the file type of the file to be stored and determine whether the number of occurrences of the file type is greater than a threshold; when the number of occurrences of the file type is greater than the threshold, the file to be stored is determined to be an active file; when the obtained number of occurrences of the file type of the file to be stored is not greater than the threshold, the file to be stored is determined to be a non-active file;
Or, the input monitoring device 21 is configured to search the active file database 24 for the file type of the file to be stored; when the file type of the file to be stored is found in the active file database, the file to be stored is determined to be an active file; when the file type of the file to be stored is not found in the active file database, the file to be stored is determined to be a non-active file.
Specifically, the processor 22 is specifically configured to: when the file to be stored is an active file, determine that the data de-duplication rule used for the file to be stored is the block-level data de-duplication rule; divide the file to be stored into multiple data blocks according to the block-level data de-duplication rule, and calculate the fingerprint information of each data block; compare the fingerprint information of each data block with the stored fingerprint information; when the fingerprint information of a data block is identical to stored fingerprint information, store reference information between the data block and the stored fingerprint information that is identical to the fingerprint information of the data block, and discard the data block; and when the fingerprint information of a data block differs from the stored fingerprint information, store the data block and the calculated fingerprint information of the data block.
The processor 22 is specifically configured to: when the file to be stored is a non-active file, determine that the data de-duplication rule used for the file to be stored is the file-level data de-duplication rule; select at least part of the file data from the file to be stored according to the file-level data de-duplication rule, and calculate the fingerprint information of the at least part of the file data; compare the calculated fingerprint information of the at least part of the file data with the stored fingerprint information; when the calculated fingerprint information of the at least part of the file data is identical to stored fingerprint information, store reference information between the file to be stored and the stored fingerprint information that is identical to the fingerprint information of the at least part of the file data, and discard the file to be stored; and when the calculated fingerprint information differs from the stored fingerprint information, store the file to be stored and the calculated fingerprint information of the at least part of the file data.
It should be noted that the non-duplicate data of the file to be stored is stored in the memory 23.
As shown in fig. 7, for the logical architecture figure of duplicate data sweep equipment.Wherein, the data de-duplication equipment bag Include:Active file identification module 31, active file data base 32, active file adjusting module 33, IO watch-dogs 34, write command Unit 35, reading instruction unit 36 and main storage 37.
Specifically, the IO watch-dogs 34, for receiving file to be stored, and the file to be stored for receiving are sent to Active file identification module 31.
The active file identification module 31, for obtaining the occurrence number of the file type of file to be stored, and judges Whether the occurrence number of the file type of acquisition is more than threshold value.
The active file identification module 31, for scanning All Files in active file data base, and determines each The file type of file, for identical file type, obtains the file type from file type essential information storehouse and occurs Number of times, count the number of number of times, the file-level number of repetition of this document type and this document type that the file type occurs According to block level number of repetition, and file type number of repetition statistical table is generated, it is arbitrary in reading file type number of repetition statistical table The data message of file type, according to the file-level number of repetition and block level number of repetition of the file type, determines institute The whole file repetitive rate of file type is stated, whole file repetitive rate and the threshold value of calculated each file type are carried out Relatively, and according to comparative result, determine that the corresponding file of each file type is belonging to active file and still falls within non-commonly using File.
The active file identification module 31 is further configured to judge whether the file type of the file to be stored is identical to a file type stored in the active file database; when the file type of the file to be stored is identical to a file type stored in the active file database, determine that the file to be stored is an active file; and when the file type of the file to be stored differs from the file types stored in the active file database, determine that the file to be stored is a non-active file.
The active file database 32 is configured to store active files.
The active file adjusting module 33 is configured to refresh the file types determined to correspond to active files into the active file database, and to delete the file types determined to correspond to non-active files from the active file database.
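The adjustment performed by module 33 amounts to adding and removing file types from a set. A minimal sketch, assuming the active file database is modelled as an in-memory set of file type strings and that the identification results arrive as a mapping from file type to a boolean (both the function name `adjust_active_types` and these data shapes are hypothetical):

```python
def adjust_active_types(active_db, verdicts):
    """Return an updated copy of the active file database.

    active_db: set of file types currently recorded as active.
    verdicts:  dict mapping file type -> True (active) / False (non-active).
    """
    updated = set(active_db)
    for file_type, is_active in verdicts.items():
        if is_active:
            updated.add(file_type)      # refresh into the database
        else:
            updated.discard(file_type)  # delete from the database
    return updated
```

Returning a copy rather than mutating the input is a design choice of this sketch, not something the source prescribes.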
The write instruction unit 35 and the read instruction unit 36 are configured to perform a write operation or a read operation on the file to be stored.
Specifically, the write instruction unit 35 is configured to: when the fingerprint information of a data block is identical to stored fingerprint information, store reference information between the data block and the stored fingerprint information that is identical to that of the data block, and discard the data block; and when the fingerprint information of a data block differs from the stored fingerprint information, store the data block and its calculated fingerprint information (in the main memory 37);
or, when the calculated fingerprint information of at least part of the file data is identical to stored fingerprint information, store reference information between the file to be stored and the stored fingerprint information that is identical to that of the at least part of the file data, and discard the file to be stored; and when the calculated fingerprint information differs from the stored fingerprint information, store the file to be stored and the calculated fingerprint information of the at least part of the file data (in the main memory 37).
Fig. 8 is a system architecture diagram of the data de-duplication device. The system includes: virtual machines (VMs) 411 to 41n, a hypervisor 42, a data de-duplication device 43, and a main storage device 44, where:
The data de-duplication device 43 is configured to collect all files to be stored from the hypervisor 42, perform data de-duplication on the files to be stored, and store the de-duplicated data in the main storage device 44.
Specifically, the data de-duplication device 43 is configured to identify the category of a file to be stored; determine, according to the category of the file, the data de-duplication rule to be used for the file to be stored; and perform data de-duplication on the file to be stored according to the determined data de-duplication rule.
Specifically, the categories of files include active files and non-active files.
The data de-duplication device 43 is configured to obtain the number of occurrences of the file type of the file to be stored and judge whether the number of occurrences is greater than a threshold; when the number of occurrences of the file type is greater than the threshold, determine that the file to be stored is an active file; and when the obtained number of occurrences of the file type of the file to be stored is not greater than the threshold, determine that the file to be stored is a non-active file;
or, search the active file database for the file type of the file to be stored; when the file type of the file to be stored is found in the active file database, determine that the file to be stored is an active file; and when the file type of the file to be stored is not found in the active file database, determine that the file to be stored is a non-active file.
Specifically, the data de-duplication device 43 is configured so that, when the file to be stored is an active file, the data de-duplication rule used for the file to be stored is block-level data de-duplication; and when the file to be stored is a non-active file, the data de-duplication rule used for the file to be stored is file-level data de-duplication.
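The two identification options and the rule selection described above can be sketched as follows. This is an illustration under stated assumptions: the names `DedupRule`, `is_active_by_count`, `is_active_by_lookup`, and `select_rule` are hypothetical, the occurrence counts and the active file database are modelled as a plain dict and set, and a file type with no recorded count is treated as having occurred zero times.

```python
from enum import Enum


class DedupRule(Enum):
    BLOCK_LEVEL = "block-level"
    FILE_LEVEL = "file-level"


def is_active_by_count(file_type, occurrence_counts, threshold):
    """Option 1: active only if the occurrence count is greater than the threshold."""
    # Assumption: an unseen file type counts as zero occurrences.
    return occurrence_counts.get(file_type, 0) > threshold


def is_active_by_lookup(file_type, active_file_db):
    """Option 2: active if the file type is found in the active file database."""
    return file_type in active_file_db


def select_rule(is_active):
    """Active files get block-level de-duplication, non-active files file-level."""
    return DedupRule.BLOCK_LEVEL if is_active else DedupRule.FILE_LEVEL
```

Note that a count equal to the threshold yields a non-active file, matching the "not greater than" wording of the text.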
The block-level data de-duplication rule is a rule under which the file corresponding to a file type is divided into multiple data blocks according to a set data block division rule, the fingerprint information of each data block is calculated, and data de-duplication is performed according to the calculated fingerprint information of each data block.
Specifically, the data de-duplication device 43 is configured to: divide the file to be stored into multiple data blocks according to the block-level data de-duplication rule, and calculate the fingerprint information of each data block; compare the fingerprint information of each data block with stored fingerprint information; when the fingerprint information of a data block is identical to stored fingerprint information, store reference information between the data block and the stored fingerprint information that is identical to that of the data block, and discard the data block; and when the fingerprint information of a data block differs from the stored fingerprint information, store the data block and its calculated fingerprint information.
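A minimal sketch of the block-level path follows. The source speaks only of "a set data block division rule" and "fingerprint information", so fixed-size blocks and SHA-256 are assumptions made here for illustration, as are the class name `BlockStore`, the in-memory dictionary, and the idea of returning the ordered fingerprint list as the reference information.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed-size division rule


class BlockStore:
    """In-memory stand-in for the main storage (hypothetical)."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> block bytes

    def write_file(self, data, block_size=BLOCK_SIZE):
        """De-duplicate `data` block by block; return its list of fingerprints."""
        recipe = []
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            fingerprint = hashlib.sha256(block).hexdigest()
            if fingerprint not in self.blocks:
                # New fingerprint: store the block together with it.
                self.blocks[fingerprint] = block
            # Known fingerprint: the block itself is discarded; in both
            # cases only the reference is recorded for this file.
            recipe.append(fingerprint)
        return recipe

    def read_file(self, recipe):
        """Rebuild a file from its list of fingerprint references."""
        return b"".join(self.blocks[fingerprint] for fingerprint in recipe)
```

For example, writing a 24-byte file made of the blocks `A…`, `B…`, `A…` with `block_size=8` stores only two distinct blocks, and `read_file` reproduces the original bytes from the three references.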
Specifically, the file-level data de-duplication rule is a rule under which at least part of the file data is selected from a file, the fingerprint information of the selected file data is calculated, and data de-duplication is performed according to the calculated fingerprint information of the file data.
Specifically, the data de-duplication device 43 is configured to: select at least part of the file data from the file to be stored according to the file-level data de-duplication rule, and calculate the fingerprint information of the at least part of the file data; compare the calculated fingerprint information of the at least part of the file data with stored fingerprint information; when the calculated fingerprint information of the at least part of the file data is identical to stored fingerprint information, store reference information between the file to be stored and the stored fingerprint information that is identical to that of the at least part of the file data, and discard the file to be stored; and when the calculated fingerprint information differs from the stored fingerprint information, store the file to be stored and the calculated fingerprint information of the at least part of the file data.
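The file-level path can be sketched in the same way. Again this is only an illustration: SHA-256 as the fingerprint, the class name `FileStore`, and the `sample_size` parameter (fingerprint the first `sample_size` bytes, or the whole file when it is `None`) are assumptions. The source says only that "at least part" of the file data is selected, without saying which part.

```python
import hashlib


class FileStore:
    """In-memory stand-in for file-level de-duplicated storage (hypothetical)."""

    def __init__(self, sample_size=None):
        self.sample_size = sample_size  # None means fingerprint the whole file
        self.files = {}                 # fingerprint -> file bytes
        self.references = {}            # file name -> fingerprint

    def _fingerprint(self, data):
        sample = data if self.sample_size is None else data[:self.sample_size]
        return hashlib.sha256(sample).hexdigest()

    def write_file(self, name, data):
        """Store `data` unless its fingerprint is already known.

        Returns True if the file contents were stored, False if the file
        was discarded and only a reference was recorded.
        """
        fingerprint = self._fingerprint(data)
        is_new = fingerprint not in self.files
        if is_new:
            self.files[fingerprint] = data
        self.references[name] = fingerprint
        return is_new
```

Fingerprinting only a prefix is cheaper but treats any two files with the same prefix as duplicates, so the default in this sketch hashes the whole file.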
Those skilled in the art will understand that embodiments of the present invention may be provided as a method, an apparatus (device), or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, apparatus (device), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is also intended to include these changes and variations.

Claims (12)

1. A data de-duplication method, characterized by comprising:
identifying the category of a file to be stored according to the result of comparing the number of occurrences of the file type of the file to be stored with a set threshold; or identifying the category of the file to be stored according to the result of searching an active file database for the file type of the file to be stored;
wherein the categories of files include active files and non-active files;
determining, according to the category of the file, the data de-duplication rule to be used for the file to be stored, which specifically comprises: when the file to be stored is an active file, the data de-duplication rule used for the file to be stored is block-level data de-duplication; and when the file to be stored is a non-active file, the data de-duplication rule used for the file to be stored is file-level data de-duplication; and
performing data de-duplication on the file to be stored according to the determined data de-duplication rule.
2. The method according to claim 1, characterized in that identifying the category of the file to be stored according to the result of comparing the number of occurrences of the file type of the file to be stored with the set threshold specifically comprises:
obtaining the number of occurrences of the file type of the file to be stored, and judging whether the number of occurrences of the file type is greater than the threshold; when the number of occurrences of the file type is greater than the threshold, determining that the file to be stored is an active file; and when the obtained number of occurrences of the file type of the file to be stored is not greater than the threshold, determining that the file to be stored is a non-active file;
or, identifying the category of the file to be stored according to the result of searching the active file database for the file type of the file to be stored specifically comprises:
searching the active file database for the file type of the file to be stored; when the file type of the file to be stored is found in the active file database, determining that the file to be stored is an active file; and when the file type of the file to be stored is not found in the active file database, determining that the file to be stored is a non-active file.
3. The method according to claim 1, characterized in that performing data de-duplication on the file to be stored according to the determined data de-duplication rule specifically comprises:
dividing the file to be stored into multiple data blocks according to the block-level data de-duplication rule, and calculating the fingerprint information of each data block;
comparing the fingerprint information of each data block with stored fingerprint information;
when the fingerprint information of a data block is identical to stored fingerprint information, storing reference information between the data block and the stored fingerprint information that is identical to that of the data block, and discarding the data block; and when the fingerprint information of a data block differs from the stored fingerprint information, storing the data block and its calculated fingerprint information.
4. The method according to claim 1, characterized in that performing data de-duplication on the file to be stored according to the determined data de-duplication rule specifically comprises:
selecting at least part of the file data from the file to be stored according to the file-level data de-duplication rule, and calculating the fingerprint information of the at least part of the file data;
comparing the calculated fingerprint information of the at least part of the file data with stored fingerprint information;
when the calculated fingerprint information of the at least part of the file data is identical to stored fingerprint information, storing reference information between the file to be stored and the stored fingerprint information that is identical to that of the at least part of the file data, and discarding the file to be stored; and when the calculated fingerprint information differs from the stored fingerprint information, storing the file to be stored and the calculated fingerprint information of the at least part of the file data.
5. A data de-duplication device, characterized by comprising:
an identification module, configured to identify the category of a file to be stored according to the result of comparing the number of occurrences of the file type of the file to be stored with a set threshold; or configured to identify the category of the file to be stored according to the result of searching an active file database for the file type of the file to be stored; wherein the categories of files include active files and non-active files;
a deletion rule determining module, configured to determine, according to the category of the file, the data de-duplication rule to be used for the file to be stored, and specifically configured so that: when the file to be stored is an active file, the data de-duplication rule used for the file to be stored is block-level data de-duplication; and when the file to be stored is a non-active file, the data de-duplication rule used for the file to be stored is file-level data de-duplication; and
a deletion module, configured to perform data de-duplication on the file to be stored according to the determined data de-duplication rule.
6. The device according to claim 5, characterized in that, when identifying the category of the file to be stored according to the result of comparing the number of occurrences of the file type of the file to be stored with the set threshold, the identification module is specifically configured to: obtain the number of occurrences of the file type of the file to be stored, and judge whether the number of occurrences of the file type is greater than the threshold; when the number of occurrences of the file type is greater than the threshold, determine that the file to be stored is an active file; and when the obtained number of occurrences of the file type of the file to be stored is not greater than the threshold, determine that the file to be stored is a non-active file;
or, when identifying the category of the file to be stored according to the result of searching the active file database for the file type of the file to be stored, the identification module is specifically configured to: search the active file database for the file type of the file to be stored; when the file type of the file to be stored is found in the active file database, determine that the file to be stored is an active file; and when the file type of the file to be stored is not found in the active file database, determine that the file to be stored is a non-active file.
7. The device according to claim 5, characterized in that
the deletion module is specifically configured to: divide the file to be stored into multiple data blocks according to the block-level data de-duplication rule, and calculate the fingerprint information of each data block; compare the fingerprint information of each data block with stored fingerprint information; when the fingerprint information of a data block is identical to stored fingerprint information, store reference information between the data block and the stored fingerprint information that is identical to that of the data block, and discard the data block; and when the fingerprint information of a data block differs from the stored fingerprint information, store the data block and its calculated fingerprint information.
8. The device according to claim 5, characterized in that
the deletion module is specifically configured to: select at least part of the file data from the file to be stored according to the file-level data de-duplication rule, and calculate the fingerprint information of the at least part of the file data; compare the calculated fingerprint information of the at least part of the file data with stored fingerprint information; when the calculated fingerprint information of the at least part of the file data is identical to stored fingerprint information, store reference information between the file to be stored and the stored fingerprint information that is identical to that of the at least part of the file data, and discard the file to be stored; and when the calculated fingerprint information differs from the stored fingerprint information, store the file to be stored and the calculated fingerprint information of the at least part of the file data.
9. A data de-duplication device, characterized by comprising:
an input monitor, configured to identify the category of a file to be stored according to the result of comparing the number of occurrences of the file type of the file to be stored with a set threshold; or configured to identify the category of the file to be stored according to the result of searching an active file database for the file type of the file to be stored; wherein the categories of files include active files and non-active files; and
a processor, configured to determine, according to the category of the file, the data de-duplication rule to be used for the file to be stored, and to perform data de-duplication on the file to be stored according to the determined data de-duplication rule;
wherein, when determining, according to the category of the file, the data de-duplication rule to be used for the file to be stored, the processor is specifically configured so that: when the file to be stored is an active file, the data de-duplication rule used for the file to be stored is block-level data de-duplication; and when the file to be stored is a non-active file, the data de-duplication rule used for the file to be stored is file-level data de-duplication.
10. The device according to claim 9, characterized in that, when identifying the category of the file to be stored according to the result of comparing the number of occurrences of the file type of the file to be stored with the set threshold, the input monitor is specifically configured to: obtain the number of occurrences of the file type of the file to be stored, and judge whether the number of occurrences of the file type is greater than the threshold; when the number of occurrences of the file type is greater than the threshold, determine that the file to be stored is an active file; and when the obtained number of occurrences of the file type of the file to be stored is not greater than the threshold, determine that the file to be stored is a non-active file;
or, when identifying the category of the file to be stored according to the result of searching the active file database for the file type of the file to be stored, the input monitor is specifically configured to: search the active file database for the file type of the file to be stored; when the file type of the file to be stored is found in the active file database, determine that the file to be stored is an active file; and when the file type of the file to be stored is not found in the active file database, determine that the file to be stored is a non-active file.
11. The device according to claim 9, characterized in that, when performing data de-duplication on the file to be stored according to the determined data de-duplication rule, the processor is specifically configured to:
divide the file to be stored into multiple data blocks according to the block-level data de-duplication rule, and calculate the fingerprint information of each data block; compare the fingerprint information of each data block with stored fingerprint information; when the fingerprint information of a data block is identical to stored fingerprint information, store reference information between the data block and the stored fingerprint information that is identical to that of the data block, and discard the data block; and when the fingerprint information of a data block differs from the stored fingerprint information, store the data block and its calculated fingerprint information.
12. The device according to claim 9, characterized in that, when performing data de-duplication on the file to be stored according to the determined data de-duplication rule, the processor is specifically configured to:
select at least part of the file data from the file to be stored according to the file-level data de-duplication rule, and calculate the fingerprint information of the at least part of the file data; compare the calculated fingerprint information of the at least part of the file data with stored fingerprint information; when the calculated fingerprint information of the at least part of the file data is identical to stored fingerprint information, store reference information between the file to be stored and the stored fingerprint information that is identical to that of the at least part of the file data, and discard the file to be stored; and when the calculated fingerprint information differs from the stored fingerprint information, store the file to be stored and the calculated fingerprint information of the at least part of the file data.
CN201310230732.5A 2013-06-09 2013-06-09 Duplicated data deleting method and apparatus Active CN103309975B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310230732.5A CN103309975B (en) 2013-06-09 2013-06-09 Duplicated data deleting method and apparatus

Publications (2)

Publication Number Publication Date
CN103309975A CN103309975A (en) 2013-09-18
CN103309975B true CN103309975B (en) 2017-04-26

Family

ID=49135193

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310230732.5A Active CN103309975B (en) 2013-06-09 2013-06-09 Duplicated data deleting method and apparatus

Country Status (1)

Country Link
CN (1) CN103309975B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104753972A (en) * 2013-12-25 2015-07-01 腾讯科技(深圳)有限公司 Network resource collection processing method and server
CN104933010B (en) * 2014-03-18 2019-02-19 华为技术有限公司 A kind of data de-duplication method and device
CN105589803B (en) * 2014-10-24 2018-12-28 阿里巴巴集团控股有限公司 A kind of generation method and terminal device of testing tool
CN105511812B (en) * 2015-12-10 2018-12-18 浪潮(北京)电子信息产业有限公司 A kind of storage system big data optimization method and device
CN105786655A (en) * 2016-03-08 2016-07-20 成都云祺科技有限公司 Repeated data deleting method for virtual machine backup data
CN106610792A (en) * 2016-07-28 2017-05-03 四川用联信息技术有限公司 Repeating data deleting algorithm in cloud storage
CN106294591A (en) * 2016-07-29 2017-01-04 北京金山安全软件有限公司 File storage method and device and electronic equipment
CN110096483B (en) * 2019-05-08 2021-04-30 北京奇艺世纪科技有限公司 Duplicate file detection method, terminal and server
CN111143288A (en) * 2019-12-22 2020-05-12 北京浪潮数据技术有限公司 Data storage method, system and related device
CN112559452B (en) * 2020-12-11 2021-12-17 北京云宽志业网络技术有限公司 Data deduplication processing method, device, equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9275067B2 (en) * 2009-03-16 2016-03-01 International Busines Machines Corporation Apparatus and method to sequentially deduplicate data
CN101882141A (en) * 2009-05-08 2010-11-10 北京众志和达信息技术有限公司 Method and system for implementing repeated data deletion
US8396899B2 (en) * 2009-11-23 2013-03-12 Dell Products L.P. Efficient segment detection for deduplication
CN101706825B (en) * 2009-12-10 2011-04-20 华中科技大学 Replicated data deleting method based on file content types
CN101908077B (en) * 2010-08-27 2012-11-21 华中科技大学 Duplicated data deleting method applicable to cloud backup

Also Published As

Publication number Publication date
CN103309975A (en) 2013-09-18

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant