CN103309975B - Duplicated data deleting method and apparatus - Google Patents
Duplicated data deleting method and apparatus
- Publication number: CN103309975B (CN201310230732.5A)
- Authority: CN (China)
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses a duplicated data deleting method and apparatus. The duplicated data deleting method comprises the following steps: identifying the classification of a file to be stored, determining the duplicated data deleting rule to be used for the file to be stored according to the file classification, and deleting duplicated data from the file to be stored according to the determined rule. Because the duplicated data deleting rule is determined according to the file classification, duplicated data is deleted in a targeted manner and the duplicated data deleting ratio is improved.
Description
Technical field
The present invention relates to the field of data storage, and more particularly to a method and device for data de-duplication based on file classification.
Background technology
With the spread of cloud computing technology, applications of virtual desktop infrastructure (VDI) based on cloud computing have developed rapidly. Both in China and abroad, many large enterprises and government bodies are switching their conventional personal computers (PCs) to VDI desktop clouds, so that PCs that used to be mutually isolated islands of information are linked together.
According to research data, 60% of the data stored by different users is duplicate data, and among users in the same department the proportion of duplicate data is as high as 80%. How to effectively delete duplicate data between users has therefore become a concern in the field of data storage.
At present, the key to data de-duplication technology is to use the SHA-1 digest algorithm to calculate fingerprint information that identifies the distinct content of a file. One way of calculating this fingerprint information is coarse-grained, per file: for example, the fingerprint information of a file is calculated from the summary information of that file. After the fingerprint information has been calculated in this way, the de-duplication technique compares it with the fingerprint information stored in a fingerprint database. When the calculated fingerprint information is identical to fingerprint information stored in the fingerprint database, the file or data block from which it was calculated is duplicate data and must be de-duplicated; otherwise, the file or data block is non-duplicate data and need not be de-duplicated.
However, the following problem arises in practical applications.
Suppose that file A has already been stored, and that fingerprint information 1 of file A, calculated from the summary information of file A, is held in the fingerprint database. For a file B to be stored, fingerprint information 2 is calculated from the summary information of file B. File A and file B are of the same file type.
Compared with file A, file B has different summary information, but the parts of file B other than the summary are identical to the corresponding parts of file A. In this case fingerprint information 1 differs from fingerprint information 2, so file B is treated as non-duplicate data relative to file A and is stored. File B nevertheless contains a large amount of data identical to file A, so the data de-duplication rate of the files (the ratio of the total amount of original files to the total amount of files output after de-duplication) is low.
That is, for files of the same file type, when the data used to calculate the fingerprint information changes, the data de-duplication rate of the files becomes low.
Summary of the invention
Embodiments of the present invention provide a data de-duplication method and device.
According to a first aspect of the present invention, a method for de-duplicating a file is provided, including:
identifying the classification of a file to be stored;
determining, according to the classification of the file, the data de-duplication rule used for the file to be stored;
performing data de-duplication on the file to be stored according to the determined data de-duplication rule.
With reference to the first aspect, in a first possible implementation, the classification of the file includes active files and non-active files;
identifying the classification of the file to be stored specifically includes:
obtaining the occurrence count of the file type of the file to be stored, and judging whether the occurrence count of the file type is greater than a threshold; when the occurrence count of the file type is greater than the threshold, determining the file to be stored to be an active file; when the occurrence count of the file type is not greater than the threshold, determining the file to be stored to be a non-active file;
or, looking up the file type of the file to be stored in an active file database; when the file type of the file to be stored is found in the active file database, determining the file to be stored to be an active file; when the file type of the file to be stored is not found in the active file database, determining the file to be stored to be a non-active file.
With reference to the first possible implementation of the first aspect, in a second possible implementation, determining the data de-duplication rule used for the file to be stored according to the classification of the file specifically includes:
when the file to be stored is an active file, the data de-duplication rule used for the file to be stored is block-level data de-duplication;
performing data de-duplication on the file to be stored according to the determined data de-duplication rule specifically includes:
dividing the file to be stored into multiple data blocks according to the block-level data de-duplication rule, and calculating the fingerprint information of each data block;
comparing the fingerprint information of each data block with the stored fingerprint information;
when the fingerprint information of a data block is identical to stored fingerprint information, storing reference information between the data block and the stored fingerprint information that is identical to the fingerprint information of the data block, and discarding the data block; when the fingerprint information of a data block differs from the stored fingerprint information, storing the data block and the calculated fingerprint information of the data block.
With reference to the first possible implementation of the first aspect, in a third possible implementation, determining the data de-duplication rule used for the file to be stored according to the classification of the file specifically includes:
when the file to be stored is a non-active file, the data de-duplication rule used for the file to be stored is file-level data de-duplication;
performing data de-duplication on the file to be stored according to the determined data de-duplication rule specifically includes:
selecting at least part of the file data from the file to be stored according to the file-level data de-duplication rule, and calculating the fingerprint information of the at least part of the file data;
comparing the calculated fingerprint information of the at least part of the file data with the stored fingerprint information;
when the calculated fingerprint information of the at least part of the file data is identical to stored fingerprint information, storing reference information between the file to be stored and the stored fingerprint information that is identical to the fingerprint information of the at least part of the file data, and discarding the file to be stored; when the calculated fingerprint information differs from the stored fingerprint information, storing the file to be stored and the calculated fingerprint information of the at least part of the file data.
According to a second aspect of the present invention, a de-duplication engine apparatus is provided, including:
an identification module, configured to identify the classification of a file to be stored;
a deletion rule determining module, configured to determine, according to the classification of the file, the data de-duplication rule used for the file to be stored;
a deletion module, configured to perform data de-duplication on the file to be stored according to the determined data de-duplication rule.
With reference to the second aspect, in a first possible implementation, the classification of the file includes active files and non-active files;
the identification module is specifically configured to obtain the occurrence count of the file type of the file to be stored and judge whether the occurrence count of the file type is greater than a threshold; when the occurrence count of the file type is greater than the threshold, determine the file to be stored to be an active file; when the occurrence count of the file type is not greater than the threshold, determine the file to be stored to be a non-active file;
or, to look up the file type of the file to be stored in an active file database; when the file type of the file to be stored is found in the active file database, determine the file to be stored to be an active file; when the file type of the file to be stored is not found in the active file database, determine the file to be stored to be a non-active file.
With reference to the first possible implementation of the second aspect, in a second possible implementation, the deletion rule determining module is specifically configured so that, when the file to be stored is an active file, the data de-duplication rule used for the file to be stored is block-level data de-duplication;
the deletion module is specifically configured to divide the file to be stored into multiple data blocks according to the block-level data de-duplication rule and calculate the fingerprint information of each data block; compare the fingerprint information of each data block with the stored fingerprint information; when the fingerprint information of a data block is identical to stored fingerprint information, store reference information between the data block and the stored fingerprint information that is identical to the fingerprint information of the data block, and discard the data block; when the fingerprint information of a data block differs from the stored fingerprint information, store the data block and the calculated fingerprint information of the data block.
With reference to the first possible implementation of the second aspect, in a third possible implementation, the deletion rule determining module is specifically configured so that, when the file to be stored is a non-active file, the data de-duplication rule used for the file to be stored is file-level data de-duplication;
the deletion module is specifically configured to select at least part of the file data from the file to be stored according to the file-level data de-duplication rule and calculate the fingerprint information of the at least part of the file data; compare the calculated fingerprint information of the at least part of the file data with the stored fingerprint information; when the calculated fingerprint information of the at least part of the file data is identical to stored fingerprint information, store reference information between the file to be stored and the stored fingerprint information that is identical to the fingerprint information of the at least part of the file data, and discard the file to be stored; when the calculated fingerprint information differs from the stored fingerprint information, store the file to be stored and the calculated fingerprint information of the at least part of the file data.
According to a third aspect of the present invention, a data de-duplication device is provided, including:
an input monitoring device, configured to identify the classification of a file to be stored;
a processor, configured to determine, according to the classification of the file, the data de-duplication rule used for the file to be stored, and to perform data de-duplication on the file to be stored according to the determined data de-duplication rule.
With reference to the third aspect, in a first possible implementation, the classification of the file includes active files and non-active files;
the input monitoring device is specifically configured to obtain the occurrence count of the file type of the file to be stored and judge whether the occurrence count of the file type is greater than a threshold; when the occurrence count of the file type is greater than the threshold, determine the file to be stored to be an active file; when the occurrence count of the file type is not greater than the threshold, determine the file to be stored to be a non-active file;
or, to look up the file type of the file to be stored in an active file database; when the file type of the file to be stored is found in the active file database, determine the file to be stored to be an active file; when the file type of the file to be stored is not found in the active file database, determine the file to be stored to be a non-active file.
With reference to the first possible implementation of the third aspect, in a second possible implementation, the processor is specifically configured so that, when the file to be stored is an active file, the data de-duplication rule used for the file to be stored is block-level data de-duplication; and to divide the file to be stored into multiple data blocks according to the block-level data de-duplication rule and calculate the fingerprint information of each data block; compare the fingerprint information of each data block with the stored fingerprint information; when the fingerprint information of a data block is identical to stored fingerprint information, store reference information between the data block and the stored fingerprint information that is identical to the fingerprint information of the data block, and discard the data block; when the fingerprint information of a data block differs from the stored fingerprint information, store the data block and the calculated fingerprint information of the data block.
With reference to the first possible implementation of the third aspect, in a third possible implementation, the processor is specifically configured so that, when the file to be stored is a non-active file, the data de-duplication rule used for the file to be stored is file-level data de-duplication; and to select at least part of the file data from the file to be stored according to the file-level data de-duplication rule and calculate the fingerprint information of the at least part of the file data; compare the calculated fingerprint information of the at least part of the file data with the stored fingerprint information; when the calculated fingerprint information of the at least part of the file data is identical to stored fingerprint information, store reference information between the file to be stored and the stored fingerprint information that is identical to the fingerprint information of the at least part of the file data, and discard the file to be stored; when the calculated fingerprint information differs from the stored fingerprint information, store the file to be stored and the calculated fingerprint information of the at least part of the file data.
In the embodiments of the present invention, the classification of a file to be stored is identified, the data de-duplication rule used for the file to be stored is determined according to the file classification, and data de-duplication is performed on the file to be stored according to the determined rule. Because the data de-duplication rule is determined from the classification of the file, the file to be stored is de-duplicated in a targeted manner, which improves the file data de-duplication rate.
Description of the drawings
Fig. 1 is a schematic flowchart of a data de-duplication method according to Embodiment 1 of the present invention;
Fig. 2 is a schematic flowchart of a data de-duplication method according to Embodiment 2 of the present invention;
Fig. 3 is a schematic flowchart of a method for obtaining the active files in the active file database;
Fig. 4 is a schematic flowchart of a data de-duplication method according to Embodiment 3 of the present invention;
Fig. 5 is a schematic structural diagram of a data de-duplication device according to Embodiment 4 of the present invention;
Fig. 6 is a schematic structural diagram of a data de-duplication device according to Embodiment 5 of the present invention;
Fig. 7 is a logical architecture diagram of the data de-duplication device;
Fig. 8 is a system architecture diagram of the data de-duplication device.
Specific embodiment
To achieve the object of the invention, embodiments of the present invention provide a data de-duplication method and device. The classification of a file to be stored is identified, the data de-duplication rule used for the file to be stored is determined according to the file classification, and data de-duplication is performed on the file to be stored according to the determined rule. Because the data de-duplication rule is determined from the classification of the file, the file to be stored is de-duplicated in a targeted manner, which improves the file data de-duplication rate.
It should be noted that the set values and thresholds involved in the embodiments of the present invention may be determined according to actual needs or according to experimental data; they are not limited here.
The embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Embodiment one:
Fig. 1 is a schematic flowchart of a data de-duplication method according to Embodiment 1 of the present invention. The method includes:
Step 101: Identify the classification of the file to be stored.
The classification of the file includes active files and non-active files.
Specifically, in step 101, the file format of the obtained file to be stored is identified, the file type of the file to be stored is determined, and the classification category to which that file type belongs is determined according to a file classification rule.
The file type includes, but is not limited to, one or more of the doc, txt, pdf and ppt file types.
The file classification rule includes file size (large files and small files), file generation time (expired files and new files), occurrence count (active files and non-active files), and so on.
Preferably, first, the file format of the file to be stored is obtained, and the file type corresponding to the file format is determined.
For example, if the file format of the file to be stored, obtained through a read/write operation, is XXX.doc, the file type corresponding to that file format is determined to be the doc file type.
Second, the determined file type is compared with the file types stored in the active file database.
Specifically, it is judged whether the determined file type is identical to a file type stored in the active file database,
or the active file database is searched for a file type identical to the determined file type.
The active file database is built by a file type identification device, which identifies the file type of each received file and records the number of times each file type occurs. When a set time period elapses, the file types that have occurred are classified, which specifically includes:
comparing the number of times each file type has occurred with a set threshold; when the number of times a file type has occurred is greater than the set threshold, determining that the file type is an active file type; when the number of times a file type has occurred is not greater than the set threshold, determining that the file type is a non-active file type.
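The periodic classification can be sketched as follows. This is a minimal illustration, assuming the file types observed during one time period are available as a list and that the active file database is an in-memory set; the name `build_active_file_database` and the sample counts are hypothetical.

```python
from collections import Counter


def build_active_file_database(observed_types, threshold):
    """Return the set of file types that occurred more than `threshold` times.

    `observed_types` holds one entry per received file during the period.
    File types at or below the threshold are treated as non-active and
    are left out of the returned set.
    """
    counts = Counter(observed_types)
    return {file_type for file_type, n in counts.items() if n > threshold}


observed = ["doc"] * 5 + ["txt"] * 2 + ["pdf"] * 3
active_types = build_active_file_database(observed, threshold=2)
```

With a threshold of 2, `txt` is excluded because its count is not greater than the threshold.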
Preferably, only the file types determined to be active file types are stored in the active file database, and the file types judged to be non-active are deleted.
It should be noted that the active file database may store not only active file types but also non-active file types; this is not limited here.
In this way, by adjusting the active file types in the active file database in real time, the file types with the highest or relatively high frequency of use are determined. That is, the file types that carry the larger data de-duplication workload are selected, and a suitable data de-duplication rule is determined for them, which improves the file data de-duplication rate.
Third, when a file type identical to the file type of the file to be stored is found in the active file database, the file to be stored is determined to be an active file; when no file type identical to the file type of the file to be stored is found in the active file database, the file to be stored is determined to be a non-active file.
Specifically, one method of identifying the classification of the file to be stored includes:
First, after the file format of the file to be stored is obtained and the file type corresponding to the file format is determined, the occurrence count of the determined file type is obtained.
Second, it is judged whether the occurrence count of the file type is greater than a threshold.
Third, when the occurrence count of the determined file type is greater than the threshold, files of that file type are determined to be active files; when the occurrence count of the file type is not greater than the threshold, files of that file type are determined to be non-active files.
Alternatively, the method of identifying the classification of the file to be stored includes:
First, the file format of the file to be stored is obtained, and the file type corresponding to the file format is determined.
Second, the active file database is searched for the file type of the file to be stored.
Third, when the file type of the file to be stored is found in the active file database, the file to be stored is determined to be an active file; when the file type of the file to be stored is not found in the active file database, the file to be stored is determined to be a non-active file.
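The two alternative identification methods can be sketched side by side. The function names, the string labels and the sample data below are assumptions made for illustration; occurrence counts are modelled as a plain dictionary and the active file database as a set.

```python
ACTIVE = "active file"
NON_ACTIVE = "non-active file"


def classify_by_count(file_type, occurrence_counts, threshold):
    """First method: compare the file type's occurrence count with a threshold."""
    count = occurrence_counts.get(file_type, 0)
    return ACTIVE if count > threshold else NON_ACTIVE


def classify_by_lookup(file_type, active_file_database):
    """Second method: look the file type up in the active file database."""
    return ACTIVE if file_type in active_file_database else NON_ACTIVE


counts = {"doc": 100, "txt": 3}
by_count = classify_by_count("doc", counts, threshold=50)
by_lookup = classify_by_lookup("txt", {"doc", "pdf"})
```

The lookup variant moves the threshold comparison to the time the database is built, so classifying a single file needs only a membership test.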
It should be noted that the embodiments below use the scheme based on the active file database.
Of course, in another embodiment, the database may store the determined active file types and may also store the determined non-active file types.
Step 102: Determine, according to the classification of the file, the data de-duplication rule used for the file to be stored.
Specifically, in step 102, when the file to be stored is an active file, the data de-duplication rule used for the file to be stored is block-level data de-duplication;
when the file to be stored is a non-active file, the data de-duplication rule used for the file to be stored is file-level data de-duplication.
That is, a correspondence is established between the classification categories of files and the data de-duplication rules: active files correspond to the block-level data de-duplication rule, and non-active files correspond to the file-level data de-duplication rule.
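The correspondence described above amounts to a small lookup table. The sketch below is one way to express it; the enum and function names are hypothetical.

```python
from enum import Enum


class FileClass(Enum):
    ACTIVE = "active file"
    NON_ACTIVE = "non-active file"


class DedupRule(Enum):
    BLOCK_LEVEL = "block-level data de-duplication"
    FILE_LEVEL = "file-level data de-duplication"


# Correspondence between classification categories and de-duplication rules.
RULE_FOR_CLASS = {
    FileClass.ACTIVE: DedupRule.BLOCK_LEVEL,
    FileClass.NON_ACTIVE: DedupRule.FILE_LEVEL,
}


def select_rule(file_class: FileClass) -> DedupRule:
    """Return the de-duplication rule that corresponds to a file class."""
    return RULE_FOR_CLASS[file_class]
```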
Here, the block-level data de-duplication rule is a rule under which a file is divided into multiple data blocks according to a set data block division rule, the fingerprint information of each data block is calculated, and data de-duplication is performed according to the calculated fingerprint information of each data block.
The file-level data de-duplication rule is a rule under which at least part of the file data is selected from a file, the fingerprint information of the selected file data is calculated, and data de-duplication is performed according to the calculated fingerprint information of that file data.
Specifically, in step 102, determining the data de-duplication rule used for the file to be stored according to the classification of the file includes:
First, the occurrence count of the file type of the file to be stored is obtained.
For example, if the file type of the file to be stored is the doc file type, the occurrences of the doc file type in the file database are counted, giving an occurrence count of, say, 100.
Second, it is judged whether the occurrence count of the file type of the file to be stored is greater than a threshold.
Specifically, the occurrence count of the file type of the file to be stored is compared with the threshold.
Third, when the occurrence count of the file type of the file to be stored is greater than the threshold, files of that file type are determined to be active files and, according to the correspondence between active files and the block-level data de-duplication rule, the data de-duplication rule used for the file to be stored is determined to be the block-level data de-duplication rule; when the occurrence count of the file type of the file to be stored is not greater than the threshold, files of that file type are determined to be non-active files and, according to the correspondence between non-active files and the file-level data de-duplication rule, the data de-duplication rule used for the file to be stored is determined to be the file-level data de-duplication rule.
In this way, the data de-duplication rule for a given file type is adjusted in real time according to how often the file type occurs in different time periods, and after prolonged training and learning the data de-duplication rate can be improved.
Here, the block-level data de-duplication rule is a rule under which a file of the given file type is divided into multiple data blocks according to a set data block division rule, the fingerprint information of each data block is calculated, and data de-duplication is performed according to the calculated fingerprint information of each data block.
The division rule may specify the size of the data blocks, the division duration, and so on; it is not limited here.
Specifically, assuming that the set data block size is 1M, the file to be stored is divided into multiple data blocks (each of size 1M), and the fingerprint information of each data block is obtained using a hash algorithm.
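A minimal sketch of this division and fingerprinting step follows. It assumes fixed-size blocks, reads "1M" as 1 MiB, and uses SHA-1 as the hash algorithm because the background section mentions it; the function names are hypothetical, and the last block may be shorter than the others.

```python
import hashlib

BLOCK_SIZE = 1024 * 1024  # assumed reading of "1M": 1 MiB per data block


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Divide the file content into consecutive fixed-size data blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def block_fingerprints(data: bytes, block_size: int = BLOCK_SIZE):
    """Calculate one SHA-1 fingerprint per data block, in block order."""
    return [hashlib.sha1(block).hexdigest()
            for block in split_into_blocks(data, block_size)]


# A tiny block size keeps the example readable.
fingerprints = block_fingerprints(b"aaaaaaaabb", block_size=4)
```

Here the first two blocks have the same content and therefore the same fingerprint, while the shorter final block has a different one.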
Thus, for the same file, the smaller the data block size, the finer the division granularity and the more fingerprint information is calculated, so the data de-duplication rate is higher when the file is de-duplicated. The block-level data de-duplication rule is particularly suitable for file types that occur many times within a short period: it helps to quickly determine the repeated data blocks in files of that type and improves the file data de-duplication rate.
Here, the file-level data de-duplication rule is a rule under which at least part of the file data is selected from a file, the fingerprint information of the selected file data is calculated, and data de-duplication is performed according to the calculated fingerprint information of that file data.
Specifically, assuming that the at least part of the file data selected from the file to be stored is the summary part of the file to be stored, the fingerprint information of the selected summary data is calculated using a hash algorithm, and the calculated fingerprint information is used as the fingerprint information of the file to be stored.
The file-level data de-duplication rule is suitable for file types that occur rarely within a short period, that is, file types whose files are seldom repeated, and it improves the file data de-duplication rate.
It can be seen that, relative to the file-level data de-duplication rule, the block-level data de-duplication rule is a fine-grained rule. It avoids the situation in which a large amount of duplicate data remains after a file has been de-duplicated using the file-level data de-duplication rule.
Step 103: According to the determined data de-duplication rule, perform data de-duplication on the file to be stored.
Specifically, in step 103, when the file to be stored is an active file, the data de-duplication rule used for the file to be stored is the block-level data de-duplication rule, and performing data de-duplication on the file to be stored according to the block-level rule includes:
First, according to the block-level data de-duplication rule, the file to be stored is divided into multiple data blocks, and the fingerprint information of each data block is calculated.
Second, the fingerprint information of each data block is compared with the stored fingerprint information.
Specifically, the fingerprint information of each data block is compared with the fingerprint information stored in the file fingerprint database to determine whether the fingerprint information of each data block is already stored there.
Third, when the fingerprint information of a data block is the same as stored fingerprint information, reference information between the data block and the matching stored fingerprint information is stored, and the data block is discarded; when the fingerprint information of a data block differs from the stored fingerprint information, the data block and its calculated fingerprint information are stored.
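The three block-level steps above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `dedup_blocks`, the in-memory dictionary standing in for the file fingerprint database, the fixed-size split and the choice of SHA-256 as the hash algorithm are all assumptions made for illustration.

```python
import hashlib


def dedup_blocks(data: bytes, block_size: int, fingerprint_db: dict) -> list:
    """Split data into fixed-size blocks and store only blocks not seen before.

    fingerprint_db maps fingerprint -> stored block; it is a hypothetical
    stand-in for the file fingerprint database plus block storage.
    Returns the list of fingerprints that acts as the file's reference info.
    """
    references = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in fingerprint_db:
            # Unknown fingerprint: store the block together with its fingerprint.
            fingerprint_db[fingerprint] = block
        # Known or new, the file itself keeps only a reference to the block.
        references.append(fingerprint)
    return references


db = {}
refs = dedup_blocks(b"AAAABBBBAAAACCCC", block_size=4, fingerprint_db=db)
print(len(refs), len(db))  # 4 references, 3 unique blocks stored
```

The original content can be rebuilt by concatenating `db[f]` for each fingerprint `f` in `refs`, which is why only the reference needs to be kept for a repeated block.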
When the file to be stored is a non-active file, the data de-duplication rule used for the file to be stored is the file-level data de-duplication rule, and performing data de-duplication on the file to be stored according to the file-level rule specifically includes:
First, according to the file-level data de-duplication rule, at least part of the file data is selected from the file to be stored, and the fingerprint information of the selected file data is calculated.
Second, the calculated fingerprint information of the selected file data is compared with the stored fingerprint information.
Specifically, the calculated fingerprint information of the selected file data is compared with the fingerprint information stored in the file fingerprint database to determine whether it is already stored there.
Third, when the calculated fingerprint information of the selected file data is the same as stored fingerprint information, reference information between the file to be stored and the matching stored fingerprint information is stored, and the file to be stored is discarded; when the calculated fingerprint information differs from the stored fingerprint information, the file to be stored and the calculated fingerprint information of the selected file data are stored.
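The file-level procedure above can be sketched in the same way. This is an illustrative sketch only: `dedup_file`, the `sample_size` parameter, the two dictionaries and the use of SHA-256 are assumptions, since the text does not specify which part of the file is selected or which hash algorithm is used.

```python
import hashlib


def dedup_file(name, data, fingerprint_db, references, sample_size=None):
    """File-level de-duplication of a single file.

    fingerprint_db maps fingerprint -> stored file bytes; references maps
    file name -> fingerprint. Both are hypothetical in-memory stand-ins.
    sample_size=None fingerprints the whole file; otherwise only the first
    sample_size bytes are selected as the "at least part of the file data".
    Returns True if the file was discarded as a duplicate.
    """
    selected = data if sample_size is None else data[:sample_size]
    fingerprint = hashlib.sha256(selected).hexdigest()
    duplicate = fingerprint in fingerprint_db
    if not duplicate:
        # New fingerprint: store the file and its fingerprint.
        fingerprint_db[fingerprint] = data
    # In both cases the file name is recorded as a reference to the fingerprint.
    references[name] = fingerprint
    return duplicate


db, refs = {}, {}
first = dedup_file("a.txt", b"hello", db, refs)
second = dedup_file("b.txt", b"hello", db, refs)
print(first, second, len(db))  # False True 1
```

Fingerprinting only a prefix is cheaper, but two different files that share that prefix would then be treated as duplicates, so the sketch defaults to hashing the whole file.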
With the scheme of embodiment one of the present invention, the classification of the file to be stored is identified, the data de-duplication rule used for the file to be stored is determined according to the file classification, and data de-duplication is performed on the file to be stored according to the determined rule. Because the rule is chosen from the classification of the file, the file to be stored is de-duplicated in a targeted way, which improves the file data de-duplication rate.
Embodiment two:
Fig. 2 is a schematic flow chart of a method for performing data de-duplication on a file according to embodiment two of the present invention. Embodiment two belongs to the same inventive concept as embodiment one, and the method includes:
Step 201: Judge whether the received file to be stored belongs to the active files recorded in the active file database; if it does, perform step 202; if it does not, perform step 206.
Specifically, in step 201, the ways of obtaining the active files stored in the active file database include but are not limited to the following.
Fig. 3 is a schematic flow chart of the method for obtaining the active files in the active file database.
Step 21: Scan all files in the current active file database and determine the file type of each file.
Step 22: For each file type, obtain the number of occurrences of the file type from the file type basic information database, count the file-level repetition count and the block-level repetition count of the file type, and generate a file type repetition statistics table.
Table 1 is the file type repetition statistics table:
File type | Number of occurrences | File-level repetition count | Block-level repetition count |
doc file type | 150 | 56 | 94 |
txt file type | 120 | 45 | 75 |
pdf file type | 125 | 46 | 79 |
Table 1
The file type basic information database is a database that stores file type information and the number of occurrences of each file type.
Step 23: Read the data of any file type in the file type repetition statistics table, and determine the whole-file repetition rate of the file type from its file-level repetition count and block-level repetition count.
Specifically, the whole-file repetition rate of a file type equals the ratio of the file-level repetition count of the file type to its block-level repetition count.
For example, if the data read from the file type repetition statistics table is the doc file type, with a file-level repetition count of 56 and a block-level repetition count of 94, the whole-file repetition rate of the doc file type is 56/94. If the data read is the txt file type, with a file-level repetition count of 45 and a block-level repetition count of 75, the whole-file repetition rate of the txt file type is 45/75. If the data read is the pdf file type, with a file-level repetition count of 46 and a block-level repetition count of 79, the whole-file repetition rate of the pdf file type is 46/79.
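The ratio defined in step 23 can be checked against the Table 1 values with a few lines of Python. The dictionary layout and the function name `whole_file_repetition_rate` are illustrative assumptions; only the counts come from Table 1.

```python
from fractions import Fraction

# (file-level repetition count, block-level repetition count) from Table 1
table_1 = {"doc": (56, 94), "txt": (45, 75), "pdf": (46, 79)}


def whole_file_repetition_rate(file_level: int, block_level: int) -> Fraction:
    """Ratio of file-level repetitions to block-level repetitions."""
    return Fraction(file_level, block_level)


rates = {t: whole_file_repetition_rate(*counts) for t, counts in table_1.items()}
for file_type, rate in rates.items():
    # Print the exact ratio and its decimal value for comparison with a threshold.
    print(f"{file_type}: {rate} = {float(rate):.3f}")
```

Expressed as percentages, the three rates are roughly 59.6% (doc), 60.0% (txt) and 58.2% (pdf), which is the form in which they are compared with the threshold in step 24.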
Step 24: Compare the calculated whole-file repetition rate of each file type with a threshold.
Specifically, judge whether the calculated whole-file repetition rate of each file type is greater than the threshold.
It should be noted that the threshold may be a percentage between 1% and 100%, and may be set according to actual needs.
Step 25: According to the comparison result, determine whether the files of each file type are active files or non-active files.
Specifically, for a selected file type, when the calculated whole-file repetition rate of the file type is greater than the threshold, the files of that file type are determined to be active files; when the calculated whole-file repetition rate of the file type is not greater than the threshold, the files of that file type are determined to be non-active files.
Preferably, the file types determined to correspond to active files are refreshed into the active file database, and the file types determined to correspond to non-active files are deleted from the active file database.
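Steps 24 and 25, together with the refresh of the active file database, can be sketched as follows. The threshold of 0.59 is a hypothetical value chosen only so that the Table 1 figures fall on both sides of it, and a Python set stands in for the active file database.

```python
def refresh_active_types(rates: dict, threshold: float, active_db: set) -> set:
    """Add file types above the threshold to active_db and remove the rest."""
    for file_type, rate in rates.items():
        if rate > threshold:
            active_db.add(file_type)       # files of this type are active files
        else:
            active_db.discard(file_type)   # files of this type are non-active files
    return active_db


# Whole-file repetition rates computed from the Table 1 counts.
rates = {"doc": 56 / 94, "txt": 45 / 75, "pdf": 46 / 79}
active_db = {"pdf"}  # assumed previous content of the active file database
refresh_active_types(rates, threshold=0.59, active_db=active_db)
print(sorted(active_db))  # ['doc', 'txt']
```

With this assumed threshold, doc and txt are refreshed into the database and pdf, which was previously recorded, is deleted from it.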
Specifically, the file type of the received file to be stored is compared with the file types stored in the active file database. When a file type identical to that of the file to be stored is found in the active file database, the file to be stored is determined to be an active file; when no identical file type is found in the active file database, the file to be stored is determined to be a non-active file.
Step 202: When the received file to be stored is determined to be an active file, determine, according to the correspondence between active files and the block-level data de-duplication rule, that the data de-duplication rule used for the file to be stored is the block-level data de-duplication rule.
The block-level data de-duplication rule refers to a rule under which a file of the given file type is divided into multiple data blocks according to a set block division rule, the fingerprint information of each data block is calculated, and data de-duplication is performed according to the calculated fingerprint information of each data block.
Step 203: According to the block-level data de-duplication rule, divide the file to be stored into multiple data blocks and calculate the fingerprint information of each data block.
Specifically, in step 203, assuming that the set data block size is 1M, the file to be stored is divided into multiple data blocks (each of size 1M), and the fingerprint information of each data block is obtained with a hash algorithm.
Thus, for the same file, the smaller the set data block size, the finer the division granularity and the more fingerprint information is calculated, and the higher the data de-duplication rate when the file is de-duplicated. The block-level data de-duplication rule is especially suited to file types that occur frequently over a short period: it both helps to quickly identify the repeated data blocks in files of that type and improves the file data de-duplication rate.
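The effect of block size on the number of fingerprints and on the amount of data actually stored can be seen in a small experiment. This is a toy sketch under stated assumptions: block sizes of 8 and 4 bytes replace the 1M of the example, SHA-256 is an assumed hash algorithm, and `unique_block_stats` is a hypothetical helper.

```python
import hashlib


def unique_block_stats(data: bytes, block_size: int):
    """Return (fingerprints computed, unique blocks, bytes actually stored)."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # Keep one block per distinct fingerprint, as a fingerprint database would.
    unique = {hashlib.sha256(b).hexdigest(): b for b in blocks}
    stored_bytes = sum(len(b) for b in unique.values())
    return len(blocks), len(unique), stored_bytes


data = b"AAAABBBB" + b"AAAACCCC"
coarse = unique_block_stats(data, 8)
fine = unique_block_stats(data, 4)
print(coarse)  # (2, 2, 16): no repetition is visible at 8-byte granularity
print(fine)    # (4, 3, 12): the repeated "AAAA" block is stored only once
```

The finer split computes twice as many fingerprints but stores 12 bytes instead of 16, which is the trade-off between fingerprint volume and de-duplication rate that the paragraph above describes.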
Step 204: Judge whether the fingerprint information of each data block is the same as the stored fingerprint information.
Specifically, the fingerprint information of each data block is compared with the fingerprint information stored in the file fingerprint database to determine whether the fingerprint information of each data block is already stored there.
Step 205: According to the judgment result, perform data de-duplication processing on the file to be stored.
Specifically, in step 205, when the fingerprint information of a data block is the same as stored fingerprint information, reference information between the data block and the matching stored fingerprint information is stored, and the data block is discarded; when the fingerprint information of a data block differs from the stored fingerprint information, the data block and its calculated fingerprint information are stored.
Step 206: When the received file to be stored is determined to be a non-active file, determine, according to the correspondence between non-active files and the file-level data de-duplication rule, that the data de-duplication rule corresponding to the file to be stored is the file-level data de-duplication rule.
The file-level data de-duplication rule refers to a rule under which at least part of the file data is selected from a file, the fingerprint information of the selected file data is calculated, and data de-duplication is performed according to the calculated fingerprint information of that file data.
Specifically, assuming that the selected part of the file data is the summary portion of the file to be stored, the fingerprint information of the selected portion is calculated with a hash algorithm, and the calculated fingerprint information is used as the fingerprint information of the file to be stored.
The file-level data de-duplication rule is suited to file types that occur infrequently over a short period, that is, file types with few repeated files, and it improves the file data de-duplication rate for such files.
Step 207: According to the file-level data de-duplication rule, select at least part of the file data from the file to be stored and calculate the fingerprint information of the selected file data.
Step 208: Compare the calculated fingerprint information of the selected file data with the stored fingerprint information.
Specifically, the calculated fingerprint information of the selected file data is compared with the fingerprint information stored in the file fingerprint database to determine whether it is already stored there.
Step 209: According to the comparison result, perform data de-duplication processing on the file to be stored.
Specifically, in step 209, when the calculated fingerprint information of the selected file data is the same as stored fingerprint information, reference information between the file to be stored and the matching stored fingerprint information is stored, and the file to be stored is discarded; when the calculated fingerprint information differs from the stored fingerprint information, the file to be stored and the calculated fingerprint information of the selected file data are stored.
With the scheme of embodiment two of the present invention, a hybrid data de-duplication technique is used, which reduces the number of file-splitting operations and the amount of fingerprint information in the system. The block-level and file-level data de-duplication rules are applied selectively to different files, which improves the file data de-duplication rate.
Embodiment three:
Fig. 4 is a schematic flow chart of a data de-duplication method according to embodiment three of the present invention. Embodiment three belongs to the same inventive concept as embodiments one and two, and the method includes:
Step 301: Monitor the files to be stored that are input through the I/O port, and determine the file type of each monitored file to be stored with a file type recognizer.
Specifically, in step 301, the files to be stored that are input through the I/O port are monitored in real time, and the file type of each monitored file to be stored is identified with the file type recognizer.
Preferably, after the file type of the file to be stored is determined, the file type is looked up in the file type basic information database, its number of occurrences is increased by a set value, and the number of occurrences of the file type in the file type basic information database is refreshed.
The file type basic information database is a database that stores file type information and the number of occurrences of each file type.
Step 302: Obtain the number of occurrences of the file type of the file to be stored.
Step 303: Judge whether the number of occurrences of the file type of the file to be stored is greater than a threshold.
Specifically, the number of occurrences of the file type of the file to be stored is compared with the threshold.
When the number of occurrences of the file type of the file to be stored is greater than the threshold, perform steps 304, 305, 306 and 307; when it is not greater than the threshold, perform steps 308, 309, 310 and 311.
Preferably, in step 303, when the number of occurrences of the file type of the file to be stored is greater than the threshold, the file to be stored is determined to be an active file, and its file type is refreshed into the active file database.
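Steps 301 to 303 can be sketched as a counter plus a threshold test. Several details here are assumptions rather than part of the described method: the file type is recognized from the file name extension, the set value added per occurrence is 1, the threshold is 2, and a `Counter` and a set stand in for the file type basic information database and the active file database.

```python
import os
from collections import Counter

type_counts = Counter()  # stand-in for the file type basic information database
active_db = set()        # stand-in for the active file database


def choose_rule(filename: str, threshold: int, step: int = 1) -> str:
    """Update the occurrence count of the file's type and pick a rule."""
    # Assumed file type recognizer: use the lower-cased extension.
    file_type = os.path.splitext(filename)[1].lstrip(".").lower()
    type_counts[file_type] += step
    if type_counts[file_type] > threshold:
        active_db.add(file_type)  # active file: continue with steps 304 to 307
        return "block-level"
    return "file-level"           # non-active file: continue with steps 308 to 311


names = ["a.doc", "b.doc", "c.doc", "d.txt"]
rules = [choose_rule(n, threshold=2) for n in names]
print(rules)  # ['file-level', 'file-level', 'block-level', 'file-level']
```

In this run the doc type crosses the assumed threshold on its third occurrence, so only the third doc file is routed to block-level de-duplication.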
Step 304: When the number of occurrences of the file type of the file to be stored is greater than the threshold, determine, according to the correspondence between the file classification determined from the number of occurrences of the file type and the data de-duplication rules, that the data de-duplication rule for the file to be stored is the block-level data de-duplication rule.
The correspondence between the file classification determined from the file type of the file to be stored and the data de-duplication rules is as follows: when the number of occurrences of the file type of the file to be stored is greater than the threshold, the file to be stored is an active file and corresponds to the block-level data de-duplication rule; when the number of occurrences is not greater than the threshold, the file to be stored is a non-active file and corresponds to the file-level data de-duplication rule.
The block-level data de-duplication rule refers to a rule under which a file of the given file type is divided into multiple data blocks according to a set block division rule, the fingerprint information of each data block is calculated, and data de-duplication is performed according to the calculated fingerprint information of each data block.
Step 305: According to the block-level data de-duplication rule, divide the file to be stored into multiple data blocks and calculate the fingerprint information of each data block.
Specifically, in step 305, assuming that the set data block size is 1M, the file to be stored is divided into multiple data blocks (each of size 1M), and the fingerprint information of each data block is obtained with a hash algorithm.
Thus, for the same file, the smaller the set data block size, the finer the division granularity and the more fingerprint information is calculated, and the higher the data de-duplication rate when the file is de-duplicated. The block-level data de-duplication rule is especially suited to file types that occur frequently over a short period: it both helps to quickly identify the repeated data blocks in files of that type and improves the file data de-duplication rate.
Step 306: Compare the fingerprint information of each data block with the stored fingerprint information.
Specifically, in step 306, the fingerprint information of each data block is compared with the fingerprint information stored in the file fingerprint database to determine whether the fingerprint information of each data block is already stored there.
Step 307: According to the comparison result, perform data de-duplication processing on the file to be stored.
Specifically, in step 307, when the fingerprint information of a data block is the same as stored fingerprint information, reference information between the data block and the matching stored fingerprint information is stored, and the data block is discarded; when the fingerprint information of a data block differs from the stored fingerprint information, the data block and its calculated fingerprint information are stored.
Step 308: When the number of occurrences of the file type of the file to be stored is not greater than the threshold, determine, according to the correspondence between the file classification determined from the number of occurrences of the file type and the data de-duplication rules, that the data de-duplication rule corresponding to the file to be stored is the file-level data de-duplication rule.
The file-level data de-duplication rule refers to a rule under which at least part of the file data is selected from a file, the fingerprint information of the selected file data is calculated, and data de-duplication is performed according to the calculated fingerprint information of that file data.
Step 309: According to the file-level data de-duplication rule, select at least part of the file data from the file to be stored and calculate the fingerprint information of the selected file data.
Specifically, in step 309, assuming that the selected part of the file data is the summary portion of the file to be stored, the fingerprint information of the selected portion is calculated with a hash algorithm, and the calculated fingerprint information is used as the fingerprint information of the file to be stored.
The file-level data de-duplication rule is suited to file types that occur infrequently over a short period, that is, file types with few repeated files, and it improves the file data de-duplication rate for such files.
Step 310: Compare the calculated fingerprint information of the selected file data with the stored fingerprint information.
Specifically, the calculated fingerprint information of the selected file data is compared with the fingerprint information stored in the file fingerprint database to determine whether it is already stored there.
Step 311: According to the comparison result, perform data de-duplication processing on the file to be stored.
Specifically, in step 311, when the calculated fingerprint information of the selected file data is the same as stored fingerprint information, reference information between the file to be stored and the matching stored fingerprint information is stored, and the file to be stored is discarded; when the calculated fingerprint information differs from the stored fingerprint information, the file to be stored and the calculated fingerprint information of the selected file data are stored.
Embodiment four:
Fig. 5 is a schematic structural diagram of a data de-duplication device according to embodiment four of the present invention. Embodiment four is a device under the same inventive concept as embodiments one to three, and the device includes an identification module 11, a deletion rule determination module 12 and a deletion module 13, wherein:
The identification module 11 is configured to identify the classification of the file to be stored;
The deletion rule determination module 12 is configured to determine, according to the classification of the file, the data de-duplication rule used for the file to be stored;
The deletion module 13 is configured to perform data de-duplication on the file to be stored according to the determined data de-duplication rule.
Specifically, the classification of the file includes active files and non-active files.
The identification module 11 is configured to obtain the number of occurrences of the file type of the file to be stored and judge whether the number of occurrences of the file type is greater than a threshold; when the number of occurrences of the file type is greater than the threshold, the file to be stored is determined to be an active file, and when the obtained number of occurrences of the file type of the file to be stored is not greater than the threshold, the file to be stored is determined to be a non-active file;
Or, the identification module 11 is configured to look up the file type of the file to be stored in the active file database; when the file type of the file to be stored is found in the active file database, the file to be stored is determined to be an active file, and when the file type of the file to be stored is not found in the active file database, the file to be stored is determined to be a non-active file.
Specifically, the deletion rule determination module 12 is configured to determine, when the file to be stored is an active file, that the data de-duplication rule used for the file to be stored is the block-level data de-duplication rule.
The deletion module 13 is specifically configured to: divide the file to be stored into multiple data blocks according to the block-level data de-duplication rule and calculate the fingerprint information of each data block; compare the fingerprint information of each data block with the stored fingerprint information; when the fingerprint information of a data block is the same as stored fingerprint information, store reference information between the data block and the matching stored fingerprint information, and discard the data block; when the fingerprint information of a data block differs from the stored fingerprint information, store the data block and its calculated fingerprint information.
Specifically, the deletion rule determination module 12 is configured to determine, when the file to be stored is a non-active file, that the data de-duplication rule used for the file to be stored is the file-level data de-duplication rule.
The deletion module 13 is specifically configured to: select at least part of the file data from the file to be stored according to the file-level data de-duplication rule and calculate the fingerprint information of the selected file data; compare the calculated fingerprint information of the selected file data with the stored fingerprint information; when the calculated fingerprint information of the selected file data is the same as stored fingerprint information, store reference information between the file to be stored and the matching stored fingerprint information, and discard the file to be stored; when the calculated fingerprint information differs from the stored fingerprint information, store the file to be stored and the calculated fingerprint information of the selected file data.
It should be noted that the data de-duplication device of the present invention may be a hardware device used in a file storage server, or a logical component integrated in a VDI system; no specific limitation is made here.
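The division of work among the three modules of embodiment four can be illustrated with a small Python sketch. The class and method names, the in-memory store, the SHA-256 hash and the choice to fingerprint the whole file at file level are all assumptions made for illustration; the text only fixes the responsibilities of modules 11, 12 and 13.

```python
import hashlib


class IdentificationModule:
    """Module 11: classify a file by looking up its type in the active database."""

    def __init__(self, active_types):
        self.active_types = active_types

    def classify(self, file_type):
        return "active" if file_type in self.active_types else "non-active"


class RuleDeterminationModule:
    """Module 12: map the file classification to a de-duplication rule."""

    RULES = {"active": "block-level", "non-active": "file-level"}

    def rule_for(self, classification):
        return self.RULES[classification]


class DeletionModule:
    """Module 13: de-duplicate data according to the chosen rule."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.store = {}  # fingerprint -> stored block or file

    def dedup(self, data, rule):
        if rule == "block-level":
            units = [data[i:i + self.block_size]
                     for i in range(0, len(data), self.block_size)]
        else:
            units = [data]  # assumption: fingerprint the whole file
        refs = []
        for unit in units:
            fingerprint = hashlib.sha256(unit).hexdigest()
            self.store.setdefault(fingerprint, unit)  # store only if unseen
            refs.append(fingerprint)
        return refs


ident = IdentificationModule({"doc"})
rules = RuleDeterminationModule()
deleter = DeletionModule(block_size=4)

doc_rule = rules.rule_for(ident.classify("doc"))
doc_refs = deleter.dedup(b"AAAABBBBAAAA", doc_rule)
txt_rule = rules.rule_for(ident.classify("txt"))
txt_refs = deleter.dedup(b"hello", txt_rule)
print(doc_rule, len(doc_refs), txt_rule, len(txt_refs), len(deleter.store))
```

Keeping the rule lookup separate from the deletion logic mirrors the split between modules 12 and 13, so the classification criterion can change without touching the de-duplication code.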
Embodiment five:
Fig. 6 is a schematic structural diagram of a data de-duplication device according to embodiment five of the present invention. Embodiment five is a device under the same inventive concept as embodiment four, and the device includes an input monitor 21, a processor 22, a memory 23 and a file database 24, where the input monitor 21, the processor 22, the memory 23 and the file database 24 are connected through a bus 25, wherein:
The input monitor 21 is configured to identify the classification of the file to be stored;
The processor 22 is configured to determine, according to the classification of the file, the data de-duplication rule used for the file to be stored, and to perform data de-duplication on the file to be stored according to the determined data de-duplication rule.
Specifically, the classification of the file includes active files and non-active files.
The input monitor 21 is configured to obtain the number of occurrences of the file type of the file to be stored and judge whether the number of occurrences of the file type is greater than a threshold; when the number of occurrences of the file type is greater than the threshold, the file to be stored is determined to be an active file, and when the obtained number of occurrences of the file type of the file to be stored is not greater than the threshold, the file to be stored is determined to be a non-active file;
Or, the input monitor 21 is configured to look up the file type of the file to be stored in the active file database 24; when the file type of the file to be stored is found in the active file database, the file to be stored is determined to be an active file, and when the file type of the file to be stored is not found in the active file database, the file to be stored is determined to be a non-active file.
Specifically, the processor 22 is configured to: when the file to be stored is an active file, determine that the data de-duplication rule used for the file to be stored is the block-level data de-duplication rule, divide the file to be stored into multiple data blocks according to the block-level rule, and calculate the fingerprint information of each data block; compare the fingerprint information of each data block with the stored fingerprint information; when the fingerprint information of a data block is the same as stored fingerprint information, store reference information between the data block and the matching stored fingerprint information, and discard the data block; when the fingerprint information of a data block differs from the stored fingerprint information, store the data block and its calculated fingerprint information.
The processor 22 is further configured to: when the file to be stored is a non-active file, determine that the data de-duplication rule used for the file to be stored is the file-level data de-duplication rule, select at least part of the file data from the file to be stored according to the file-level rule, and calculate the fingerprint information of the selected file data; compare the calculated fingerprint information of the selected file data with the stored fingerprint information; when the calculated fingerprint information of the selected file data is the same as stored fingerprint information, store reference information between the file to be stored and the matching stored fingerprint information, and discard the file to be stored; when the calculated fingerprint information differs from the stored fingerprint information, store the file to be stored and the calculated fingerprint information of the selected file data.
It should be noted that the non-duplicate data of the file to be stored is stored in the memory 23.
Fig. 7 is a logical architecture diagram of the data de-duplication device. The data de-duplication device includes an active file identification module 31, an active file database 32, an active file adjustment module 33, an IO monitor 34, a write instruction unit 35, a read instruction unit 36 and a main memory 37.
Specifically, the IO monitor 34 is configured to receive the file to be stored and send the received file to be stored to the active file identification module 31.
The active file identification module 31 is configured to obtain the number of occurrences of the file type of the file to be stored and judge whether the obtained number of occurrences of the file type is greater than a threshold.
The active file identification module 31 is further configured to: scan all files in the active file database and determine the file type of each file; for each file type, obtain the number of occurrences of the file type from the file type basic information database, count the number of occurrences, the file-level repetition count and the block-level repetition count of the file type, and generate a file type repetition statistics table; read the data of any file type in the file type repetition statistics table and determine the whole-file repetition rate of the file type from its file-level repetition count and block-level repetition count; compare the calculated whole-file repetition rate of each file type with the threshold; and, according to the comparison result, determine whether the files of each file type are active files or non-active files.
The active file identification module 31 is further configured to judge whether the file type of the file to be stored is the same as a file type stored in the active file database; when the file type of the file to be stored is the same as a file type stored in the active file database, the file to be stored is determined to be an active file; when the file type of the file to be stored differs from the file types stored in the active file database, the file to be stored is determined to be a non-active file.
The active file database 32 is configured to store active files.
The active file adjustment module 33 is configured to refresh the file types determined to correspond to active files into the active file database, and to delete the file types determined to correspond to non-active files from the active file database.
The write instruction unit 35 and the read instruction unit 36 are configured to perform write operations or read operations on the file to be stored.
Specifically, the write instruction unit 35 is configured to: when the fingerprint information of a data block is the same as stored fingerprint information, store reference information between the data block and the matching stored fingerprint information, and discard the data block; when the fingerprint information of a data block differs from the stored fingerprint information, store the data block and its calculated fingerprint information (in the main memory 37);
Or, when the calculated fingerprint information of the selected file data is the same as stored fingerprint information, store reference information between the file to be stored and the matching stored fingerprint information, and discard the file to be stored; when the calculated fingerprint information differs from the stored fingerprint information, store the file to be stored and the calculated fingerprint information of the selected file data (in the main memory 37).
FIG. 8 is a system architecture diagram of the data de-duplication device. The system includes virtual machines (Virtual Machine, VM) 411 to 41n, a hypervisor 42, a data de-duplication device 43 and a main storage device 44, wherein:
The data de-duplication device 43 is configured to collect all files to be stored from the hypervisor 42, perform data de-duplication on the files to be stored, and store the de-duplicated data in the main storage device 44.
Specifically, the data de-duplication device 43 is configured to identify the classification of a file to be stored; determine, according to the classification of the file, the data de-duplication rule to be used for the file to be stored; and perform data de-duplication on the file to be stored according to the determined data de-duplication rule.
Specifically, the classification of the file includes active file and non-active file.
The data de-duplication device 43 is configured to obtain the number of times the file type of the file to be stored occurs, and judge whether that number of occurrences is greater than a threshold; when the number of occurrences of the file type is greater than the threshold, determine that the file to be stored is an active file; and when the obtained number of occurrences of the file type of the file to be stored is not greater than the threshold, determine that the file to be stored is a non-active file;
Or, to look up the file type of the file to be stored in the active file database; when the file type of the file to be stored is found in the active file database, determine that the file to be stored is an active file; and when the file type of the file to be stored is not found in the active file database, determine that the file to be stored is a non-active file.
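The two alternative classification approaches described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: file types are represented as plain strings, the occurrence counts live in a `collections.Counter`, and the active file database is a set. The names `classify_by_count` and `classify_by_lookup` are hypothetical.

```python
from collections import Counter

ACTIVE = "active"
NON_ACTIVE = "non-active"


def classify_by_count(file_type, type_counts, threshold):
    """Active only if the file type occurs strictly more often than the threshold."""
    occurrences = type_counts.get(file_type, 0)
    return ACTIVE if occurrences > threshold else NON_ACTIVE


def classify_by_lookup(file_type, active_types):
    """Active only if the file type is found in the active file database."""
    return ACTIVE if file_type in active_types else NON_ACTIVE


# Hypothetical occurrence statistics and active file database.
type_counts = Counter(["docx", "docx", "docx", "iso"])
by_count = classify_by_count("docx", type_counts, threshold=2)
by_lookup = classify_by_lookup("iso", {"docx", "xlsx"})
```

Note that a count equal to the threshold yields a non-active file, matching the "not greater than the threshold" wording.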
Specifically, the data de-duplication device 43 is configured so that, when the file to be stored is an active file, the data de-duplication rule used for the file to be stored is block-level data de-duplication; and when the file to be stored is a non-active file, the data de-duplication rule used for the file to be stored is file-level data de-duplication.
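The mapping from file classification to de-duplication rule can be expressed as a small Python sketch. The enum `DedupRule` and the function `select_rule` are hypothetical names introduced only for illustration.

```python
from enum import Enum


class DedupRule(Enum):
    BLOCK_LEVEL = "block-level"
    FILE_LEVEL = "file-level"


def select_rule(is_active_file):
    """Active files get block-level de-duplication; non-active files get file-level."""
    return DedupRule.BLOCK_LEVEL if is_active_file else DedupRule.FILE_LEVEL


rule_for_active = select_rule(True)
rule_for_non_active = select_rule(False)
```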
The block-level data de-duplication rule refers to a rule of dividing the file corresponding to a file type into multiple data blocks according to a set data block division rule, calculating the fingerprint information of each data block, and performing data de-duplication according to the calculated fingerprint information of each data block.
Specifically, the data de-duplication device 43 is configured to divide the file to be stored into multiple data blocks according to the block-level data de-duplication rule, and calculate the fingerprint information of each data block; compare the fingerprint information of each data block with stored fingerprint information; when the fingerprint information of a data block is identical to stored fingerprint information, store reference information between the data block and the stored fingerprint information that is identical to the fingerprint information of the data block, and discard the data block; and when the fingerprint information of a data block differs from the stored fingerprint information, store the data block and the calculated fingerprint information of the data block.
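The block-level procedure above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: it assumes fixed-size blocks as the data block division rule and SHA-256 as the fingerprint function, neither of which is specified in the text. The class name `BlockStore` and its attributes are hypothetical.

```python
import hashlib


class BlockStore:
    """In-memory sketch of block-level de-duplication."""

    def __init__(self, block_size=4096):
        self.block_size = block_size  # assumed fixed-size division rule
        self.blocks = {}              # fingerprint -> block bytes (stored once)
        self.files = {}               # file name -> ordered list of fingerprints

    def write(self, name, data):
        refs = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            fingerprint = hashlib.sha256(block).hexdigest()
            if fingerprint not in self.blocks:
                # New fingerprint: store the block and its fingerprint.
                self.blocks[fingerprint] = block
            # In both cases keep a reference; a duplicate block is not stored again.
            refs.append(fingerprint)
        self.files[name] = refs
        return refs

    def read(self, name):
        # Rebuild the file by following its references.
        return b"".join(self.blocks[fp] for fp in self.files[name])


store = BlockStore(block_size=4)
store.write("a.bin", b"AAAABBBBAAAA")
```

In this usage the file has three blocks but only two distinct fingerprints, so only two blocks are kept while `read` still returns the original bytes.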
Specifically, the file-level data de-duplication rule refers to a rule of selecting at least part of the file data from a file, calculating the fingerprint information of the selected file data, and performing data de-duplication according to the calculated fingerprint information of the file data.
Specifically, the data de-duplication device 43 is configured to select at least part of the file data from the file to be stored according to the file-level data de-duplication rule, and calculate the fingerprint information of the at least part of the file data; compare the calculated fingerprint information of the at least part of the file data with stored fingerprint information; when the calculated fingerprint information of the at least part of the file data is identical to stored fingerprint information, store reference information between the file to be stored and the stored fingerprint information that is identical to the fingerprint information of the at least part of the file data, and discard the file to be stored; and when the calculated fingerprint information differs from the stored fingerprint information, store the file to be stored and the calculated fingerprint information of the at least part of the file data.
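The file-level procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: SHA-256 is used as the fingerprint function, and "at least part of the file data" is modelled as either the whole file or its first `sample_size` bytes. The class name `FileStore`, the `sample_size` parameter and the attribute names are hypothetical.

```python
import hashlib


class FileStore:
    """In-memory sketch of file-level de-duplication."""

    def __init__(self, sample_size=None):
        self.sample_size = sample_size  # None means fingerprint the whole file
        self.by_fingerprint = {}        # fingerprint -> stored file bytes
        self.references = {}            # file name -> fingerprint

    def fingerprint(self, data):
        selected = data if self.sample_size is None else data[:self.sample_size]
        return hashlib.sha256(selected).hexdigest()

    def write(self, name, data):
        """Store the file or a reference to it; return True if it was a duplicate."""
        fp = self.fingerprint(data)
        is_duplicate = fp in self.by_fingerprint
        if not is_duplicate:
            # New fingerprint: store the file together with its fingerprint.
            self.by_fingerprint[fp] = data
        # Either way, record the reference; a duplicate file is not stored again.
        self.references[name] = fp
        return is_duplicate


files = FileStore()
first = files.write("report.docx", b"same content")
second = files.write("copy.docx", b"same content")
```

Fingerprinting only a prefix of the file is cheaper but treats any two files sharing that prefix as duplicates, so the sketch defaults to fingerprinting the whole file.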
It will be understood by those skilled in the art that embodiments of the present invention may be provided as a method, an apparatus (device), or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk memory, CD-ROM, optical memory, etc.) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, apparatus (device) and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, the instruction apparatus implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art, once they know the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of the present invention.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from the spirit and scope of the present invention. Thus, if these changes and modifications of the present invention fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is also intended to include these changes and modifications.
Claims (12)
1. A data de-duplication method, characterized by comprising:
identifying the classification of a file to be stored according to the result of comparing the number of times the file type of the file to be stored occurs with a set threshold; or identifying the classification of the file to be stored according to the result of looking up the file type of the file to be stored in an active file database;
wherein the classification of the file includes active file and non-active file;
determining, according to the classification of the file, the data de-duplication rule used for the file to be stored, which specifically includes: when the file to be stored is an active file, the data de-duplication rule used for the file to be stored is block-level data de-duplication; when the file to be stored is a non-active file, the data de-duplication rule used for the file to be stored is file-level data de-duplication; and
performing data de-duplication on the file to be stored according to the determined data de-duplication rule.
2. The method of claim 1, characterized in that identifying the classification of the file to be stored according to the result of comparing the number of times the file type of the file to be stored occurs with the set threshold specifically includes:
obtaining the number of times the file type of the file to be stored occurs, and judging whether the number of occurrences of the file type is greater than the threshold; when the number of occurrences of the file type is greater than the threshold, determining that the file to be stored is an active file; and when the obtained number of occurrences of the file type of the file to be stored is not greater than the threshold, determining that the file to be stored is a non-active file;
or, identifying the classification of the file to be stored according to the result of looking up the file type of the file to be stored in the active file database specifically includes:
looking up the file type of the file to be stored in the active file database; when the file type of the file to be stored is found in the active file database, determining that the file to be stored is an active file; and when the file type of the file to be stored is not found in the active file database, determining that the file to be stored is a non-active file.
3. The method of claim 1, characterized in that performing data de-duplication on the file to be stored according to the determined data de-duplication rule specifically includes:
dividing the file to be stored into multiple data blocks according to the block-level data de-duplication rule, and calculating the fingerprint information of each data block;
comparing the fingerprint information of each data block with stored fingerprint information;
when the fingerprint information of a data block is identical to stored fingerprint information, storing reference information between the data block and the stored fingerprint information that is identical to the fingerprint information of the data block, and discarding the data block; and when the fingerprint information of a data block differs from the stored fingerprint information, storing the data block and the calculated fingerprint information of the data block.
4. The method of claim 1, characterized in that performing data de-duplication on the file to be stored according to the determined data de-duplication rule specifically includes:
selecting at least part of the file data from the file to be stored according to the file-level data de-duplication rule, and calculating the fingerprint information of the at least part of the file data;
comparing the calculated fingerprint information of the at least part of the file data with stored fingerprint information;
when the calculated fingerprint information of the at least part of the file data is identical to stored fingerprint information, storing reference information between the file to be stored and the stored fingerprint information that is identical to the fingerprint information of the at least part of the file data, and discarding the file to be stored; and when the calculated fingerprint information differs from the stored fingerprint information, storing the file to be stored and the calculated fingerprint information of the at least part of the file data.
5. A data de-duplication device, characterized by comprising:
an identification module, configured to identify the classification of a file to be stored according to the result of comparing the number of times the file type of the file to be stored occurs with a set threshold; or configured to identify the classification of the file to be stored according to the result of looking up the file type of the file to be stored in an active file database; wherein the classification of the file includes active file and non-active file;
a deletion rule determining module, configured to determine, according to the classification of the file, the data de-duplication rule used for the file to be stored, and specifically configured so that: when the file to be stored is an active file, the data de-duplication rule used for the file to be stored is block-level data de-duplication; when the file to be stored is a non-active file, the data de-duplication rule used for the file to be stored is file-level data de-duplication; and
a deletion module, configured to perform data de-duplication on the file to be stored according to the determined data de-duplication rule.
6. The device of claim 5, characterized in that, when identifying the classification of the file to be stored according to the result of comparing the number of times the file type of the file to be stored occurs with the set threshold, the identification module is specifically configured to: obtain the number of times the file type of the file to be stored occurs, and judge whether the number of occurrences of the file type is greater than the threshold; when the number of occurrences of the file type is greater than the threshold, determine that the file to be stored is an active file; and when the obtained number of occurrences of the file type of the file to be stored is not greater than the threshold, determine that the file to be stored is a non-active file;
or, when identifying the classification of the file to be stored according to the result of looking up the file type of the file to be stored in the active file database, the identification module is specifically configured to: look up the file type of the file to be stored in the active file database; when the file type of the file to be stored is found in the active file database, determine that the file to be stored is an active file; and when the file type of the file to be stored is not found in the active file database, determine that the file to be stored is a non-active file.
7. The device of claim 5, characterized in that
the deletion module is specifically configured to: divide the file to be stored into multiple data blocks according to the block-level data de-duplication rule, and calculate the fingerprint information of each data block; compare the fingerprint information of each data block with stored fingerprint information; when the fingerprint information of a data block is identical to stored fingerprint information, store reference information between the data block and the stored fingerprint information that is identical to the fingerprint information of the data block, and discard the data block; and when the fingerprint information of a data block differs from the stored fingerprint information, store the data block and the calculated fingerprint information of the data block.
8. The device of claim 5, characterized in that
the deletion module is specifically configured to: select at least part of the file data from the file to be stored according to the file-level data de-duplication rule, and calculate the fingerprint information of the at least part of the file data; compare the calculated fingerprint information of the at least part of the file data with stored fingerprint information; when the calculated fingerprint information of the at least part of the file data is identical to stored fingerprint information, store reference information between the file to be stored and the stored fingerprint information that is identical to the fingerprint information of the at least part of the file data, and discard the file to be stored; and when the calculated fingerprint information differs from the stored fingerprint information, store the file to be stored and the calculated fingerprint information of the at least part of the file data.
9. A data de-duplication device, characterized by comprising:
an input monitor, configured to identify the classification of a file to be stored according to the result of comparing the number of times the file type of the file to be stored occurs with a set threshold; or configured to identify the classification of the file to be stored according to the result of looking up the file type of the file to be stored in an active file database; wherein the classification of the file includes active file and non-active file; and
a processor, configured to determine, according to the classification of the file, the data de-duplication rule used for the file to be stored, and to perform data de-duplication on the file to be stored according to the determined data de-duplication rule;
wherein, when determining the data de-duplication rule used for the file to be stored according to the classification of the file, the processor is specifically configured so that: when the file to be stored is an active file, the data de-duplication rule used for the file to be stored is block-level data de-duplication; and when the file to be stored is a non-active file, the data de-duplication rule used for the file to be stored is file-level data de-duplication.
10. The device of claim 9, characterized in that, when identifying the classification of the file to be stored according to the result of comparing the number of times the file type of the file to be stored occurs with the set threshold, the input monitor is specifically configured to: obtain the number of times the file type of the file to be stored occurs, and judge whether the number of occurrences of the file type is greater than the threshold; when the number of occurrences of the file type is greater than the threshold, determine that the file to be stored is an active file; and when the obtained number of occurrences of the file type of the file to be stored is not greater than the threshold, determine that the file to be stored is a non-active file;
or, when identifying the classification of the file to be stored according to the result of looking up the file type of the file to be stored in the active file database, the input monitor is specifically configured to: look up the file type of the file to be stored in the active file database; when the file type of the file to be stored is found in the active file database, determine that the file to be stored is an active file; and when the file type of the file to be stored is not found in the active file database, determine that the file to be stored is a non-active file.
11. The device of claim 9, characterized in that, when performing data de-duplication on the file to be stored according to the determined data de-duplication rule, the processor is specifically configured to:
divide the file to be stored into multiple data blocks according to the block-level data de-duplication rule, and calculate the fingerprint information of each data block; compare the fingerprint information of each data block with stored fingerprint information; when the fingerprint information of a data block is identical to stored fingerprint information, store reference information between the data block and the stored fingerprint information that is identical to the fingerprint information of the data block, and discard the data block; and when the fingerprint information of a data block differs from the stored fingerprint information, store the data block and the calculated fingerprint information of the data block.
12. The device of claim 9, characterized in that, when performing data de-duplication on the file to be stored according to the determined data de-duplication rule, the processor is specifically configured to:
select at least part of the file data from the file to be stored according to the file-level data de-duplication rule, and calculate the fingerprint information of the at least part of the file data; compare the calculated fingerprint information of the at least part of the file data with stored fingerprint information; when the calculated fingerprint information of the at least part of the file data is identical to stored fingerprint information, store reference information between the file to be stored and the stored fingerprint information that is identical to the fingerprint information of the at least part of the file data, and discard the file to be stored; and when the calculated fingerprint information differs from the stored fingerprint information, store the file to be stored and the calculated fingerprint information of the at least part of the file data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310230732.5A CN103309975B (en) | 2013-06-09 | 2013-06-09 | Duplicated data deleting method and apparatus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310230732.5A CN103309975B (en) | 2013-06-09 | 2013-06-09 | Duplicated data deleting method and apparatus |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103309975A (en) | 2013-09-18 |
CN103309975B (en) | 2017-04-26 |
Family
ID=49135193
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310230732.5A Active CN103309975B (en) | 2013-06-09 | 2013-06-09 | Duplicated data deleting method and apparatus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103309975B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104753972A (en) * | 2013-12-25 | 2015-07-01 | 腾讯科技(深圳)有限公司 | Network resource collection processing method and server |
CN104933010B (en) * | 2014-03-18 | 2019-02-19 | 华为技术有限公司 | A kind of data de-duplication method and device |
CN105589803B (en) * | 2014-10-24 | 2018-12-28 | 阿里巴巴集团控股有限公司 | A kind of generation method and terminal device of testing tool |
CN105511812B (en) * | 2015-12-10 | 2018-12-18 | 浪潮(北京)电子信息产业有限公司 | A kind of storage system big data optimization method and device |
CN105786655A (en) * | 2016-03-08 | 2016-07-20 | 成都云祺科技有限公司 | Repeated data deleting method for virtual machine backup data |
CN106610792A (en) * | 2016-07-28 | 2017-05-03 | 四川用联信息技术有限公司 | Repeating data deleting algorithm in cloud storage |
CN106294591A (en) * | 2016-07-29 | 2017-01-04 | 北京金山安全软件有限公司 | File storage method and device and electronic equipment |
CN110096483B (en) * | 2019-05-08 | 2021-04-30 | 北京奇艺世纪科技有限公司 | Duplicate file detection method, terminal and server |
CN111143288A (en) * | 2019-12-22 | 2020-05-12 | 北京浪潮数据技术有限公司 | Data storage method, system and related device |
CN112559452B (en) * | 2020-12-11 | 2021-12-17 | 北京云宽志业网络技术有限公司 | Data deduplication processing method, device, equipment and storage medium |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9275067B2 (en) * | 2009-03-16 | 2016-03-01 | International Busines Machines Corporation | Apparatus and method to sequentially deduplicate data |
CN101882141A (en) * | 2009-05-08 | 2010-11-10 | 北京众志和达信息技术有限公司 | Method and system for implementing repeated data deletion |
US8396899B2 (en) * | 2009-11-23 | 2013-03-12 | Dell Products L.P. | Efficient segment detection for deduplication |
CN101706825B (en) * | 2009-12-10 | 2011-04-20 | 华中科技大学 | Replicated data deleting method based on file content types |
CN101908077B (en) * | 2010-08-27 | 2012-11-21 | 华中科技大学 | Duplicated data deleting method applicable to cloud backup |
2013-06-09: CN application CN201310230732.5A, patent CN103309975B (en), status Active
Also Published As
Publication number | Publication date |
---|---|
CN103309975A (en) | 2013-09-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||