CN104809256A - Data deduplication method and system - Google Patents

Data deduplication method and system

Info

Publication number
CN104809256A
CN104809256A (application number CN201510266694A / CN 201510266694)
Authority
CN
Grant status
Application
Patent type
Prior art keywords
data
deduplication
information
method
similarity
Prior art date
Application number
CN 201510266694
Other languages
Chinese (zh)
Inventor
王大亮
杨琪
Original Assignee
数据堂(北京)科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/30 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 17/30286 - Information retrieval in structured data stores
    • G06F 17/30289 - Database design, administration or maintenance
    • G06F 17/30303 - Improving data quality; Data cleansing
    • G06F 17/30386 - Retrieval requests
    • G06F 17/30424 - Query processing
    • G06F 17/30522 - Query processing with adaptation to user needs
    • G06F 17/30525 - Query processing with adaptation to user needs using data annotations (user-defined metadata)
    • G06F 17/30533 - Other types of queries
    • G06F 17/30536 - Approximate and statistical query processing
    • G06F 17/30943 - Details of database functions independent of the retrieved data type
    • G06F 17/30997 - Retrieval based on associated metadata

Abstract

The invention discloses a data deduplication method and a data deduplication system. The method includes the following steps: metadata information of data to be processed is compared with metadata information of data already stored on a data platform to obtain a metadata information similarity; first data description information is compared with second data description information to obtain a data description similarity; a weighted average of the metadata information similarity and the data description similarity is computed to obtain a total similarity; the stored data is sorted according to the total similarity; and the first n items of the sorted stored data are marked as suspected duplicate data. With the disclosed method and system, the range of data to be deduplicated is narrowed, which effectively reduces the workload of manual deduplication and keeps it within an acceptable range.

Description

Data deduplication method and system

TECHNICAL FIELD

[0001] The present invention relates to the field of data analysis, and in particular to a data deduplication method and system.

BACKGROUND

[0002] The present invention is mainly directed at deduplicating data held in a data platform. A data platform is a system that hosts massive amounts of data, such as a data sharing and trading platform. Data deduplication means identifying multiple copies of the same data that exist under different names, authors, sources, or formats, so that the same data is not stored in the data platform in different forms.

[0003] Because data in a data platform can be shared and traded, duplicate data in the platform troubles data users and causes losses to data providers. For example, a piece of data may be uploaded to the data platform by data provider A and then uploaded again by data provider B. Without deduplication, a data user may waste money, time, and effort downloading two copies of the same content. For the data provider, assuming provider A is the legitimate copyright owner of the data, provider A loses the revenue it would have earned from that data user, because the user obtained the same data from provider B. Data deduplication is therefore very important for a data platform.

[0004] 现有技术中的数据去重方法,主要是对待存储的数据建立摘要或指纹。 Data [0004] The prior art method of de-emphasis, the data to be stored primarily to establish a fingerprint or digest. 通常采用计算数据的哈希值(包括md5,crc32,sha256等算法)的方式建立摘要或指纹。 Establishing a fingerprint or digest usually hash value calculation data (including md5, crc32, sha256 algorithms) manner. 然后将待存储数据的哈希值与已存储数据的哈希值进行比对,如果相同,即判定待存储数据与某个已存储数据相同。 The hash value is then a hash value data to be stored with the stored data for comparison, if the same, i.e., the stored data is determined to be identical to the one stored data. 之后,再采取进一步措施删除重复数据。 After that, and then take further steps to delete duplicate data.

[0005] However, this approach is not suitable for a data platform. On the one hand, because the platform stores a great deal of data, computing hash values for so much data is too expensive, and storing the hash values also takes up considerable space. PB-scale data typically generates a TB-scale hash table, which not only occupies a large amount of storage but also reduces the efficiency of hash lookups, lowering deduplication efficiency. On the other hand, because the volume of stored data is large, the probability of hash collisions is also higher, which can cause data that is actually different to be misjudged as duplicate.
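As a rough aside (not from the original text), the birthday bound illustrates how collision risk grows with the number of stored items: for n items hashed to b bits, the probability that at least two collide is approximately

```latex
P_{\mathrm{collision}} \approx 1 - e^{-\,n(n-1)/2^{\,b+1}} \approx \frac{n^{2}}{2^{\,b+1}} \qquad (n \ll 2^{b/2})
```

so the risk rises roughly quadratically with n, which is the concern raised above for platforms holding very large numbers of data items.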

[0006] For these reasons, in the prior art, deduplication of data in a data platform can only be performed manually. Because the volume of data in the platform is so large, however, manual deduplication is very inefficient.

[0007] There is therefore an urgent need for a deduplication method that can effectively narrow the range of data to be deduplicated, so that the workload of manual deduplication is kept within an acceptable range.

SUMMARY

[0008] An object of the present invention is to provide a data deduplication method and system that can effectively narrow the range of data to be deduplicated, so that the workload of manual deduplication is kept within an acceptable range.

[0009] To achieve the above object, the present invention provides the following solutions:

[0010] A data deduplication method, comprising:

[0011] acquiring data to be processed that has been uploaded to a data platform;

[0012] determining metadata information of the data to be processed;

[0013] comparing the metadata information of the data to be processed with metadata information of data already stored on the data platform to obtain a metadata information similarity;

[0014] acquiring first data description information of the data to be processed;

[0015] acquiring second data description information of the stored data;

[0016] comparing the first data description information with the second data description information to obtain a data description similarity;

[0017] computing a weighted average of the metadata information similarity and the data description similarity to obtain a total similarity;

[0018] sorting the stored data according to the total similarity;

[0019] marking the first n items of the sorted stored data as suspected duplicate data.

[0020] Optionally, after the data to be processed is marked as suspected duplicate data, the method further comprises:

[0021] sending a data list containing information on the suspected duplicate data to a manual review client, so that the suspected duplicate data and the stored data can be reviewed manually; the data list is composed of the information of the sorted stored data.

[0022] Optionally, after the data list containing information on the suspected duplicate data is sent to the manual review client, the method further comprises:

[0023] when the suspected duplicate data differs from the data to be processed, saving the data to be processed to the data platform.

[0024] Optionally, comparing the first data description information with the second data description information to obtain the data description similarity specifically comprises:

[0025] using the SimHash algorithm to compute the Hamming distance between the first data description information and the second data description information, and determining the data description similarity between the first data description information and the second data description information according to the Hamming distance.

[0026] A data deduplication system, comprising:

[0027] a to-be-processed data acquisition unit, configured to acquire data to be processed that has been uploaded to a data platform;

[0028] a metadata information determination unit, configured to determine metadata information of the data to be processed;

[0029] a metadata information comparison unit, configured to compare the metadata information of the data to be processed with metadata information of data already stored on the data platform to obtain a metadata information similarity;

[0030] a first data description information acquisition unit, configured to acquire first data description information of the data to be processed;

[0031] a second data description information acquisition unit, configured to acquire second data description information of the stored data;

[0032] a data description information comparison unit, configured to compare the first data description information with the second data description information to obtain a data description similarity;

[0033] a total similarity calculation unit, configured to compute a weighted average of the metadata information similarity and the data description similarity to obtain a total similarity;

[0034] a sorting unit, configured to sort the stored data according to the total similarity;

[0035] a suspected duplicate data marking unit, configured to mark the first n items of the sorted stored data as suspected duplicate data.

[0036] Optionally, the system further comprises:

[0037] a suspected duplicate data sending unit, configured to, after the data to be processed is marked as suspected duplicate data, send a data list containing information on the suspected duplicate data to a manual review client, so that the suspected duplicate data and the stored data can be reviewed manually; the data list is composed of the information of the sorted stored data.

[0038] Optionally, the system further comprises:

[0039] a to-be-processed data saving unit, configured to, after the data list containing information on the suspected duplicate data is sent to the manual review client, save the data to be processed to the data platform when the suspected duplicate data differs from the data to be processed.

[0040] Optionally, the data description information comparison unit specifically comprises:

[0041] a Hamming distance calculation subunit, configured to use the SimHash algorithm to compute the Hamming distance between the first data description information and the second data description information, and to determine the data description similarity between the first data description information and the second data description information according to the Hamming distance.

[0042] According to the specific embodiments provided by the present invention, the invention discloses the following technical effects:

[0043] In the data deduplication method and system of the embodiments of the present invention, the metadata information of the data to be processed is compared with the metadata information of the data stored on the data platform to obtain a metadata information similarity; the first data description information is compared with the second data description information to obtain a data description similarity; a weighted average of the metadata information similarity and the data description similarity yields a total similarity; the stored data is sorted according to the total similarity; and the first n items of the sorted stored data are marked as suspected duplicate data. This narrows the range of data to be deduplicated, effectively reducing the workload of manual deduplication so that it is kept within an acceptable range.

BRIEF DESCRIPTION OF THE DRAWINGS

[0044] To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings required by the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

[0045] FIG. 1 is a flowchart of an embodiment of the data deduplication method of the present invention;

[0046] FIG. 2 is a structural diagram of an embodiment of the data deduplication system of the present invention.

DETAILED DESCRIPTION

[0047] The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

[0048] To make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.

[0049] FIG. 1 is a flowchart of an embodiment of the data deduplication method of the present invention. As shown in FIG. 1, the method may comprise:

[0050] Step 101: acquire data to be processed that has been uploaded to a data platform.

[0051] The data to be processed may be of various types, for example text data, image data, and so on.

[0052] Step 102: determine metadata information of the data to be processed.

[0053] The metadata information may be keywords that summarize the data to be processed.

[0054] For example, the metadata information may include the data ID, title, category, format, keywords, source, and similar information.

[0055] Step 103: compare the metadata information of the data to be processed with the metadata information of data already stored on the data platform to obtain a metadata information similarity.

[0056] The data stored on the data platform also has corresponding metadata information, so the metadata information of the data to be processed can be compared with the metadata information of the stored data.

[0057] The metadata information similarity can be determined from the number of metadata fields that are identical between the data to be processed and the stored data. For example, the number of identical metadata fields can be divided by the total number of metadata fields of the data to be processed, giving the proportion of identical fields among all metadata fields; this proportion is taken as the metadata information similarity. Suppose the metadata information consists of six fields in total: data ID, title, category, format, keywords, and source. If four of the metadata fields of the data to be processed are identical to those of the stored data, the similarity is determined to be 66.7%.
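As a minimal sketch of the field-matching ratio just described (for illustration only; the field names follow the example in paragraph [0054], and the dictionary-based record layout is an assumption):

```python
METADATA_FIELDS = ["data_id", "title", "category", "format", "keywords", "source"]

def metadata_similarity(pending: dict, stored: dict) -> float:
    """Proportion of metadata fields whose values match exactly."""
    matches = sum(1 for field in METADATA_FIELDS if pending.get(field) == stored.get(field))
    return matches / len(METADATA_FIELDS)

# Example from the text: 4 of 6 fields identical -> 4 / 6 = 0.667 (66.7%).
```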

[0058] Step 104: acquire first data description information of the data to be processed.

[0059] The data description information is information that describes the content of the data. It is usually written by a human editor.

[0060] Suppose the data to be stored is face-image data of 40 randomly sampled Asian persons. The corresponding first data description information could then be "face image information of 40 randomly sampled Asian persons".

[0061] Step 105: acquire second data description information of the stored data.

[0062] Step 106: compare the first data description information with the second data description information to obtain a data description similarity.

[0063] Specifically, the SimHash algorithm can be used to compute the Hamming distance between the first data description information and the second data description information, and the data description similarity between the first data description information and the second data description information is determined according to the Hamming distance.
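The patent names SimHash and the Hamming distance but gives no implementation, so the following is only a sketch of one common realization: the whitespace tokenization, the 64-bit fingerprint width, the per-token MD5 hash, and the mapping from Hamming distance to a similarity score are all assumptions made for illustration:

```python
import hashlib
from collections import Counter

def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint over whitespace-separated tokens."""
    weights = [0] * bits
    for token, count in Counter(text.split()).items():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            weights[i] += count if (h >> i) & 1 else -count
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which the two fingerprints differ."""
    return bin(a ^ b).count("1")

def description_similarity(desc_a: str, desc_b: str, bits: int = 64) -> float:
    """One possible distance-to-similarity mapping: 1 - (Hamming distance / fingerprint width)."""
    return 1.0 - hamming_distance(simhash(desc_a, bits), simhash(desc_b, bits)) / bits
```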

[0064] Step 107: compute a weighted average of the metadata information similarity and the data description similarity to obtain a total similarity.

[0065] Specifically, different weights can be assigned to the metadata information similarity and the data description similarity. For example, the data description similarity can be given a first weight and the metadata information similarity a second weight, with the first weight greater than the second. This increases the share of the data description similarity in the total similarity, making the similarity calculation more accurate.
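Written out as a formula (the symbols are chosen here for illustration and do not appear in the patent), the total similarity is a weighted average in which the description weight dominates:

```latex
S_{total} = w_1 \, S_{desc} + w_2 \, S_{meta}, \qquad w_1 + w_2 = 1, \quad w_1 > w_2 > 0
```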

[0066] Step 108: sort the stored data according to the total similarity.

[0067] Specifically, the stored data can be sorted in descending order of total similarity, so that the stored data with the highest total similarity comes first.

[0068] Step 109: mark the first n items of the sorted stored data as suspected duplicate data.

[0069] Here n is a natural number whose value can be set according to actual needs; for example, n may be 8, 9, 10, and so on.
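Putting steps 107 through 109 together as a hedged sketch (the record layout, the 0.6/0.4 weights, and the helper names are assumptions for illustration; metadata_similarity and description_similarity refer to the sketches given earlier):

```python
def rank_suspected_duplicates(pending: dict, stored_records: list, n: int = 10,
                              w_desc: float = 0.6, w_meta: float = 0.4) -> list:
    """Score every stored record against the pending data and return the top n as suspects."""
    scored = []
    for record in stored_records:
        s_meta = metadata_similarity(pending["metadata"], record["metadata"])
        s_desc = description_similarity(pending["description"], record["description"])
        scored.append((w_desc * s_desc + w_meta * s_meta, record))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest total similarity first
    return [record for _, record in scored[:n]]          # first n are marked as suspected duplicates
```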

[0070] After the data to be processed has been marked as suspected duplicate data, it can be handed over for manual review. Suspected duplicate data need not be saved to the data platform.

[0071] In summary, in this embodiment, the metadata information of the data to be processed is compared with the metadata information of the data stored on the data platform to obtain a metadata information similarity; the first data description information is compared with the second data description information to obtain a data description similarity; a weighted average of the metadata information similarity and the data description similarity yields a total similarity; the stored data is sorted according to the total similarity; and the first n items of the sorted stored data are marked as suspected duplicate data. This narrows the range of data to be deduplicated, effectively reducing the workload of manual deduplication so that it is kept within an acceptable range.

[0072] In practice, after the data to be processed is marked as suspected duplicate data, the method may further comprise the following step:

[0073] sending a data list containing information on the suspected duplicate data to a manual review client, so that the suspected duplicate data and the stored data can be reviewed manually; the data list is composed of the information of the sorted stored data.

[0074] Because of the diversity and complexity of data content itself, the comparison of data cannot be handled entirely by a computer. In particular, for a relatively large dataset, whether data newly generated by deleting or modifying parts of the original should still be considered the same dataset as the original can only be decided by manual review. The above step therefore makes the comparison of data more accurate.

[0075] In practice, after the data list containing information on the suspected duplicate data is sent to the manual review client, the method may further comprise the following step:

[0076] when the suspected duplicate data differs from the data to be processed, saving the data to be processed to the data platform.

[0077] When manual review confirms that the data to be stored differs from the stored data, the probability that the two are actually identical is essentially zero. It can therefore be concluded that the data to be stored is not the same as the stored data, and the data to be processed is saved to the data platform.

[0078] The present invention also discloses a data deduplication system. FIG. 2 is a structural diagram of an embodiment of the data deduplication system of the present invention. As shown in FIG. 2, the system may comprise:

[0079] a to-be-processed data acquisition unit 201, configured to acquire data to be processed that has been uploaded to a data platform;

[0080] a metadata information determination unit 202, configured to determine metadata information of the data to be processed;

[0081] a metadata information comparison unit 203, configured to compare the metadata information of the data to be processed with metadata information of data already stored on the data platform to obtain a metadata information similarity;

[0082] a first data description information acquisition unit 204, configured to acquire first data description information of the data to be processed;

[0083] a second data description information acquisition unit 205, configured to acquire second data description information of the stored data;

[0084] a data description information comparison unit 206, configured to compare the first data description information with the second data description information to obtain a data description similarity;

[0085] a total similarity calculation unit 207, configured to compute a weighted average of the metadata information similarity and the data description similarity to obtain a total similarity;

[0086] a sorting unit 208, configured to sort the stored data according to the total similarity;

[0087] a suspected duplicate data marking unit 209, configured to mark the first n items of the sorted stored data as suspected duplicate data.

[0088] In this embodiment, the metadata information of the data to be processed is compared with the metadata information of the data stored on the data platform to obtain a metadata information similarity; the first data description information is compared with the second data description information to obtain a data description similarity; a weighted average of the metadata information similarity and the data description similarity yields a total similarity; the stored data is sorted according to the total similarity; and the first n items of the sorted stored data are marked as suspected duplicate data. This narrows the range of data to be deduplicated, effectively reducing the workload of manual deduplication so that it is kept within an acceptable range.

[0089] In practice, the system may further comprise:

[0090] a suspected duplicate data sending unit, configured to, after the data to be processed is marked as suspected duplicate data, send a data list containing information on the suspected duplicate data to a manual review client, so that the suspected duplicate data and the stored data can be reviewed manually; the data list is composed of the information of the sorted stored data.

[0091] In practice, the system may further comprise:

[0092] a to-be-processed data saving unit, configured to, after the data list containing information on the suspected duplicate data is sent to the manual review client, save the data to be processed to the data platform when the suspected duplicate data differs from the data to be processed.

[0093] In practice, the data description information comparison unit 206 may specifically comprise:

[0094] a Hamming distance calculation subunit, configured to use the SimHash algorithm to compute the Hamming distance between the first data description information and the second data description information, and to determine the data description similarity between the first data description information and the second data description information according to the Hamming distance.

[0095] The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the description of the method.

[0096] Specific examples have been used herein to explain the principles and implementation of the present invention; the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementation and scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (8)

  1. A data deduplication method, characterized by comprising: acquiring data to be processed that has been uploaded to a data platform; determining metadata information of the data to be processed; comparing the metadata information of the data to be processed with metadata information of data already stored on the data platform to obtain a metadata information similarity; acquiring first data description information of the data to be processed; acquiring second data description information of the stored data; comparing the first data description information with the second data description information to obtain a data description similarity; computing a weighted average of the metadata information similarity and the data description similarity to obtain a total similarity; sorting the stored data according to the total similarity; and marking the first n items of the sorted stored data as suspected duplicate data.
  2. The method according to claim 1, characterized in that, after the data to be processed is marked as suspected duplicate data, the method further comprises: sending a data list containing information on the suspected duplicate data to a manual review client, so that the suspected duplicate data and the stored data can be reviewed manually; the data list is composed of the information of the sorted stored data.
  3. The method according to claim 1, characterized in that, after the data list containing information on the suspected duplicate data is sent to the manual review client, the method further comprises: when the suspected duplicate data differs from the data to be processed, saving the data to be processed to the data platform.
  4. The method according to claim 1, characterized in that comparing the first data description information with the second data description information to obtain the data description similarity specifically comprises: using the SimHash algorithm to compute the Hamming distance between the first data description information and the second data description information, and determining the data description similarity between the first data description information and the second data description information according to the Hamming distance.
  5. A data deduplication system, characterized by comprising: a to-be-processed data acquisition unit, configured to acquire data to be processed that has been uploaded to a data platform; a metadata information determination unit, configured to determine metadata information of the data to be processed; a metadata information comparison unit, configured to compare the metadata information of the data to be processed with metadata information of data already stored on the data platform to obtain a metadata information similarity; a first data description information acquisition unit, configured to acquire first data description information of the data to be processed; a second data description information acquisition unit, configured to acquire second data description information of the stored data; a data description information comparison unit, configured to compare the first data description information with the second data description information to obtain a data description similarity; a total similarity calculation unit, configured to compute a weighted average of the metadata information similarity and the data description similarity to obtain a total similarity; a sorting unit, configured to sort the stored data according to the total similarity; and a suspected duplicate data marking unit, configured to mark the first n items of the sorted stored data as suspected duplicate data.
  6. The system according to claim 5, characterized in that it further comprises: a suspected duplicate data sending unit, configured to, after the data to be processed is marked as suspected duplicate data, send a data list containing information on the suspected duplicate data to a manual review client, so that the suspected duplicate data and the stored data can be reviewed manually; the data list is composed of the information of the sorted stored data.
  7. The system according to claim 5, characterized in that it further comprises: a to-be-processed data saving unit, configured to, after the data list containing information on the suspected duplicate data is sent to the manual review client, save the data to be processed to the data platform when the suspected duplicate data differs from the data to be processed.
  8. The system according to claim 5, characterized in that the data description information comparison unit specifically comprises: a Hamming distance calculation subunit, configured to use the SimHash algorithm to compute the Hamming distance between the first data description information and the second data description information, and to determine the data description similarity between the first data description information and the second data description information according to the Hamming distance.
CN 201510266694 2015-05-22 2015-05-22 Data deduplication method and system CN104809256A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201510266694 CN104809256A (en) 2015-05-22 2015-05-22 Data deduplication method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201510266694 CN104809256A (en) 2015-05-22 2015-05-22 Data deduplication method and system

Publications (1)

Publication Number Publication Date
CN104809256A (en) 2015-07-29

Family

ID=53694078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201510266694 CN104809256A (en) 2015-05-22 2015-05-22 Data deduplication method and system

Country Status (1)

Country Link
CN (1) CN104809256A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102622365A (en) * 2011-01-28 2012-08-01 北京百度网讯科技有限公司 Judging system and judging method for web page repeating
US8266115B1 (en) * 2011-01-14 2012-09-11 Google Inc. Identifying duplicate electronic content based on metadata
CN104160712A (en) * 2011-10-30 2014-11-19 谷歌公司 Computing similarity between media programs
CN104216925A (en) * 2013-06-05 2014-12-17 中国科学院声学研究所 Repetition deleting processing method for video content


Similar Documents

Publication Publication Date Title
US20140095439A1 (en) Optimizing data block size for deduplication
US20050216433A1 (en) Identification of input files using reference files associated with nodes of a sparse binary tree
US7707157B1 (en) Document near-duplicate detection
US20060004808A1 (en) System and method for performing compression/encryption on data such that the number of duplicate blocks in the transformed data is increased
US7587401B2 (en) Methods and apparatus to compress datasets using proxies
US7478113B1 (en) Boundaries
US20110099200A1 (en) Data sharing and recovery within a network of untrusted storage devices using data object fingerprinting
US6915344B1 (en) Server stress-testing response verification
US20040210575A1 (en) Systems and methods for eliminating duplicate documents
US20120185434A1 (en) Data synchronization
Garfinkel Digital media triage with bulk data analysis and bulk_extractor
US6904430B1 (en) Method and system for efficiently identifying differences between large files
US20110208744A1 (en) Methods for detecting and removing duplicates in video search results
US20090327505A1 (en) Content Identification for Peer-to-Peer Content Retrieval
CN102156727A (en) Method for deleting repeated data by using double-fingerprint hash check
US20130162902A1 (en) System and method for verification of media content synchronization
US20070097420A1 (en) Method and mechanism for retrieving images
US20080256093A1 (en) Method and System for Detection of Authors
Roussev Hashing and data fingerprinting in digital forensics
Roussev et al. Multi-resolution similarity hashing
US8078642B1 (en) Concurrent traversal of multiple binary trees
US20070098259A1 (en) Method and mechanism for analyzing the texture of a digital image
CN101595459A (en) Methods and systems for quick and efficient data management and/or processing
US20110283085A1 (en) System and method for end-to-end data integrity in a network file system
CN102571709A (en) Method for uploading file, client, server and system

Legal Events

Date Code Title Description
C06 Publication
EXSB Decision made by SIPO to initiate substantive examination