CN104809256A - Data deduplication method and data deduplication system - Google Patents


Info

Publication number
CN104809256A
Authority
CN
China
Prior art keywords
data
information
similarity
specifying
doubtful
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510266694.8A
Other languages
Chinese (zh)
Inventor
王大亮
杨琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Data Hall (beijing) Polytron Technologies Inc
Original Assignee
Data Hall (beijing) Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Data Hall (beijing) Polytron Technologies Inc filed Critical Data Hall (beijing) Polytron Technologies Inc
Priority to CN201510266694.8A priority Critical patent/CN104809256A/en
Publication of CN104809256A publication Critical patent/CN104809256A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24573 Query processing with adaptation to user needs using data annotations, e.g. user-defined metadata
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/907 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually

Abstract

The invention discloses a data deduplication method and a data deduplication system. The method includes the following steps: the metadata information of the data to be processed is compared with the metadata information of the stored data of a data platform to obtain a metadata information similarity; first data description information is compared with second data description information to obtain a data description similarity; a weighted average of the metadata information similarity and the data description similarity is computed to obtain a total similarity; the stored data are sorted according to the total similarity; and the first n items of the sorted stored data are marked as suspected duplicate data. The disclosed method and system narrow the scope of data deduplication, which effectively reduces the workload of manual deduplication and keeps it within an acceptable range.

Description

A data deduplication method and system
Technical field
The present invention relates to the field of data analysis, and in particular to a data deduplication method and system.
Background
The present invention mainly concerns deduplication of the data held in a data platform. A data platform is a system that carries massive amounts of data, such as a data sharing and trading platform. Data deduplication means identifying multiple copies of the same data that exist under different names, authors, sources, or formats, so that the same data is not kept in the data platform in different forms.
Because the data in a data platform can be shared and traded, duplicate data in the platform causes confusion for data consumers and can also cause losses for data providers. For example, after data provider A uploads a piece of data to the data platform, data provider B uploads the same data to the platform again. If no deduplication is performed, a data consumer may download two items with identical content and waste money, time, and effort. As for the data provider, suppose data provider A is the copyright owner and lawful holder of the data: if the data consumer adopts the identical data supplied by data provider B, data provider A loses the income it would have earned by supplying the data to that consumer. Data deduplication is therefore very important for a data platform.
Prior-art data deduplication methods mainly build a digest or fingerprint of the data to be stored, usually by computing a hash value of the data (using algorithms such as MD5, CRC32, or SHA-256). The hash value of the data to be stored is then compared with the hash values of the stored data; if two values are identical, the data to be stored is judged to be identical to that stored item. Further measures are then taken to delete the duplicate data.
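The prior-art fingerprint comparison described above can be illustrated with a minimal Python sketch. The class `FingerprintIndex` and its method names are hypothetical and do not come from the patent; only the standard-library `hashlib` module is used, and SHA-256 is chosen arbitrarily from the algorithms the text lists.

```python
import hashlib
from typing import Dict, Optional


def fingerprint(payload: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of the payload under the chosen hash algorithm."""
    return hashlib.new(algorithm, payload).hexdigest()


class FingerprintIndex:
    """Hypothetical in-memory index mapping fingerprints to stored item IDs."""

    def __init__(self) -> None:
        self._by_digest: Dict[str, str] = {}

    def add(self, item_id: str, payload: bytes) -> None:
        """Record the fingerprint of a stored item."""
        self._by_digest[fingerprint(payload)] = item_id

    def find_duplicate(self, payload: bytes) -> Optional[str]:
        """Return the ID of a stored item with an identical digest, if any."""
        return self._by_digest.get(fingerprint(payload))
```

A copy that differs by even one byte (for example, the same data saved in another format) yields a different digest, so this kind of check only catches byte-identical copies.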
However, the above method is not suitable for a data platform. On the one hand, because a data platform stores a great deal of data, computing hash values over massive data is prohibitively expensive, and storing the hash values also takes considerable storage space. Petabyte-scale data typically generates a terabyte-scale hash table, which not only occupies a large amount of storage but also lowers the efficiency of hash lookups and therefore of deduplication. On the other hand, because the volume of data stored in a data platform is very large, the likelihood of hash collisions is also higher, which can cause data that is actually different to be misjudged as duplicate data.
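The petabyte-to-terabyte relationship can be checked with a rough calculation. The patent does not say how the hash table is organised, so the figures below are assumptions made only for illustration: data is hashed in fixed 8 KiB chunks, and each chunk is represented by a 32-byte digest (the length of a SHA-256 digest).

```python
PIB = 2 ** 50  # bytes in one pebibyte
TIB = 2 ** 40  # bytes in one tebibyte


def hash_table_bytes(total_bytes: int, chunk_bytes: int, digest_bytes: int) -> int:
    """Space needed to keep one digest per chunk (digests only, no index overhead)."""
    chunks = -(-total_bytes // chunk_bytes)  # ceiling division
    return chunks * digest_bytes


# Assumed parameters: 1 PiB of data, 8 KiB chunks, 32-byte digests.
size = hash_table_bytes(PIB, 8 * 1024, 32)
print(size / TIB)  # 4.0
```

Under these assumptions the digests alone occupy about 4 TiB, which is consistent in order of magnitude with the claim in the text.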
For these reasons, in the prior art the deduplication work in a data platform can only be handed over to manual processing. However, because the volume of data in a data platform is so large, manual deduplication is very inefficient.
Therefore, there is an urgent need for a data deduplication method that effectively narrows the scope of deduplication, so that the workload of manual deduplication is kept within an acceptable range.
Summary of the invention
The object of the present invention is to provide a data deduplication method and system that effectively narrow the scope of deduplication, so that the workload of manual deduplication is kept within an acceptable range.
To achieve the above object, the present invention provides the following solution:
A data deduplication method, comprising:
Obtaining data to be processed that has been uploaded to a data platform;
Determining the metadata information of the data to be processed;
Comparing the metadata information of the data to be processed with the metadata information of the stored data of the data platform to obtain a metadata information similarity;
Obtaining first data description information of the data to be processed;
Obtaining second data description information of the stored data;
Comparing the first data description information with the second data description information to obtain a data description similarity;
Computing a weighted average of the metadata information similarity and the data description similarity to obtain a total similarity;
Sorting the stored data according to the total similarity;
Marking the first n items of the sorted stored data as suspected duplicate data.
Optionally, after the data to be processed is marked as suspected duplicate data, the method further comprises:
Sending a data list containing the information of the suspected duplicate data to a manual review client, so that the suspected duplicate data and the stored data can be reviewed manually; the data list consists of the information of the sorted stored data.
Optionally, after the data list containing the information of the suspected duplicate data is sent to the manual review client, the method further comprises:
When the suspected duplicate data differs from the data to be processed, saving the data to be processed to the data platform.
Optionally, comparing the first data description information with the second data description information to obtain the data description similarity specifically comprises:
Using the SimHash algorithm to compute the Hamming distance between the first data description information and the second data description information, and determining the data description similarity between the two from that Hamming distance.
A data deduplication system, comprising:
A to-be-processed data acquisition unit, configured to obtain data to be processed that has been uploaded to a data platform;
A metadata information determination unit, configured to determine the metadata information of the data to be processed;
A metadata information comparison unit, configured to compare the metadata information of the data to be processed with the metadata information of the stored data of the data platform to obtain a metadata information similarity;
A first data description information acquisition unit, configured to obtain first data description information of the data to be processed;
A second data description information acquisition unit, configured to obtain second data description information of the stored data;
A data description information comparison unit, configured to compare the first data description information with the second data description information to obtain a data description similarity;
A total similarity calculation unit, configured to compute a weighted average of the metadata information similarity and the data description similarity to obtain a total similarity;
A sorting unit, configured to sort the stored data according to the total similarity;
A suspected duplicate data marking unit, configured to mark the first n items of the sorted stored data as suspected duplicate data.
Optionally, the system further comprises:
A suspected duplicate data sending unit, configured to, after the data to be processed is marked as suspected duplicate data, send a data list containing the information of the suspected duplicate data to a manual review client, so that the suspected duplicate data and the stored data can be reviewed manually; the data list consists of the information of the sorted stored data.
Optionally, the system further comprises:
A to-be-processed data saving unit, configured to, after the data list containing the information of the suspected duplicate data is sent to the manual review client, save the data to be processed to the data platform when the suspected duplicate data differs from the data to be processed.
Optionally, the data description information comparison unit specifically comprises:
A Hamming distance calculation subunit, configured to use the SimHash algorithm to compute the Hamming distance between the first data description information and the second data description information, and to determine the data description similarity between the two from that Hamming distance.
According to the specific embodiments provided, the present invention achieves the following technical effects:
The data deduplication method and system in the embodiments of the present invention compare the metadata information of the data to be processed with the metadata information of the stored data of the data platform to obtain a metadata information similarity; compare the first data description information with the second data description information to obtain a data description similarity; compute a weighted average of the two similarities to obtain a total similarity; sort the stored data according to the total similarity; and mark the first n items of the sorted stored data as suspected duplicate data. This narrows the scope of deduplication, effectively reduces the workload of manual deduplication, and keeps that workload within an acceptable range.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of an embodiment of the data deduplication method of the present invention;
Fig. 2 is a structural diagram of an embodiment of the data deduplication system of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
To make the above objects, features, and advantages of the present invention easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a flowchart of an embodiment of the data deduplication method of the present invention. As shown in Fig. 1, the method may include:
Step 101: obtain the data to be processed that has been uploaded to the data platform;
The data to be processed may be of various types, for example text data or image data.
Step 102: determine the metadata information of the data to be processed;
The metadata information may be keywords that summarize the data to be processed.
For example, the metadata information may include the data ID, title, category, format, keywords, and source.
Step 103: compare the metadata information of the data to be processed with the metadata information of the stored data of the data platform to obtain a metadata information similarity;
The stored data of the data platform also has corresponding metadata information, so the metadata information of the data to be processed can be compared with that of the stored data.
The metadata information similarity may be determined from the number of metadata items that are identical between the data to be processed and the stored data. For example, the number of identical metadata items may be divided by the total number of metadata items of the data to be processed, and the resulting proportion taken as the metadata information similarity. Suppose the metadata information comprises six items (data ID, title, category, format, keywords, and source) and four of the items of the data to be processed are identical to those of the stored data; the similarity is then 4/6, or about 66.7%.
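The ratio described in step 103 can be sketched in Python as follows. The field names in `METADATA_FIELDS` and the use of exact equality to decide that two metadata items are "identical" are assumptions for illustration; the patent does not specify how individual items are compared.

```python
# Hypothetical field names for the six metadata items named in the text.
METADATA_FIELDS = ("data_id", "title", "category", "format", "keywords", "source")


def metadata_similarity(pending: dict, stored: dict, fields=METADATA_FIELDS) -> float:
    """Fraction of the metadata items whose values match exactly."""
    if not fields:
        return 0.0
    matches = sum(
        1 for name in fields
        if name in pending and pending[name] == stored.get(name)
    )
    return matches / len(fields)


pending = {"data_id": "D-100", "title": "Asian faces", "category": "image",
           "format": "jpg", "keywords": "face", "source": "provider B"}
stored = {"data_id": "D-007", "title": "Asian faces", "category": "image",
          "format": "jpg", "keywords": "face", "source": "provider A"}
print(round(metadata_similarity(pending, stored), 3))  # 0.667
```

Here four of the six items match (title, category, format, keywords), reproducing the 66.7% figure from the example in the text.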
Step 104: obtain the first data description information of the data to be processed;
Data description information is information that describes the content of the data. It is usually written by a human editor.
Suppose a piece of data to be stored consists of face image information for 40 randomly sampled Asian people. The corresponding first data description information could then be "face image information of 40 random Asian people".
Step 105: obtain the second data description information of the stored data;
Step 106: compare the first data description information with the second data description information to obtain a data description similarity;
Specifically, the SimHash algorithm may be used to compute the Hamming distance between the first data description information and the second data description information, and the data description similarity between the two is determined from that Hamming distance.
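A minimal SimHash sketch in Python is given below. Several choices are assumptions not stated in the patent: tokens are extracted with a simple `\w+` pattern (real Chinese descriptions would need proper word segmentation), every token has weight 1, per-token hashes are taken from MD5, and the Hamming distance is mapped to a similarity as `1 - distance / bits`. The patent only says the similarity is determined from the Hamming distance.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint; bits must be a multiple of 8, at most 128."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (value >> i) & 1 else -1
    result = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            result |= 1 << i
    return result


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which the two fingerprints differ."""
    return bin(a ^ b).count("1")


def description_similarity(first: str, second: str, bits: int = 64) -> float:
    """Map the Hamming distance to [0, 1]; this mapping is an assumption."""
    distance = hamming_distance(simhash(first, bits), simhash(second, bits))
    return 1.0 - distance / bits
```

Identical descriptions give a distance of 0 and a similarity of 1.0; descriptions that share most of their tokens tend to give small distances, which is what makes SimHash suitable for near-duplicate text.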
Step 107: compute a weighted average of the metadata information similarity and the data description similarity to obtain a total similarity;
Specifically, different weights may be assigned to the metadata information similarity and the data description similarity. For example, a first weight may be assigned to the data description similarity and a second weight to the metadata information similarity, with the first weight greater than the second. This increases the share of the data description similarity in the total similarity and makes the similarity calculation more accurate.
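The weighted average of step 107 can be sketched as follows. The default weights 0.6 and 0.4 are placeholders chosen only so that the description weight exceeds the metadata weight; the patent does not give numeric values.

```python
def total_similarity(
    metadata_sim: float,
    description_sim: float,
    description_weight: float = 0.6,  # assumed value, not from the patent
    metadata_weight: float = 0.4,     # assumed value, not from the patent
) -> float:
    """Weighted average of the two similarities, normalised by the weight sum."""
    if description_weight < 0 or metadata_weight < 0:
        raise ValueError("weights must be non-negative")
    weight_sum = description_weight + metadata_weight
    if weight_sum == 0:
        raise ValueError("at least one weight must be positive")
    weighted = description_weight * description_sim + metadata_weight * metadata_sim
    return weighted / weight_sum
```

Dividing by the weight sum keeps the result in the same range as the inputs even when the weights do not add up to 1.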
Step 108: sort the stored data according to the total similarity;
Specifically, the stored data may be sorted in descending order of total similarity, so that the stored item with the highest total similarity comes first.
Step 109: mark the first n items of the sorted stored data as suspected duplicate data;
Here n is a natural number whose value can be set as required, for example 8, 9, or 10.
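Steps 108 and 109 amount to a descending sort followed by taking the first n entries. The sketch below assumes each stored item has already been scored; the `ScoredItem` type and the function name are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class ScoredItem(NamedTuple):
    """Hypothetical pairing of a stored item's ID with its total similarity."""
    item_id: str
    total_similarity: float


def mark_suspected_duplicates(scored: Iterable[ScoredItem], n: int = 10) -> List[ScoredItem]:
    """Sort by total similarity, highest first, and return the first n items."""
    if n < 0:
        raise ValueError("n must be a natural number")
    ranked = sorted(scored, key=lambda item: item.total_similarity, reverse=True)
    return ranked[:n]
```

For a very large platform, a partial selection such as `heapq.nlargest` would avoid sorting every stored item, though the patent itself describes a full sort.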
After the data to be processed is marked as suspected duplicate data, it can be handed over for manual review. Suspected duplicate data may be withheld from the data platform rather than saved to it.
In summary, in this embodiment, the metadata information of the data to be processed is compared with the metadata information of the stored data of the data platform to obtain a metadata information similarity; the first data description information is compared with the second data description information to obtain a data description similarity; a weighted average of the two similarities is computed to obtain a total similarity; the stored data are sorted according to the total similarity; and the first n items of the sorted stored data are marked as suspected duplicate data. This narrows the scope of deduplication, effectively reduces the workload of manual deduplication, and keeps that workload within an acceptable range.
In practical applications, after the data to be processed is marked as suspected duplicate data, the method may further include the following step:
Sending a data list containing the information of the suspected duplicate data to a manual review client, so that the suspected duplicate data and the stored data can be reviewed manually; the data list consists of the information of the sorted stored data.
Because data content is diverse and complex, a computer cannot fully handle the comparison of data on its own. Specifically, when a large piece of data has had parts deleted or modified, whether the newly generated data should be regarded as the same data as the original can only be determined by manual review. The above step therefore makes the comparison more accurate.
In practical applications, after the data list containing the information of the suspected duplicate data is sent to the manual review client, the method may further include the following step:
When the suspected duplicate data differs from the data to be processed, saving the data to be processed to the data platform.
Once manual review confirms that the data to be stored differs from the stored data, the probability that the two are identical is essentially zero. It can then be concluded that the data to be stored and the stored data are not the same, and the data to be processed is saved to the data platform.
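The hand-off after manual review can be sketched as a small decision function. Everything here is hypothetical scaffolding: the patent describes the behaviour (save only when the reviewer finds the data distinct) but not any interface, so `DataPlatform`, `ReviewDecision`, and `apply_review` are illustrative names.

```python
from dataclasses import dataclass, field
from enum import Enum


class ReviewDecision(Enum):
    """Outcome reported by the manual review client (hypothetical)."""
    DUPLICATE = "duplicate"
    DISTINCT = "distinct"


@dataclass
class DataPlatform:
    """Hypothetical stand-in for the data platform's storage."""
    items: dict = field(default_factory=dict)

    def save(self, item_id: str, payload: bytes) -> None:
        self.items[item_id] = payload


def apply_review(platform: DataPlatform, item_id: str, payload: bytes,
                 decision: ReviewDecision) -> bool:
    """Save the pending item only if the reviewer found it distinct.

    Returns True when the item was saved.
    """
    if decision is ReviewDecision.DISTINCT:
        platform.save(item_id, payload)
        return True
    return False
```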
The invention also discloses a data deduplication system. Fig. 2 is a structural diagram of an embodiment of the data deduplication system of the present invention. As shown in Fig. 2, the system may include:
A to-be-processed data acquisition unit 201, configured to obtain data to be processed that has been uploaded to the data platform;
A metadata information determination unit 202, configured to determine the metadata information of the data to be processed;
A metadata information comparison unit 203, configured to compare the metadata information of the data to be processed with the metadata information of the stored data of the data platform to obtain a metadata information similarity;
A first data description information acquisition unit 204, configured to obtain first data description information of the data to be processed;
A second data description information acquisition unit 205, configured to obtain second data description information of the stored data;
A data description information comparison unit 206, configured to compare the first data description information with the second data description information to obtain a data description similarity;
A total similarity calculation unit 207, configured to compute a weighted average of the metadata information similarity and the data description similarity to obtain a total similarity;
A sorting unit 208, configured to sort the stored data according to the total similarity;
A suspected duplicate data marking unit 209, configured to mark the first n items of the sorted stored data as suspected duplicate data.
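One way to picture how units 203 and 206 to 209 fit together is a single function that receives the two comparison steps as callables. This is only a sketch of the data flow, not the patented system: the function name, the record shape, and the default weights and n are assumptions.

```python
from typing import Callable, Iterable, List, Mapping, Tuple

Record = Mapping[str, object]
Similarity = Callable[[Record, Record], float]


def find_suspected_duplicates(
    pending: Record,
    stored: Iterable[Record],
    metadata_sim: Similarity,
    description_sim: Similarity,
    description_weight: float = 0.6,  # assumed value
    metadata_weight: float = 0.4,     # assumed value
    n: int = 10,
) -> List[Tuple[float, Record]]:
    """Score every stored record against the pending one and return the top n."""
    weight_sum = description_weight + metadata_weight
    scored = []
    for item in stored:
        m = metadata_sim(pending, item)       # role of comparison unit 203
        d = description_sim(pending, item)    # role of comparison unit 206
        total = (description_weight * d + metadata_weight * m) / weight_sum  # unit 207
        scored.append((total, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # role of sorting unit 208
    return scored[:n]                                    # role of marking unit 209
```

Passing the similarity functions in as parameters keeps the ranking logic independent of how metadata and descriptions are actually compared.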
In this embodiment, the metadata information of the data to be processed is compared with the metadata information of the stored data of the data platform to obtain a metadata information similarity; the first data description information is compared with the second data description information to obtain a data description similarity; a weighted average of the two similarities is computed to obtain a total similarity; the stored data are sorted according to the total similarity; and the first n items of the sorted stored data are marked as suspected duplicate data. This narrows the scope of deduplication, effectively reduces the workload of manual deduplication, and keeps that workload within an acceptable range.
In practical applications, the system may further include:
A suspected duplicate data sending unit, configured to, after the data to be processed is marked as suspected duplicate data, send a data list containing the information of the suspected duplicate data to a manual review client, so that the suspected duplicate data and the stored data can be reviewed manually; the data list consists of the information of the sorted stored data.
In practical applications, the system may further include:
A to-be-processed data saving unit, configured to, after the data list containing the information of the suspected duplicate data is sent to the manual review client, save the data to be processed to the data platform when the suspected duplicate data differs from the data to be processed.
In practical applications, the data description information comparison unit 206 may specifically include:
A Hamming distance calculation subunit, configured to use the SimHash algorithm to compute the Hamming distance between the first data description information and the second data description information, and to determine the data description similarity between the two from that Hamming distance.
The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and identical or similar parts of the embodiments may be cross-referenced. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
Specific examples are used herein to explain the principle and implementation of the present invention. The description of the above embodiments is only intended to help in understanding the method of the present invention and its core idea. A person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementation and scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (8)

1. A data deduplication method, characterized by comprising:
Obtaining data to be processed that has been uploaded to a data platform;
Determining the metadata information of the data to be processed;
Comparing the metadata information of the data to be processed with the metadata information of the stored data of the data platform to obtain a metadata information similarity;
Obtaining first data description information of the data to be processed;
Obtaining second data description information of the stored data;
Comparing the first data description information with the second data description information to obtain a data description similarity;
Computing a weighted average of the metadata information similarity and the data description similarity to obtain a total similarity;
Sorting the stored data according to the total similarity;
Marking the first n items of the sorted stored data as suspected duplicate data.
2. The method according to claim 1, characterized in that, after the data to be processed is marked as suspected duplicate data, the method further comprises:
Sending a data list containing the information of the suspected duplicate data to a manual review client, so that the suspected duplicate data and the stored data can be reviewed manually; the data list consists of the information of the sorted stored data.
3. The method according to claim 1, characterized in that, after the data list containing the information of the suspected duplicate data is sent to the manual review client, the method further comprises:
When the suspected duplicate data differs from the data to be processed, saving the data to be processed to the data platform.
4. The method according to claim 1, characterized in that comparing the first data description information with the second data description information to obtain the data description similarity specifically comprises:
Using the SimHash algorithm to compute the Hamming distance between the first data description information and the second data description information, and determining the data description similarity between the two from that Hamming distance.
5. A data deduplication system, characterized by comprising:
A to-be-processed data acquisition unit, configured to obtain data to be processed that has been uploaded to a data platform;
A metadata information determination unit, configured to determine the metadata information of the data to be processed;
A metadata information comparison unit, configured to compare the metadata information of the data to be processed with the metadata information of the stored data of the data platform to obtain a metadata information similarity;
A first data description information acquisition unit, configured to obtain first data description information of the data to be processed;
A second data description information acquisition unit, configured to obtain second data description information of the stored data;
A data description information comparison unit, configured to compare the first data description information with the second data description information to obtain a data description similarity;
A total similarity calculation unit, configured to compute a weighted average of the metadata information similarity and the data description similarity to obtain a total similarity;
A sorting unit, configured to sort the stored data according to the total similarity;
A suspected duplicate data marking unit, configured to mark the first n items of the sorted stored data as suspected duplicate data.
6. The system according to claim 5, characterized by further comprising:
A suspected duplicate data sending unit, configured to, after the data to be processed is marked as suspected duplicate data, send a data list containing the information of the suspected duplicate data to a manual review client, so that the suspected duplicate data and the stored data can be reviewed manually; the data list consists of the information of the sorted stored data.
7. The system according to claim 5, characterized by further comprising:
A to-be-processed data saving unit, configured to, after the data list containing the information of the suspected duplicate data is sent to the manual review client, save the data to be processed to the data platform when the suspected duplicate data differs from the data to be processed.
8. The system according to claim 5, characterized in that the data description information comparison unit specifically comprises:
A Hamming distance calculation subunit, configured to use the SimHash algorithm to compute the Hamming distance between the first data description information and the second data description information, and to determine the data description similarity between the two from that Hamming distance.
CN201510266694.8A 2015-05-22 2015-05-22 Data deduplication method and data deduplication method Pending CN104809256A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510266694.8A CN104809256A (en) 2015-05-22 2015-05-22 Data deduplication method and data deduplication method

Publications (1)

Publication Number Publication Date
CN104809256A true CN104809256A (en) 2015-07-29

Family

ID=53694078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510266694.8A Pending CN104809256A (en) 2015-05-22 2015-05-22 Data deduplication method and data deduplication method

Country Status (1)

Country Link
CN (1) CN104809256A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8266115B1 (en) * 2011-01-14 2012-09-11 Google Inc. Identifying duplicate electronic content based on metadata
CN102622365A (en) * 2011-01-28 2012-08-01 北京百度网讯科技有限公司 Judging system and judging method for web page repeating
CN104160712A (en) * 2011-10-30 2014-11-19 谷歌公司 Computing similarity between media programs
CN104216925A (en) * 2013-06-05 2014-12-17 中国科学院声学研究所 Repetition deleting processing method for video content

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407933A (en) * 2016-09-21 2017-02-15 国网四川省电力公司电力科学研究院 Power system standardized data integration system
CN106844314A (en) * 2017-02-21 2017-06-13 北京焦点新干线信息技术有限公司 A kind of duplicate checking method and device of article
CN106844314B (en) * 2017-02-21 2019-10-18 北京焦点新干线信息技术有限公司 A kind of duplicate checking method and device of article
CN107944866A (en) * 2017-10-17 2018-04-20 厦门市美亚柏科信息股份有限公司 Transaction record rearrangement and computer-readable recording medium
CN110399363A (en) * 2019-06-25 2019-11-01 云南电网有限责任公司玉溪供电局 A kind of problem data Life cycle data quality management method and system
CN110399363B (en) * 2019-06-25 2023-02-28 云南电网有限责任公司玉溪供电局 Problem data full life cycle data quality management method and system
CN114445207A (en) * 2022-04-11 2022-05-06 广东企数标普科技有限公司 Tax administration system based on digital RMB

Similar Documents

Publication Publication Date Title
CN104809256A (en) Data deduplication method and data deduplication method
US8452106B2 (en) Partition min-hash for partial-duplicate image determination
US20230342403A1 (en) Method and system for document similarity analysis
US20160035044A1 (en) Account processing method and apparatus
CN108833458B (en) Application recommendation method, device, medium and equipment
CN109213738B (en) Cloud storage file-level repeated data deletion retrieval system and method
WO2018132414A1 (en) Data deduplication using multi-chunk predictive encoding
CN109033475A (en) A kind of file memory method, device, equipment and storage medium
WO2014078997A1 (en) Method and device for repairing data
CN109558397B (en) Data processing method, device, server and computer storage medium
CN103858164B (en) The method and corresponding equipment of automatic management image collection
CN108595517A (en) A kind of extensive document similarity detection method
CN107037978A (en) Data Migration bearing calibration and system
CN103902702A (en) Data storage system and data storage method
CN103049263B (en) Document classification method based on similarity
CN112231514B (en) Data deduplication method and device, storage medium and server
CN103699610A (en) Method for generating file verification information, file verifying method and file verifying equipment
CN110209714A (en) Report form generation method, device, computer equipment and computer readable storage medium
CN102184180A (en) Method and system for removing duplicated files
US20110289194A1 (en) Cloud data storage system
Du et al. Deduplicated disk image evidence acquisition and forensically-sound reconstruction
CN117216239A (en) Text deduplication method, text deduplication device, computer equipment and storage medium
CN117036115A (en) Contract data verification method, device and server
CN109857748B (en) Contract data processing method and device and electronic equipment
CN110019056A (en) Container separated from meta-data for cloud layer

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by SIPO to initiate substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication Application publication date: 20150729