CN106649646A

CN106649646A - Method and device for deleting duplicated data

Info

Publication number: CN106649646A
Application number: CN201611129751.9A
Authority: CN
Inventors: 孙健
Original assignee: Beijing Ruian Technology Co Ltd
Current assignee: Beijing Ruian Technology Co Ltd
Priority date: 2016-12-09
Filing date: 2016-12-09
Publication date: 2017-05-10

Abstract

An embodiment of the invention discloses a method and device for data deduplication. The method comprises: obtaining an MD5 value of pending data and a corresponding data identifier; forming a key-value pair for the pending data from the MD5 value and the data identifier; comparing the MD5 value in the key-value pair of the pending data with the MD5 values in the key-value pairs of existing data; and, if the MD5 value in the key-value pair of the pending data is the same as an MD5 value in a key-value pair of existing data, deleting the pending data and determining the data identifier of the existing data that the pending data duplicates. By forming the key-value pair from the MD5 value and the data identifier, comparing the MD5 value of the pending data with the MD5 values of existing data, and deleting pending data whose MD5 value matches that of existing data, the embodiment addresses the problem of duplicate data in massive data sets, removes duplicates before the data is stored, reduces hard disk usage, and lowers cost.

Description

Method and device for data deduplication

Technical field

Embodiments of the present invention relate to data processing technology, and in particular to a method and device for data deduplication.

Background

In the current era of big data, as informatization advances, letting the data speak has become the philosophy of many business operators. The volume of data that enterprises must process is growing rapidly. While big data brings convenience, it also places a burden on technical staff: massive data sets contain large amounts of duplicate data, which increases system load and degrades data loading and query performance. How to delete large amounts of duplicate junk data and reduce hard disk usage has become an urgent problem of the big data era.

Summary of the invention

The present invention provides a method and device for data deduplication, so as to deduplicate large-scale data and reduce hard disk usage.

In a first aspect, an embodiment of the present invention provides a data deduplication method, the method including:

Obtaining the MD5 value and the corresponding data identifier of the pending data;

Forming the key-value pair of the pending data from the MD5 value and the data identifier;

Comparing the MD5 value in the key-value pair of the pending data with the MD5 values in the key-value pairs of existing data;

If the MD5 value in the key-value pair of the pending data is the same as an MD5 value in a key-value pair of existing data, deleting the pending data and determining the data identifier of the existing data that the pending data duplicates.

Further, after the pending data is deleted, the method also includes:

Storing the key-value pair of the pending data and the key-value pair of the existing data that the pending data duplicates in a duplicate statistics database.

Further, the method also includes:

If the MD5 value in the key-value pair of the pending data differs from the MD5 values in the key-value pairs of existing data, storing the pending data and its data identifier in a database;

Saving the key-value pair of the pending data among the key-value pairs of existing data.

Further, obtaining the MD5 value and the corresponding data identifier of the pending data includes:

Reading the pending data by row;

Calculating the MD5 value of the pending data;

Generating the data identifier of the pending data according to the read time and/or the thread number used when reading the pending data.

Further, calculating the MD5 value of the pending data includes:

If the pending data contains preset ignored data, removing the preset ignored data from the pending data;

Calculating the MD5 value of the pending data after the preset ignored data has been removed, and using it as the MD5 value of the pending data.

In a second aspect, an embodiment of the present invention also provides a data deduplication device, the device including:

A data identifier acquisition module, configured to obtain the MD5 value and the corresponding data identifier of the pending data;

A key-value pair composition module, configured to form the key-value pair of the pending data from the MD5 value and the data identifier;

An MD5 value comparison module, configured to compare the MD5 value in the key-value pair of the pending data with the MD5 values in the key-value pairs of existing data;

A duplicate data determination module, configured to delete the pending data and determine the data identifier of the existing data that the pending data duplicates, if the MD5 value in the key-value pair of the pending data is the same as an MD5 value in a key-value pair of existing data.

Further, the device also includes a key-value pair saving module, configured to store, after the pending data is deleted, the key-value pair of the pending data and the key-value pair of the existing data that the pending data duplicates in a duplicate statistics database.

Further, the device also includes a data storage module, specifically configured to:

Further, the data identifier acquisition module includes:

A data reading unit, configured to read the pending data by row;

An MD5 value calculation unit, configured to calculate the MD5 value of the pending data;

A data identifier generation unit, configured to generate the data identifier of the pending data according to the read time and/or the thread number used when reading the pending data.

Further, the MD5 value calculation unit is specifically configured to:

In the technical solution of this embodiment, the MD5 value and the data identifier of the pending data are combined into a key-value pair, the MD5 value in the key-value pair of the pending data is compared with the MD5 values in the key-value pairs of existing data, and pending data whose MD5 value matches that of existing data is deleted. This addresses the problem of duplicate data in massive data, deduplicates the data before it is loaded into the database, reduces hard disk usage, and lowers cost.

Description of the drawings

Fig. 1 is a flowchart of a data deduplication method provided by embodiment one of the present invention;

Fig. 2 is a flowchart of a data deduplication method provided by embodiment two of the present invention;

Fig. 3 is an overall framework diagram of the data processing system in a data deduplication method provided by an embodiment of the present invention;

Fig. 4 is a flowchart of a data deduplication method provided by embodiment three of the present invention;

Fig. 5 is a flowchart of a data deduplication method provided by embodiment four of the present invention;

Fig. 6 is a schematic diagram of a data deduplication device provided by embodiment five of the present invention.

Specific embodiment

The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.

Embodiment one

Fig. 1 is a flowchart of a data deduplication method provided by embodiment one of the present invention. This embodiment is applicable to effectively deduplicating massive data. The method can be performed by a data deduplication device and specifically includes the following steps:

S110: obtain the MD5 value and the corresponding data identifier of the pending data.

MD5 (Message-Digest Algorithm 5) is used to ensure that information is transmitted completely and consistently. It is one of the hash algorithms widely used in computing and is described as having compressibility, ease of computation, resistance to modification, and strong collision resistance. The pending data may be of text type; it can be read row by row, column by column, or in other ways, and the MD5 value of the data that was read is then calculated. The data identifier serves as the mark of each data record and is used to distinguish records from one another.
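As a minimal sketch of the MD5 calculation in S110, the following uses Python's standard `hashlib` module. The function name `md5_of_row` and the choice of UTF-8 encoding are assumptions made for illustration; the specification does not prescribe either.

```python
import hashlib


def md5_of_row(row: str) -> str:
    """Return the hexadecimal MD5 digest of one row of text data.

    Assumption: rows are text and are encoded as UTF-8 before hashing.
    """
    return hashlib.md5(row.encode("utf-8")).hexdigest()


# Two rows with identical content always produce the same digest,
# which is what makes the digest usable as a duplicate-detection key.
first = md5_of_row("alice,login,ok")
second = md5_of_row("alice,login,ok")
other = md5_of_row("bob,login,ok")
print(first == second, first == other)
```

The digest is a fixed-length 32-character hexadecimal string regardless of row length, so comparing digests is cheaper than comparing full rows.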

S120: form the key-value pair of the pending data from the MD5 value and the data identifier.

The MD5 values can be stored on a Redis cluster. The benefit of a Redis cluster is that data in a Redis database is generally stored as key-value pairs, which allows efficient comparison operations. The MD5 value and the data identifier can be combined into a key-value pair and stored in the Redis cluster database. Duplicate data is commonly handled on the Hadoop Distributed File System (HDFS), the distributed file system of the Hadoop distributed system infrastructure, which can effectively store massive data while guarding against single points of failure and avoiding unnecessary loss. However, when deduplication is performed on HDFS, the data must first be stored on hard disk, so duplicate data is loaded into storage, which wastes disk resources, increases hardware cost, and consumes a great deal of time. Performing deduplication in the Redis database makes it possible to delete large amounts of duplicate junk data before the data is loaded, reducing hard disk usage and cost.
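The key-value pair of S120 might be sketched as follows. To keep the example self-contained, a plain Python dictionary stands in for the Redis cluster; this is an assumption for illustration only, and a real deployment would write the pair through a Redis client instead. The names `make_key_value_pair` and `existing_pairs` and the sample identifier are hypothetical.

```python
import hashlib
from typing import Dict, Tuple


def make_key_value_pair(row: str, data_id: str) -> Tuple[str, str]:
    """Build the (MD5 value, data identifier) pair for one pending row."""
    md5_value = hashlib.md5(row.encode("utf-8")).hexdigest()
    return md5_value, data_id


# In-memory stand-in for the key-value store (assumption: not real Redis).
# The MD5 value is the key, so a lookup by digest is a single operation.
existing_pairs: Dict[str, str] = {}

key, value = make_key_value_pair("alice,login,ok", "id-0001")
existing_pairs[key] = value
```

Using the MD5 value as the key rather than the identifier is the design choice that makes the later comparison a direct key lookup.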

S130: compare the MD5 value in the key-value pair of the pending data with the MD5 values in the key-value pairs of existing data.

Here, the existing data may be data that has already been stored.

The MD5 value in the key-value pair of the pending data is compared with the MD5 values in the key-value pairs of existing data to determine whether they are the same. Duplicate junk data is deleted through these repeated comparison operations.

S140: if the MD5 value in the key-value pair of the pending data is the same as an MD5 value in a key-value pair of existing data, delete the pending data and determine the data identifier of the existing data that the pending data duplicates.

If the MD5 value in the key-value pair of the pending data is the same as an MD5 value in a key-value pair of existing data, the two records can be regarded as duplicates of each other. The pending data is therefore deleted, and the data identifier of the existing data that it duplicates is determined, so that the duplicated existing record can be identified.
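The comparison and deletion in S130 and S140 can be sketched as below, again with a dictionary standing in for the store of existing key-value pairs. The function name `drop_duplicates` and the tuple layout are assumptions. This sketch covers only the comparison against existing data; it deliberately does not add new pairs to the store.

```python
import hashlib
from typing import Dict, List, Tuple


def drop_duplicates(
    pending: List[Tuple[str, str]],      # (data identifier, row) pairs
    existing_pairs: Dict[str, str],      # MD5 value -> identifier of existing data
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Split pending rows into kept rows and detected duplicates.

    Returns (kept, duplicates), where each duplicate entry is
    (pending identifier, identifier of the existing data it repeats).
    """
    kept: List[Tuple[str, str]] = []
    duplicates: List[Tuple[str, str]] = []
    for data_id, row in pending:
        md5_value = hashlib.md5(row.encode("utf-8")).hexdigest()
        existing_id = existing_pairs.get(md5_value)        # S130: compare
        if existing_id is not None:
            duplicates.append((data_id, existing_id))      # S140: drop the row
        else:
            kept.append((data_id, row))
    return kept, duplicates
```

Returning the identifier of the matching existing record alongside the dropped one mirrors the requirement to determine which existing data the pending data repeats.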

Steps S110, S120, S130 and S140 above can be performed by a single hardware device or by different hardware devices respectively; the specific device that performs them is not limited here.

On the basis of the above technical solution, after the pending data is deleted, the method preferably further includes:

Storing the key-value pair of the pending data and the key-value pair of the existing data that the pending data duplicates in a duplicate statistics database. The saved information can be used to calculate the amount of duplication among data with the same MD5 value, and this amount can serve as a reference factor when considering business requirements.
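One way to picture the duplicate statistics database is as a log of matched pairs from which per-MD5 duplication counts are derived. The record layout and the names `record_duplicate` and `duplication_counts` below are hypothetical; the specification only states that the two key-value pairs are saved and that a duplication amount can be computed from them.

```python
from collections import Counter
from typing import List, Tuple

# Each record: (MD5 value, pending identifier, existing identifier).
DuplicateRecord = Tuple[str, str, str]


def record_duplicate(
    stats: List[DuplicateRecord], md5_value: str, pending_id: str, existing_id: str
) -> None:
    """Save the key-value pairs of a dropped row and the row it repeats."""
    stats.append((md5_value, pending_id, existing_id))


def duplication_counts(stats: List[DuplicateRecord]) -> Counter:
    """Count how many dropped rows were recorded for each MD5 value."""
    return Counter(md5_value for md5_value, _, _ in stats)
```

A count per MD5 value shows which records recur most often, which is the kind of figure that could feed a business decision.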

On the basis of the above embodiment, the method preferably further includes:

If the MD5 value in the key-value pair of the pending data differs from the MD5 values in the key-value pairs of existing data, it can be confirmed that the pending data does not duplicate any existing data, so the pending data and its data identifier are stored in the database. The key-value pair of the pending data is saved at the same time, as a reference for later comparisons. Storing the non-duplicate data ensures the integrity of the data.
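The non-duplicate branch can be sketched as a single function that both stores the row and registers its key-value pair for later comparisons. The list `database` and the dictionary `existing_pairs` are in-memory stand-ins (assumptions) for the target database and the key-value store; `store_if_new` is a hypothetical name.

```python
import hashlib
from typing import Dict, List, Tuple


def store_if_new(
    row: str,
    data_id: str,
    existing_pairs: Dict[str, str],      # MD5 value -> data identifier
    database: List[Tuple[str, str]],     # stored (data identifier, row) records
) -> bool:
    """Store the row if its MD5 value is unseen; return True when stored."""
    md5_value = hashlib.md5(row.encode("utf-8")).hexdigest()
    if md5_value in existing_pairs:
        return False                      # duplicate: nothing is stored
    database.append((data_id, row))       # store the data and its identifier
    existing_pairs[md5_value] = data_id   # keep the pair for later comparisons
    return True
```

Because the pair is registered in the same step as the row is stored, a later identical row is recognized as a duplicate of this one.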

Embodiment two

Fig. 2 is a flowchart of a data deduplication method provided by embodiment two of the present invention. This embodiment further optimizes the above embodiment by refining "obtaining the MD5 value and the corresponding data identifier of the pending data" into "reading the pending data by row; calculating the MD5 value of the pending data; and generating the data identifier of the pending data according to the read time and/or the thread number used when reading the pending data." The method specifically includes the following steps:

S210: read the pending data by row.

Before preprocessing comes the data access stage, in which a transfer tool carries the data to the preprocessing servers, where it waits to be processed. On a preprocessing server, a program reads the data row by row. The transfer programs in common use move data over the network, typically by Transmission Control Protocol (TCP) communication or File Transfer Protocol (FTP) transfer. Using a preprocessing program gives real-time access to the data and improves processing efficiency. Fig. 3 is an overall framework diagram of the data processing system in a data deduplication method provided by an embodiment of the present invention. As shown in Fig. 3, the data transfer server cluster 310 consists of server 1 to server N and uses a transfer program to carry data to the preprocessing server cluster 320. The preprocessing server cluster consists of server 1' to server N'; the preprocessing program can be installed on one or more servers, and the servers on which it is installed form the preprocessing server cluster 320, which calculates the MD5 values of the data and compares them with the MD5 values of existing data on the Redis server cluster 330. The Redis server cluster 330 in this embodiment consists of server 1'' to server N'' connected by high-speed communication links. Data that is not duplicated is stored in database 340, which may be an HBase or Oracle database.

S220: calculate the MD5 value of the pending data.

The MD5 value is calculated in turn for each row of pending data that is read.
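Steps S210 and S220 can be sketched as a generator that reads a text stream row by row and yields each row with its MD5 value. The function name `rows_with_md5`, the stripping of line endings, and the skipping of blank lines are assumptions for illustration; an `io.StringIO` object stands in for the transferred file.

```python
import hashlib
import io
from typing import Iterator, TextIO, Tuple


def rows_with_md5(stream: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (row, MD5 value) for each non-empty row of a text stream."""
    for line in stream:
        row = line.rstrip("\r\n")     # assumption: line endings are not data
        if not row:
            continue                  # assumption: blank lines are skipped
        yield row, hashlib.md5(row.encode("utf-8")).hexdigest()


# Example: three rows, the first and third identical.
sample = io.StringIO("alice,login\nbob,login\nalice,login\n")
results = list(rows_with_md5(sample))
```

Reading lazily, one row at a time, keeps memory use independent of file size, which matters when the input is large.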

S230: generate the data identifier of the pending data according to the read time and/or the thread number used when reading the pending data.

A unique data identifier, i.e. a data ID, is generated for the pending data. The ID can be composed of at least one of the read time of the pending data and the thread number used when reading it. When the pending data is read by a preprocessing cluster, the ID can also include the device number of the preprocessing server within the cluster. Because there may be multiple preprocessing servers, each server is assigned a unique number, i.e. the preprocessing server device number, so that the servers can be told apart.
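A data identifier built from the server device number, the thread number, and the read time might look like the following sketch. The format string, the use of nanosecond timestamps, and the name `make_data_id` are assumptions; the specification names the ingredients but not their layout. Note that this simple form is unique only if one thread never reads two rows within the same timestamp tick.

```python
import threading
import time
from typing import Optional


def make_data_id(
    server_no: int,
    read_time_ns: Optional[int] = None,
    thread_no: Optional[int] = None,
) -> str:
    """Compose an identifier from server number, thread number and read time.

    When read_time_ns or thread_no is omitted, the current time and the
    current thread's identifier are used.
    """
    if read_time_ns is None:
        read_time_ns = time.time_ns()
    if thread_no is None:
        thread_no = threading.get_ident()
    return f"{server_no}-{thread_no}-{read_time_ns}"
```

Including the server device number keeps identifiers from different preprocessing servers distinct even when their thread numbers and clocks coincide.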

S240: form the key-value pair of the pending data from the MD5 value and the data identifier;

S250: compare the MD5 value in the key-value pair of the pending data with the MD5 values in the key-value pairs of existing data;

S260: if the MD5 value in the key-value pair of the pending data is the same as an MD5 value in a key-value pair of existing data, delete the pending data and determine the data identifier of the existing data that the pending data duplicates.

In the technical solution of this embodiment, the pending data is read by row, the MD5 value of each row is calculated, a data identifier is generated, the MD5 value of the row is compared with the MD5 values of existing data, and the pending data is deleted when they are the same. This addresses the problem of duplicate data in massive data, deduplicates the data before it is loaded into the database, reduces hard disk usage, and lowers cost. Reading the pending data row by row, checking each row for duplication, and deleting the rows that are duplicated makes the deduplication operation more fine-grained, and reading by row also makes data processing faster.

Embodiment three

Fig. 4 is a flowchart of a data deduplication method provided by embodiment three of the present invention. This embodiment further optimizes the above embodiment by refining "calculating the MD5 value of the pending data" into "if the pending data contains preset ignored data, removing the preset ignored data from the pending data; and calculating the MD5 value of the pending data after the preset ignored data has been removed, and using it as the MD5 value of the pending data." The method specifically includes the following steps:

S410: read the pending data by row.

S420: if the pending data contains preset ignored data, remove the preset ignored data from the pending data.

Before the pending data is read, some data content can be designated as preset ignored data according to actual needs, for example port numbers or unnecessary time information in the data. Such content is generated randomly by the operating system and has no value in itself. Pending records that differ only in their preset ignored data, with the remaining content identical, can therefore be treated as the same record: one of the records is loaded into the database and the others are deleted, which achieves deduplication. Removing the preset ignored data before calculating the MD5 value improves the deduplication effect, saves part of the workload, and shortens deduplication time.

S430: calculate the MD5 value of the pending data after the preset ignored data has been removed, and use it as the MD5 value of the pending data.

The MD5 value of the pending data is calculated only after the preset ignored data has been removed.
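Steps S420 and S430 can be sketched with regular expressions that strip the preset ignored data before hashing. The two patterns below (a `port=<digits>` field and a `YYYY-MM-DD hh:mm:ss` timestamp) are hypothetical examples of ignored content, as is the row format; the specification mentions port numbers and time information without defining how they appear in the data.

```python
import hashlib
import re

# Hypothetical patterns for preset ignored data (assumed row format).
IGNORE_PATTERNS = [
    re.compile(r"\bport=\d+\b"),
    re.compile(r"\b\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\b"),
]


def strip_ignored(row: str) -> str:
    """Remove every occurrence of the preset ignored data from a row."""
    for pattern in IGNORE_PATTERNS:
        row = pattern.sub("", row)
    return row


def md5_ignoring_preset(row: str) -> str:
    """Return the MD5 value of the row after ignored data is removed."""
    return hashlib.md5(strip_ignored(row).encode("utf-8")).hexdigest()


# Rows that differ only in timestamp and port hash to the same value.
row_a = "2016-12-09 10:00:00 user=alice port=5123 action=login"
row_b = "2016-12-09 10:00:05 user=alice port=6234 action=login"
row_c = "2016-12-09 10:00:05 user=alice port=6234 action=logout"
print(md5_ignoring_preset(row_a) == md5_ignoring_preset(row_b))
```

Only the digest is computed on the stripped text; whether the stored record keeps the original row or the stripped one is a separate decision that this sketch leaves open.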

S440: generate the data identifier of the pending data according to the read time and/or the thread number used when reading the pending data.

S450: form the key-value pair of the pending data from the MD5 value and the data identifier;

S460: compare the MD5 value in the key-value pair of the pending data with the MD5 values in the key-value pairs of existing data;

S470: if the MD5 value in the key-value pair of the pending data is the same as an MD5 value in a key-value pair of existing data, delete the pending data and determine the data identifier of the existing data that the pending data duplicates.

In the technical solution of this embodiment, a step that removes the preset ignored data is added before the MD5 value of the data is calculated. This removes the worthless portions randomly generated by the system, saving workload and deduplication time.

Embodiment four

Fig. 5 is a flowchart of a data deduplication method provided by embodiment four of the present invention. This embodiment is a preferred embodiment built on the above embodiments, and the method specifically includes the following steps:

S510: the data transfer server cluster carries the pending data to the preprocessing servers.

S520: the preprocessing server cluster calculates the MD5 value of the pending data, obtains the data identifier of the pending data, forms the key-value pair, and sends the key-value pair to the Redis cluster.

S530: the Redis cluster compares the MD5 value in the key-value pair with the MD5 values in the key-value pairs already stored in the Redis database and determines whether they are the same. If they are the same, S540 is performed; otherwise S550 is performed.

S540: delete the duplicate data.

If the MD5 value of the pending data is the same as an MD5 value stored in the Redis database, the pending data duplicates existing data; the pending data is then confirmed as duplicate data and deleted.

S550: determine that the pending data is unique data.

If the MD5 value of the pending data differs from the MD5 values stored in the Redis database, the current pending data can be regarded as free of duplication, and the pending data is determined to be unique data.

S560: load the pending data into a data storage device such as HBase or Oracle, and store the key-value pair of the pending data in the Redis database.

In the technical solution of this embodiment, multiple clusters cooperate: the MD5 value of the pending data is compared with the MD5 values stored in the Redis database to determine whether the data is duplicated, and duplicate data is deleted, which achieves effective deduplication of massive data and improves deduplication efficiency.
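The flow S520 to S560 can be condensed into one function for illustration. Everything here runs in a single process: the dictionary `redis_pairs` stands in for the Redis database and the list `database` for HBase or Oracle, both assumptions made so the sketch is runnable. The identifier built from a server number and a row index is a simplification chosen for readability, not the read-time and thread-number scheme described earlier.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def run_pipeline(
    rows: Iterable[str],
    server_no: int,
    redis_pairs: Dict[str, str],         # stand-in for the Redis database
    database: List[Tuple[str, str]],     # stand-in for HBase or Oracle
) -> List[Tuple[str, str]]:
    """Deduplicate rows before loading; return (dropped id, existing id) pairs."""
    duplicates: List[Tuple[str, str]] = []
    for index, row in enumerate(rows):
        # S520: compute the MD5 value and build the key-value pair.
        md5_value = hashlib.md5(row.encode("utf-8")).hexdigest()
        data_id = f"{server_no}-{index}"          # simplified identifier
        # S530: compare against the pairs already stored.
        existing_id = redis_pairs.get(md5_value)
        if existing_id is not None:
            # S540: duplicate, so the row is dropped and never loaded.
            duplicates.append((data_id, existing_id))
            continue
        # S550 and S560: unique row, load it and remember its pair.
        database.append((data_id, row))
        redis_pairs[md5_value] = data_id
    return duplicates
```

A short usage example: calling `run_pipeline(["a", "b", "a"], 1, {}, [])` loads two rows and reports the third as a duplicate of the first.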

Embodiment five

Fig. 6 is a schematic diagram of a data deduplication device provided by embodiment five of the present invention. The device includes:

A data identifier acquisition module 610, configured to obtain the MD5 value and the corresponding data identifier of the pending data;

A key-value pair composition module 620, configured to form the key-value pair of the pending data from the MD5 value and the data identifier;

An MD5 value comparison module 630, configured to compare the MD5 value in the key-value pair of the pending data with the MD5 values in the key-value pairs of existing data;

A duplicate data determination module 640, configured to delete the pending data and determine the data identifier of the existing data that the pending data duplicates, if the MD5 value in the key-value pair of the pending data is the same as an MD5 value in a key-value pair of existing data.

Further, the device also includes:

A key-value pair saving module 620, configured to store, after the pending data is deleted, the key-value pair of the pending data and the key-value pair of the existing data that the pending data duplicates in a duplicate statistics database.

Further, the device also includes a data storage module, specifically configured to:

Further, the data identifier acquisition module 610 includes:

A data reading unit, configured to read the pending data by row;

Further, the MD5 value calculation unit is specifically configured to:

The data deduplication device described above can perform the data deduplication method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to performing that method.

Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them; it may include other equivalent embodiments without departing from the inventive concept, and its scope is determined by the scope of the appended claims.

Claims

1. A data deduplication method, characterized by comprising:

Obtaining the Message-Digest Algorithm 5 (MD5) value and the corresponding data identifier of the pending data;

If the MD5 value in the key-value pair of the pending data is the same as an MD5 value in a key-value pair of existing data, deleting the pending data and determining the data identifier of the existing data that the pending data duplicates.

2. The method according to claim 1, characterized in that, after the pending data is deleted, the method further comprises:

Storing the key-value pair of the pending data and the key-value pair of the existing data that the pending data duplicates in a duplicate statistics database.

3. The method according to claim 1, characterized by further comprising:

If the MD5 value in the key-value pair of the pending data differs from the MD5 values in the key-value pairs of existing data, storing the pending data and its data identifier in a database;

4. The method according to any one of claims 1-3, characterized in that obtaining the MD5 value and the corresponding data identifier of the pending data comprises:

Reading the pending data by row;

Calculating the MD5 value of the pending data;

Generating the data identifier of the pending data according to the read time and/or the thread number used when reading the pending data.

5. The method according to claim 4, characterized in that calculating the MD5 value of the pending data comprises:

If the pending data contains preset ignored data, removing the preset ignored data from the pending data;

Calculating the MD5 value of the pending data after the preset ignored data has been removed, and using it as the MD5 value of the pending data.

6. A data deduplication device, characterized by comprising:

A key-value pair composition module, configured to form the key-value pair of the pending data from the MD5 value and the data identifier;

An MD5 value comparison module, configured to compare the MD5 value in the key-value pair of the pending data with the MD5 values in the key-value pairs of existing data;

A duplicate data determination module, configured to delete the pending data and determine the data identifier of the existing data that the pending data duplicates, if the MD5 value in the key-value pair of the pending data is the same as an MD5 value in a key-value pair of existing data.

7. The device according to claim 6, characterized by further comprising:

A key-value pair saving module, configured to store, after the pending data is deleted, the key-value pair of the pending data and the key-value pair of the existing data that the pending data duplicates in a duplicate statistics database.

8. The device according to claim 6, characterized by further comprising a data storage module, specifically configured to:

9. The device according to any one of claims 6-8, characterized in that the data identifier acquisition module comprises:

A data reading unit, configured to read the pending data by row;

A data identifier generation unit, configured to generate the data identifier of the pending data according to the read time and/or the thread number used when reading the pending data.

10. The device according to claim 9, characterized in that the MD5 value calculation unit is specifically configured to: