CN108241639B

CN108241639B - A data deduplication method

Info

Publication number: CN108241639B
Application number: CN201611207408.1A
Authority: CN
Inventors: 王焰辉; 李振钊; 曾刚
Original assignee: Zhongke Star Map Co Ltd
Current assignee: Zhongke Star Map Co Ltd
Priority date: 2016-12-23
Filing date: 2016-12-23
Publication date: 2019-07-23
Anticipated expiration: 2036-12-23
Also published as: CN108241639A

Abstract

The invention discloses a data deduplication method comprising: classifying data blocks by their last byte and assigning to each class of data block a database server that processes and stores it; setting, at an interface server, a minimum data block length, and, for a data file to be deduplicated, sending the file directly to the database server corresponding to it if the file is shorter than the minimum length, and otherwise partitioning the file using different ending bytes; among the six partitioning modes that yield the most blocks, selecting, at the interface server, the two partitioning modes with the largest amount of duplicate data and instructing the corresponding database servers to store the blocks; and, at each database server, storing only a pointer to the already stored identical block for a duplicate data block, and storing the entire data block together with its hash value for a non-duplicate data block.

Description

A data deduplication method

[Technical field]

The invention belongs to the field of computers and databases and, in particular, relates to a data deduplication method.

[Background]

In recent years, the concept of big data has emerged as a way to handle large volumes of information. Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within an acceptable time. It is a massive, fast-growing, and diverse information asset that requires new processing models to deliver stronger decision-making power, insight, and process-optimization capability.

Because of the sheer volume of the data, people cannot analyze it unaided. Against the backdrop of technological innovation represented by cloud computing, however, data that was once difficult to collect and use has become easier to exploit, and through continuous innovation across industries big data is gradually creating more value.

Although more and more computers are used for big data analysis and their performance keeps improving, they still struggle with massive data. The first step of big data analysis is therefore to detect and eliminate duplicate data. Deduplication reduces the storage space and network bandwidth consumed on the one hand, and the amount of data to be analyzed on the other.

A common deduplication method in the prior art detects duplicate data by comparing the hash values of entire data files. This detection method is too coarse, and its detection rate is low.

[Summary of the invention]

To solve the above problem in the prior art, the invention proposes a new data deduplication method. The technical solution is as follows:

A data deduplication method, comprising the following steps:

Step 100: classify data blocks by their last byte, and assign to each class of data block a database server that processes and stores it;

Step 200: the interface server sets a minimum data block length. A data file to be deduplicated that is shorter than the minimum length is sent directly to the database server corresponding to it; otherwise the data file is partitioned using different ending bytes, on the following principle: except for the last block, each block is at least the minimum length long, and all such blocks end with the same byte;

Step 300: among the six partitioning modes with the most blocks, the interface server selects the two with the largest amount of duplicate data and instructs the corresponding database servers to store the blocks;

Step 400: for a duplicate data block, the database server stores only a pointer to the already stored identical data block; for a non-duplicate data block, it stores the entire data block and its hash value.

Further, the database server determines whether a data block is a duplicate based on the hash value of the data block.

Further, the hash value is computed using the MD5 algorithm.

Further, the hash value is computed using the SHA-1 algorithm.

Further, the hash value is computed using the SHA-256 algorithm.

The technical effect of the invention is that the detection rate of duplicate data is improved, which reduces both the amount of data to be analyzed in big data analysis and the storage space occupied.

[Brief description of the drawings]

The drawings described here are provided for further understanding of the invention and form part of this application, but they do not unduly limit the invention. In the drawings:

Fig. 1 is the basic flow chart of the method of the invention.

[Specific embodiments]

The invention is described in detail below with reference to the drawing and specific embodiments. The illustrative examples and descriptions serve only to explain the invention and do not unduly limit it.

The system to which the method is applied comprises an interface server and multiple database servers. The interface server manages the storage of data files, and the database servers actually store the data. To store massive data, the preferred embodiment of the invention uses 256 database servers, which suits a large data storage system. A small business that wants to reduce cost may merge several of these servers into one to reduce the number of database servers.

On the basis of the above system structure, the basic steps of the method are as follows:

Based on the above basic steps, the specific steps of the method are as follows:

(1) The interface server receives a data file to be stored.

The interface server acts as the interface between the whole system and the outside world: it receives data files sent from outside and stores them in the database servers using the subsequent steps. A typical example is a Web server on the Internet acting as the interface server, receiving and storing data files uploaded by users. There may also be multiple interface servers; the invention places no limit on their number.

(2) The interface server checks the length L of the data file. If L is less than the predefined minimum data block length MinBlockLength, it extracts the last byte B of the data file and goes to step 3. If L >= MinBlockLength, it goes to step 5.

All lengths above are in bytes. Since a byte has 8 bits, it necessarily holds that 0≤B≤255. The minimum data block length is the minimum block length used when the invention partitions a file; its value can be set by the administrator as circumstances require. In one preferred case, MinBlockLength = 1024 bytes.

(3) The 256 database servers in the system are numbered in advance and named Server (i), where 0≤i≤255. The interface server sends the data file to Server (B) and saves the relevant information about the data file.

The invention classifies data blocks by their last byte. Since a byte can take 256 values, data blocks are divided into 256 classes, each assigned to one of the 256 database servers; the number of each database server equals the class (i.e., the value of the ending byte) of the data blocks it is responsible for.

Using 256 database servers is the preferred embodiment; it is costly to implement and suits large-scale data storage systems. To reduce cost, database servers can be shared: several data block classes share one database server, which then carries several numbers. This does not affect how the method is carried out.
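The classification rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patent's implementation: the function name `server_for_block` and the `num_servers` parameter are hypothetical, and folding the 256 classes onto fewer servers with a modulo is just one possible way to let several classes share a server.

```python
def server_for_block(block: bytes, num_servers: int = 256) -> int:
    """Return the number of the database server responsible for `block`.

    The class of a block is the value of its last byte (0..255). With fewer
    than 256 servers, several classes share one server; the modulo mapping
    used here is an assumption made for illustration.
    """
    if not block:
        raise ValueError("an empty block has no ending byte")
    if not 1 <= num_servers <= 256:
        raise ValueError("num_servers must be between 1 and 256")
    return block[-1] % num_servers


print(server_for_block(b"report\x2a"))                  # class 42 -> Server (42)
print(server_for_block(b"report\x2a", num_servers=16))  # 42 % 16 -> server 10
```

With the default of 256 servers the mapping is the identity on the ending byte, which matches the numbering Server (i) described above.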

The relevant information about the data file includes its name, its size, the number of the database server, and so on, so that the interface server can later look up the data file.

(4) Server (B) computes the hash value of the data file and uses it to determine whether the data file is already stored on the server. If it is, the data file is duplicate data, and only a pointer to the stored data is kept for it. If it is not, Server (B) stores the data file and its hash value. The method then ends.

Because the database server keeps the hash value of every data file or data block it stores, comparing the hash value of the data file with the existing hash values is enough to determine whether the data file is duplicate data; if it is, the entire data file need not be stored again.
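The hash comparison on a database server might look like the following sketch. It is a hypothetical in-memory stand-in, assuming SHA-256 (one of the algorithms the text permits) and a Python dictionary in place of a real database; the class name `BlockStore` and its methods are invented for illustration.

```python
import hashlib


class BlockStore:
    """In-memory stand-in for one database server (hypothetical)."""

    def __init__(self) -> None:
        self._blocks = {}    # hash value -> stored block content
        self._pointers = []  # one hash value per duplicate that was received

    def is_duplicate(self, block: bytes) -> bool:
        """Report whether an identical block is already stored."""
        return hashlib.sha256(block).hexdigest() in self._blocks

    def put(self, block: bytes) -> tuple[str, bool]:
        """Store `block`; return (hash value, whether it was a duplicate)."""
        digest = hashlib.sha256(block).hexdigest()
        if digest in self._blocks:
            # Duplicate: keep only a pointer (the hash) to the stored block.
            self._pointers.append(digest)
            return digest, True
        # New content: keep the whole block together with its hash value.
        self._blocks[digest] = block
        return digest, False

    def get(self, digest: str) -> bytes:
        return self._blocks[digest]

    def stored_bytes(self) -> int:
        return sum(len(b) for b in self._blocks.values())


store = BlockStore()
print(store.put(b"same content")[1])  # False: first copy is stored in full
print(store.put(b"same content")[1])  # True: second copy becomes a pointer
```

Separating `is_duplicate` from `put` mirrors the later steps, in which a server first reports whether a block is a duplicate and stores it only when instructed.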

The hash algorithm used by the invention can be any hash algorithm known in the art, including but not limited to MD4, MD5, SHA-1, and SHA-256.

(5) The interface server prepares to partition the data file, first setting the initial partitioning vector V = 0.

(6) The interface server scans forward (toward the end of the file) starting from the MinBlockLength-th byte of the data file. When it reaches a byte equal to the partitioning vector V, it records the position P₁ of that byte, then resumes scanning from position P₁+MinBlockLength to find and record the next byte equal to the partitioning vector, and so on until the end of the data file. In other words, each scan starts at a distance of MinBlockLength from the previously recorded position, until the end of the data file is reached.

(7) Using each position recorded in step 6 as the end of a data block, the data file is partitioned into one or more data blocks; let the number of data blocks obtained be K_V.

The data blocks obtained in step 7 fall into two classes. The first class consists of data blocks ending with the partitioning vector V; the second class is the last data block of the data file, i.e., the data block ending with B. The number of data blocks of the second class can only be 1 or 0, and the number of data blocks of the first class may also be 0, depending on the content of the data file.
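Steps 6 and 7 for a single value of V can be sketched as follows. The function name `split_on_byte` and its parameter names are hypothetical, and the translation of the text's 1-based positions into Python's 0-based indices (the MinBlockLength-th byte is index `min_len - 1`) is an interpretation rather than something the text spells out.

```python
def split_on_byte(data: bytes, v: int, min_len: int) -> list[bytes]:
    """Split `data` into blocks that end with byte value `v`.

    Every block except possibly the last is at least `min_len` bytes long
    and ends with `v`; the last block is whatever remains of the file.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    target = bytes([v])
    blocks = []
    begin = 0                                 # start of the current block
    pos = data.find(target, min_len - 1)      # first allowed block end
    while pos != -1:
        blocks.append(data[begin:pos + 1])    # block ends at the match
        begin = pos + 1
        # Resume scanning min_len bytes after the recorded position.
        pos = data.find(target, begin + min_len - 1)
    if begin < len(data):
        blocks.append(data[begin:])           # trailing block, if any
    return blocks


print(split_on_byte(b"abcXdefXgXhij", ord("X"), 3))
# [b'abcX', b'defX', b'gXhij']
```

In the example the third `X` is skipped because it lies fewer than three bytes after the previous block end, so it stays inside the trailing block.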

(8) The partitioning vector V is incremented by 1. If V≤255, return to step 6; otherwise continue to step 9.

Steps 6 to 8 form a loop that scans the file once for each partitioning vector value from 0 to 255, yielding a number of different partitionings of the data into blocks, with block counts K₀ to K₂₅₅. The method is written as a loop only for ease of description. In practice, as those skilled in the art will appreciate, a single scan of the data file can complete all 256 iterations, improving execution efficiency.
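One possible way to obtain all 256 block counts in a single scan is to keep, for each byte value, the earliest index at which it may next end a block. The sketch below is an assumption about how such a single pass could be organized, since the text does not describe it; the function name `block_counts` and the 0-based index convention are hypothetical.

```python
def block_counts(data: bytes, min_len: int) -> list[int]:
    """Return [K_0, ..., K_255]: the number of blocks for each value of V."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    next_ok = [min_len - 1] * 256   # earliest index where V may end a block
    cuts = [0] * 256                # blocks ending with V found so far
    tail_start = [0] * 256          # index just past the last block end
    for i, b in enumerate(data):
        if i >= next_ok[b]:
            cuts[b] += 1
            next_ok[b] = i + min_len
            tail_start[b] = i + 1
    # Add one for the trailing block whenever bytes remain after the last cut.
    return [cuts[v] + (1 if tail_start[v] < len(data) else 0)
            for v in range(256)]


counts = block_counts(b"abcXdefXgXhij", 3)
print(counts[ord("X")])  # 3
```

Each byte of the file is examined once, and only the state belonging to that byte's own value is updated, so the cost is one pass plus three arrays of 256 entries.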

(9) Sort K₀ to K₂₅₅ in descending order (among equal K_V, the one with the larger subscript comes first) and take the first (i.e., largest) 6 values, denoted K_V1, K_V2, K_V3, K_V4, K_V5, K_V6.
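The ordering rule of step 9, including its tie-break, can be expressed with a compound sort key. This is a minimal sketch; the function name `top_vectors` and the parameter `n` are hypothetical.

```python
def top_vectors(counts: list[int], n: int = 6) -> list[int]:
    """Return the n values of V with the largest K_V.

    Sorting on the pair (K_V, V) in descending order puts larger counts
    first and, among equal counts, the larger subscript first.
    """
    order = sorted(range(len(counts)),
                   key=lambda v: (counts[v], v),
                   reverse=True)
    return order[:n]


counts = [0] * 256
counts[3], counts[10], counts[200] = 7, 5, 5
print(top_vectors(counts))  # [3, 200, 10, 255, 254, 253]
```

In the example, 200 precedes 10 because both have a count of 5 and the larger subscript wins; the remaining places are filled by the highest subscripts among the zero counts.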

(10) Taking the partitioning results of the data file under the six partitioning vectors V1 to V6, send each data block to the database server corresponding to its ending byte for duplicate detection.

As stated above, if the ending byte of a data block is X, the data block is sent to Server (X).

(11) Each database server computes the hash value of each data block it receives, uses the hash value to determine whether the data block is duplicate data (i.e., identical to an already stored data block), and sends the result to the interface server.

(12) Based on the results received in step 11, the interface server selects from the six partitioning modes the two with the largest amount of duplicate data (choosing at random among modes with equal amounts). It notifies the corresponding database servers to store the data blocks of these two partitioning modes. The interface server itself saves the relevant information about the data file, including its two partitioning modes and the corresponding database servers.

The database servers store data as in step 4: for a duplicate data block, only a pointer and the corresponding hash value are kept; for a non-duplicate data block, the data block and its hash value are saved.
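The selection in step 12 can be sketched as below. The text does not define how the amount of duplicate data is measured; this sketch assumes it is the total number of bytes in blocks that the database servers reported as duplicates. The function name `pick_partitionings`, the shape of `results`, and the `rng` parameter are hypothetical.

```python
import random
from typing import Optional


def pick_partitionings(results: dict[int, list[tuple[int, bool]]],
                       keep: int = 2,
                       rng: Optional[random.Random] = None) -> list[int]:
    """Choose the `keep` values of V whose blocks contain the most duplicates.

    `results` maps each candidate V to one (block size, is duplicate) pair
    per block, as reported back by the database servers.
    """
    rng = rng or random.Random()
    dup_bytes = {v: sum(size for size, dup in blocks if dup)
                 for v, blocks in results.items()}
    candidates = list(dup_bytes)
    rng.shuffle(candidates)   # random order first ...
    # ... then a stable sort, so ties keep their shuffled (random) order.
    candidates.sort(key=dup_bytes.get, reverse=True)
    return candidates[:keep]


reports = {
    7: [(10, True), (5, False)],    # 10 duplicate bytes
    42: [(20, True)],               # 20 duplicate bytes
    200: [(3, True), (4, True)],    # 7 duplicate bytes
}
print(pick_partitionings(reports))  # [42, 7]
```

Shuffling before a stable sort is one simple way to realize the random choice among equal amounts; passing a seeded `random.Random` makes the choice reproducible.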

Two partitioning modes are retained for redundancy: if a database server used by one partitioning mode fails, the data file can still be reassembled from the other partitioning mode.
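The redundancy argument can be illustrated with a small reassembly routine. Everything here is assumed for illustration: the text does not specify the format of the saved file information, so a partitioning is represented as an ordered list of (server number, hash value) pairs, and `fetch` is a hypothetical callback that raises `ConnectionError` when a server is unavailable.

```python
from typing import Callable


def reassemble(partitionings: list[list[tuple[int, str]]],
               fetch: Callable[[int, str], bytes]) -> bytes:
    """Rebuild a file from the first partitioning whose servers all respond."""
    for blocks in partitionings:
        try:
            # Concatenate the blocks in their recorded order.
            return b"".join(fetch(server, digest) for server, digest in blocks)
        except ConnectionError:
            continue   # a server is down; try the other partitioning
    raise RuntimeError("no partitioning could be fully retrieved")


# Toy data: two partitionings of b"hello" spread over four servers.
stored = {(1, "h1"): b"he", (2, "h2"): b"llo",
          (3, "h3"): b"hel", (4, "h4"): b"lo"}
down = {2}   # pretend Server (2) has failed


def fetch(server: int, digest: str) -> bytes:
    if server in down:
        raise ConnectionError(f"Server ({server}) is unavailable")
    return stored[(server, digest)]


print(reassemble([[(1, "h1"), (2, "h2")], [(3, "h3"), (4, "h4")]], fetch))
# b'hello'
```

The fallback only helps when the two partitionings do not depend on the same failed server, which is more likely when they use different ending bytes.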

Moreover, selecting two out of six partitioning modes in the above steps is only one preferred embodiment; those skilled in the art can choose other numbers as circumstances require, such as 2 out of 3 or 3 out of 5.

The above is only a preferred embodiment of the invention. All equivalent changes or modifications made according to the structures, features, and principles described within the scope of this patent application are therefore included in the scope of this patent application.

Claims

1. A data deduplication method, applied to a system comprising an interface server and multiple database servers, wherein the interface server manages the storage of data files and the database servers actually store the data, characterized in that the method comprises the following steps:

Step 100: classify data blocks by their last byte, and assign to each class of data block a database server that processes and stores it;

Step 200: the interface server sets a minimum data block length; a data file to be deduplicated that is shorter than the minimum length is sent directly to the database server corresponding to it; otherwise the data file is partitioned using different ending bytes, on the following principle: except for the last block, each block is at least the minimum length long, and all such blocks end with the same byte;

Step 300: among the six partitioning modes with the most blocks, the interface server selects the two with the largest amount of duplicate data and instructs the corresponding database servers to store the blocks;

Step 400: for a duplicate data block, the database server stores only a pointer to the already stored identical data block; for a non-duplicate data block, it stores the entire data block and its hash value;

(1) The interface server receives a data file to be stored;

The interface server acts as the interface between the whole system and the outside world, receives data files sent from outside, and stores them in the database servers using the subsequent steps;

(2) The interface server checks the length L of the data file; if L is less than the predefined minimum data block length MinBlockLength, it extracts the last byte B of the data file and goes to step 3; if L >= MinBlockLength, it goes to step 5;

(3) The 256 database servers in the system are numbered in advance and named Server (i), where 0≤i≤255; the interface server sends the data file to Server (B) and saves the relevant information about the data file;

Data blocks are classified by their last byte; since a byte can take 256 values, data blocks are divided into 256 classes, each assigned to one of the 256 database servers, the number of each database server being equal to the value of the ending byte of the class of data blocks it is responsible for;

The relevant information about the data file includes its name, its size, and the number of the database server, so that the interface server can look up the data file;

(4) Server (B) computes the hash value of the data file and uses it to determine whether the data file is already stored on the server; if it is, the data file is duplicate data, and only a pointer to the stored data is kept for it; if it is not, Server (B) stores the data file and its hash value; the method then ends;

Because the database server keeps the hash value of every data file or data block it stores, comparing the hash value of the data file with the existing hash values is enough to determine whether the data file is duplicate data; if it is, the entire data file need not be stored again;

The hash algorithm includes but is not limited to: MD4, MD5, SHA-1, SHA-256;

(5) The interface server prepares to partition the data file, first setting the initial partitioning vector V=0;

(6) The interface server scans forward (toward the end of the file) starting from the MinBlockLength-th byte of the data file; when it reaches a byte equal to the partitioning vector V, it records the position P₁ of that byte, then resumes scanning from position P₁+MinBlockLength to find and record the next byte equal to the partitioning vector, and so on until the end of the data file;

(7) Using each position recorded in step 6 as the end of a data block, the data file is partitioned into one or more data blocks; let the number of data blocks obtained be K_V;

The data blocks obtained in step 7 fall into two classes: the first class consists of data blocks ending with the partitioning vector V, and the second class is the last data block of the data file, which ends with B; the number of data blocks of the second class can only be 1 or 0, and the number of data blocks of the first class may also be 0, depending on the content of the data file;

(8) The partitioning vector V is incremented by 1; if V≤255, return to step 6; otherwise continue to step 9;

Steps 6 to 8 form a loop that scans the file once for each partitioning vector value from 0 to 255, yielding a number of different partitionings of the data into blocks, with block counts K₀ to K₂₅₅;

(9) Sort K₀ to K₂₅₅ in descending order, placing the one with the larger subscript first among equal K_V, and take the first 6 values, denoted K_V1, K_V2, K_V3, K_V4, K_V5, K_V6;

(10) Taking the partitioning results of the data file under the six partitioning vectors V1 to V6, send each data block to the database server corresponding to its ending byte for duplicate detection;

If the ending byte of a data block is X, the data block is sent to Server (X);

(11) Each database server computes the hash value of each data block it receives, uses the hash value to determine whether the data block is duplicate data, and sends the result to the interface server;

(12) Based on the results received in step 11, the interface server selects from the six partitioning modes the two with the largest amount of duplicate data, choosing at random among modes with equal amounts; it notifies the corresponding database servers to store the data blocks of these two partitioning modes; the interface server itself saves the relevant information about the data file, including its two partitioning modes and the corresponding database servers;

The database servers store data as in step 4: for a duplicate data block, only a pointer and the corresponding hash value are kept; for a non-duplicate data block, the data block and its hash value are saved.