CN108241639A

CN108241639A - A data deduplication method

Info

Publication number: CN108241639A
Application number: CN201611207408.1A
Authority: CN
Inventors: 王焰辉; 李振钊; 曾刚
Original assignee: Space Star Technology (beijing) Co Ltd
Current assignee: Space Star Technology (beijing) Co Ltd
Priority date: 2016-12-23
Filing date: 2016-12-23
Publication date: 2018-07-03
Anticipated expiration: 2036-12-23
Also published as: CN108241639B

Abstract

The invention discloses a data deduplication method comprising: classifying data blocks by their last byte, and assigning to each class a database server that processes and stores the blocks of that class; an interface server sets a minimum data block length, and a data file to be deduplicated that is shorter than this minimum length is sent directly to its corresponding database server, while any other file is partitioned into blocks using different trailing bytes; among the six partitioning schemes that yield the most blocks, the interface server selects the two schemes with the largest amount of duplicate data and instructs the corresponding database servers to store the blocks; for a duplicate data block, a database server stores only a pointer to the identical block already stored; for a non-duplicate data block, it stores the entire block together with its hash value.

Description

A data deduplication method

【Technical field】

The invention belongs to the field of computers and databases and, in particular, relates to a data deduplication method.

【Background technology】

In recent years, the concept of big data has emerged to deal with very large volumes of information. Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within an acceptable time frame. They are massive, fast-growing, and diverse information assets that require new processing models in order to deliver stronger decision-making power, insight discovery, and process optimization.

Because of the sheer volume of such data, people cannot analyze it unaided. However, with the wave of technological innovation represented by cloud computing, data that was once hard to collect and use has become easier to exploit, and through continuous innovation across industries, big data is gradually creating more value.

Although more and more computers are used for big data analysis and their performance keeps improving, they still struggle with massive data. The first step of big data analysis is therefore to detect and eliminate duplicate data. Deduplication reduces the use of storage space and network bandwidth on the one hand, and reduces the amount of data to be analyzed on the other.

A common deduplication method in the prior art detects duplicate data by comparing the hash values of entire data files. This detection method is too simple, and its detection rate is low.

【Summary of the invention】

To solve the above problem of the prior art, the present invention proposes a new data deduplication method. The technical solution is as follows:

A data deduplication method, comprising the following steps:

Step 100: Data blocks are classified by their last byte, and for each class a database server is assigned to process and store the data blocks of that class.

Step 200: The interface server sets a minimum data block length. A data file to be deduplicated that is shorter than this minimum length is sent directly to its corresponding database server. Otherwise, the data file is partitioned into blocks using different trailing bytes, on the following principle: except for the last block, every block has a length not less than the minimum length, and all such blocks end with the same byte.

Step 300: Among the six partitioning schemes that yield the most blocks, the interface server selects the two schemes with the largest amount of duplicate data and instructs the corresponding database servers to store the blocks.

Step 400: For a duplicate data block, the database server stores only a pointer to the identical data block already stored; for a non-duplicate data block, it stores the entire data block and its hash value.

Further, the database server determines whether a data block is a duplicate data block based on the hash value of the data block.

Further, the hash value is computed using the MD5 algorithm.

Further, the hash value is computed using the SHA-1 algorithm.

Further, the hash value is computed using the SHA-256 algorithm.

The technical effect of the invention is that the detection rate of duplicate data is improved, and the amount of data to be analyzed and the storage space occupied in big data analysis are reduced.

【Description of the drawings】

The drawings described here provide a further understanding of the invention and form part of this application, but do not unduly limit the invention. In the drawings:

Fig. 1 is the basic flow chart of the method of the invention.

【Detailed description of the embodiments】

The invention is described in detail below with reference to the drawing and specific embodiments. The illustrative examples and explanations are intended only to explain the invention and do not unduly limit it.

The system to which the method is applied comprises an interface server and multiple database servers. The interface server is responsible for ingesting data files into storage, and the database servers actually store the data. To store massive amounts of data, the preferred embodiment uses 256 database servers; this is, of course, intended for large data storage systems. A small business that wants to reduce cost may merge several of these servers into one, thereby reducing the number of database servers.

On the basis of the above system structure, the basic steps of the method are Steps 100 to 400 as set out in the summary above.

Based on these basic steps, the detailed procedure of the method is as follows:

(1) The interface server receives a data file that needs to be ingested and stored.

As the interface between the whole system and the outside world, the interface server receives data files sent from outside and is responsible for storing them in the database servers using the subsequent steps. A typical example is a Web server on the Internet acting as the interface server, receiving data files uploaded by users and storing them. There may also be multiple interface servers; the invention places no limit on their number.

(2) The interface server checks the length L of the data file. If L is less than the predefined minimum data block length MinBlockLength, it extracts the last byte B of the data file and goes to step 3. If L ≥ MinBlockLength, it goes to step 5.

All lengths above are in bytes. Since a byte is 8 bits, necessarily 0 ≤ B ≤ 255. The minimum data block length is the minimum block length used when the invention partitions a file; its specific value can be set by an administrator according to circumstances. In a preferred case, MinBlockLength = 1024 bytes.
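By way of illustration only, the length check of step 2 might be sketched in Python as follows. The function name `dispatch`, the returned labels, and the rejection of empty files are assumptions made for this sketch; the description does not specify them.

```python
MIN_BLOCK_LENGTH = 1024  # preferred value from the description, in bytes


def dispatch(data: bytes, min_block_length: int = MIN_BLOCK_LENGTH):
    """Decide how the interface server handles a file (step 2).

    Returns ("send_whole", B) for a short file, where B is its last byte,
    or ("partition", None) for a file that must be split into blocks.
    """
    if not data:
        # Assumption: empty files are not covered by the description.
        raise ValueError("empty file has no last byte")
    if len(data) < min_block_length:
        return ("send_whole", data[-1])  # go to step 3 with B
    return ("partition", None)  # go to step 5
```

Indexing a `bytes` object in Python yields an integer from 0 to 255, which matches the range of B stated above.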

(3) The 256 database servers in the system are numbered in advance and named Server(i), where 0 ≤ i ≤ 255. The interface server sends the data file to Server(B) and saves the relevant information about the data file.

The invention classifies data blocks according to the last byte of each block. Since one byte can take 256 values, data blocks are divided into 256 classes, which are assigned correspondingly to the 256 database servers; the number of each database server equals the class (that is, the value of the last byte) of the data blocks it is responsible for.

Using 256 database servers is the preferred embodiment; its implementation cost is relatively high, and it suits large-scale data storage systems. If cost must be reduced, database servers can be shared, that is, multiple data block classes share one database server, which then carries multiple numbers. This does not affect how the method is carried out.
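The classification by last byte, including the case in which several classes share one physical server, might be sketched as follows. The modulo mapping is only one assumed way of sharing servers; the description requires only that several classes may be assigned to the same server. The names `block_class` and `physical_server` are hypothetical.

```python
def block_class(block: bytes) -> int:
    """Class of a data block: the value of its last byte (0..255)."""
    if not block:
        raise ValueError("empty block has no last byte")
    return block[-1]


def physical_server(block: bytes, num_servers: int = 256) -> int:
    """Index of the physical server that handles the block.

    With 256 servers this is simply Server(last byte). With fewer servers,
    this sketch assumes classes are folded onto servers by modulo, so one
    server carries several numbers.
    """
    if not 1 <= num_servers <= 256:
        raise ValueError("num_servers must be between 1 and 256")
    return block_class(block) % num_servers
```

For example, a block ending in the byte `c` (value 99) goes to Server(99) in the preferred embodiment, and to physical server 3 if only 16 servers are deployed under this assumed mapping.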

The relevant information about the data file includes its name, its size, the number of the database server, and so on, so that the interface server can later look up the data file.

(4) Server(B) computes the hash value of the data file and uses it to determine whether the data file has already been stored on that server. If it has, the data file is duplicate data, and only a pointer to the stored data is kept for it. If it has not, Server(B) stores the data file and its hash value. The method then ends.

Because the database server keeps the hash value of every stored data file or data block, comparing the hash value of the data file with the existing hash values is enough to determine whether the data file is duplicate data; if it is, the entire data file need not be stored again.

The hash algorithm used by the invention can be any hash algorithm known in the art, including but not limited to MD4, MD5, SHA-1, and SHA-256.
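The storage behavior of step 4 might be sketched as below, using Python's standard `hashlib` module. The class name `DatabaseServer`, its in-memory dictionaries, and the representation of a pointer as a stored digest are assumptions for illustration; a real server would use persistent storage.

```python
import hashlib


class DatabaseServer:
    """In-memory stand-in for one database server (hypothetical)."""

    def __init__(self, algorithm: str = "sha256"):
        self.algorithm = algorithm  # e.g. "md5", "sha1", "sha256"
        self.blocks = {}    # hash value -> stored data
        self.pointers = []  # one entry per duplicate, pointing at a hash

    def hash_value(self, data: bytes) -> str:
        return hashlib.new(self.algorithm, data).hexdigest()

    def is_duplicate(self, data: bytes) -> bool:
        return self.hash_value(data) in self.blocks

    def store(self, data: bytes):
        """Store data; return (hash value, whether it was a duplicate)."""
        digest = self.hash_value(data)
        if digest in self.blocks:
            self.pointers.append(digest)  # keep only a pointer
            return digest, True
        self.blocks[digest] = data  # keep the data and its hash value
        return digest, False
```

Storing the same bytes twice therefore leaves one copy of the data and one pointer, which is the saving the method aims for.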

(5) The interface server prepares to partition the data file into blocks, first setting the initial partitioning vector V = 0.

(6) The interface server scans forward starting from the MinBlockLength-th byte of the data file. When it reaches a byte equal to the partitioning vector V, it records the position P₁ of that byte. It then resumes scanning from position P₁ + MinBlockLength, finds and records the next byte position equal to V, and so on until the end of the data file. In other words, each scan starts at a distance of MinBlockLength from the previously recorded position, until the end of the data file is reached.

(7) Using each position recorded in step 6 as the end of a data block, the data file is partitioned, yielding one or more data blocks. Let the number of data blocks obtained be K_V.

The data blocks obtained in step 7 may be of two kinds. The first kind consists of data blocks ending with the partitioning vector V; the second kind is the last data block of the data file, that is, the block ending with the file's last byte B. There can be only 1 or 0 blocks of the second kind, and the number of blocks of the first kind may also be 0, depending on the content of the data file.
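Steps 6 and 7 for a single value of V might be sketched as follows. Positions in the description are counted from 1, whereas the sketch uses Python's 0-based indices, so the MinBlockLength-th byte is index `min_len - 1`. The function names are hypothetical.

```python
def cut_positions(data: bytes, v: int, min_len: int) -> list:
    """0-based indices of the block-ending bytes recorded in step 6."""
    cuts = []
    start = min_len - 1  # index of the MinBlockLength-th byte
    while True:
        p = data.find(bytes([v]), start)
        if p == -1:
            break  # reached the end of the file
        cuts.append(p)
        start = p + min_len  # resume MinBlockLength after the last cut
    return cuts


def split_blocks(data: bytes, v: int, min_len: int) -> list:
    """Partition data at the recorded positions (step 7)."""
    blocks = []
    begin = 0
    for p in cut_positions(data, v, min_len):
        blocks.append(data[begin:p + 1])  # block ends with byte v
        begin = p + 1
    if begin < len(data):
        blocks.append(data[begin:])  # last block, ends with the file's last byte
    return blocks
```

With `data = b"aaXaaXaXaa"`, `v = ord("X")`, and `min_len = 3`, the blocks are `b"aaX"`, `b"aaX"`, and `b"aXaa"`, so K_V = 3. The `X` at index 7 is not a cut because it lies fewer than three bytes after the previous one.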

(8) The partitioning vector V is increased by 1. If V ≤ 255, return to step 6; otherwise continue to step 9.

Steps 6 to 8 form a loop that scans the file for each partitioning vector value from 0 to 255, producing multiple partitionings of the data with block counts K₀ to K₂₅₅. The invention is written as a loop only for convenience of description; in practice, as those skilled in the art will appreciate, a single scan of the data file can accomplish all 256 iterations, which improves execution efficiency.
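The description does not say how the single scan is organized. One possible arrangement, offered only as a sketch, keeps for each of the 256 byte values the earliest index at which the next cut is allowed, and updates that state as each byte is read. The function name and the 0-based indexing are assumptions.

```python
def block_counts_single_pass(data: bytes, min_len: int) -> list:
    """Return [K_0, ..., K_255] after one pass over data."""
    next_allowed = [min_len - 1] * 256  # earliest index of the next cut, per V
    last_cut = [-1] * 256               # index of the most recent cut, per V
    cuts = [0] * 256                    # number of cuts recorded, per V
    for i, b in enumerate(data):
        if i >= next_allowed[b]:
            cuts[b] += 1
            last_cut[b] = i
            next_allowed[b] = i + min_len
    end = len(data) - 1
    # A trailing block exists unless the final cut falls on the last byte.
    return [cuts[v] + (1 if last_cut[v] != end else 0) for v in range(256)]
```

Each byte of the file is examined once, and only the state of its own value is touched, so the cost is proportional to the file length rather than to 256 times the file length.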

(9) K₀ to K₂₅₅ are sorted in descending order (if some K_V are equal, the one with the larger subscript comes first), and the first six (that is, largest) values are taken, denoted K_V1, K_V2, K_V3, K_V4, K_V5, K_V6.
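The ordering rule of step 9, including the tie-break on the subscript, might be sketched as follows; the function name `top_six` is hypothetical.

```python
def top_six(counts: list) -> list:
    """Return the six values V with the largest K_V.

    counts[v] is K_V. Ties are broken in favor of the larger subscript,
    which sorting on the pair (K_V, V) in descending order achieves.
    """
    order = sorted(range(len(counts)), key=lambda v: (counts[v], v), reverse=True)
    return order[:6]
```

If every K_V is 1 except K₁₀ = K₂₀₀ = 5 and K₃ = 4, the result is V1 to V6 = 200, 10, 3, 255, 254, 253.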

(10) For the partitionings of the data file produced by the six partitioning vectors V1 to V6, each data block is sent, according to its last byte, to the corresponding database server for duplicate detection.

As described above, if the last byte of a data block is X, the block is sent to Server(X).

(11) Each database server computes the hash value of every data block it receives, uses that hash value to determine whether the block is duplicate data (that is, identical to a data block already stored), and sends the result to the interface server.
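Steps 10 and 11 for one partitioning scheme might be sketched as follows. The sketch assumes that the amount of duplicate data is measured in bytes, which the description does not state, and it models each server's stored hash values as a plain set keyed by server number. SHA-256 is chosen arbitrarily from the algorithms listed, and the function name is hypothetical.

```python
import hashlib


def duplicate_amount(blocks: list, stored_hashes: dict) -> int:
    """Total size in bytes of the blocks reported as duplicates.

    blocks: the data blocks of one partitioning scheme.
    stored_hashes: server number -> set of hash values already stored there.
    """
    total = 0
    for block in blocks:
        server = block[-1]  # Server(X), X being the last byte
        digest = hashlib.sha256(block).hexdigest()
        if digest in stored_hashes.get(server, set()):
            total += len(block)  # assumption: amount counted in bytes
    return total
```

Because a block is only ever compared with blocks on the one server selected by its last byte, each server needs to search only its own hash values.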

(12) Based on the results received in step 11, the interface server selects from the six partitioning schemes the two with the largest amount of duplicate data (choosing at random among schemes with equal amounts). It then notifies the corresponding database servers to store the blocks of these two schemes. The interface server itself saves the relevant information about the data file, including its two partitioning schemes and the corresponding database servers.

The database servers store data in the same way as in step 4: for a duplicate data block, only a pointer and the corresponding hash value are kept; for a non-duplicate data block, the data block and its hash value are saved.
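The selection in step 12 might be sketched as follows. The input mapping from partitioning vector to duplicate amount and the function name `select_two` are assumptions. Random tie-breaking is obtained here by shuffling before a stable sort, which is one possible technique and not one prescribed by the description.

```python
import random


def select_two(duplicate_amounts: dict, rng=None) -> list:
    """Pick the two partitioning vectors with the most duplicate data.

    duplicate_amounts: partitioning vector V -> amount of duplicate data.
    """
    rng = rng or random.Random()
    items = list(duplicate_amounts.items())
    rng.shuffle(items)  # random order among equal amounts
    # Python's sort is stable, so ties keep the shuffled order.
    items.sort(key=lambda item: item[1], reverse=True)
    return [v for v, _ in items[:2]]
```

Passing a seeded `random.Random` instance makes the tie-breaking reproducible, which can be convenient when testing.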

Two partitioning schemes are retained for the sake of redundant backup: if a database server used by one scheme fails, the original data file can still be reassembled from the other scheme.

In addition, selecting two of six partitioning schemes in the above steps is only one preferred embodiment; those skilled in the art can choose other numbers as circumstances require, for example selecting 2 out of 3, or 3 out of 5.

The above is only a preferred embodiment of the invention; all equivalent changes or modifications made according to the constructions, features, and principles described within the scope of this patent application therefore fall within the scope of this patent application.

Claims

1. A data deduplication method, characterized in that the method comprises the following steps:

Step 100: Data blocks are classified by their last byte, and for each class a database server is assigned to process and store the data blocks of that class;

Step 200: The interface server sets a minimum data block length; a data file to be deduplicated that is shorter than this minimum length is sent directly to its corresponding database server; otherwise, the data file is partitioned into blocks using different trailing bytes, on the following principle: except for the last block, every block has a length not less than the minimum length, and all such blocks end with the same byte;

Step 300: Among the six partitioning schemes that yield the most blocks, the interface server selects the two schemes with the largest amount of duplicate data and instructs the corresponding database servers to store the blocks;

Step 400: For a duplicate data block, the database server stores only a pointer to the identical data block already stored; for a non-duplicate data block, it stores the entire data block and its hash value.

2. The data deduplication method according to claim 1, characterized in that the database server determines whether a data block is a duplicate data block based on the hash value of the data block.

3. The data deduplication method according to claim 1 or 2, characterized in that the hash value is computed using the MD5 algorithm.

4. The data deduplication method according to claim 1 or 2, characterized in that the hash value is computed using the SHA-1 algorithm.

5. The data deduplication method according to claim 1 or 2, characterized in that the hash value is computed using the SHA-256 algorithm.