CN108241639A - A data deduplication method - Google Patents

A data deduplication method

Info

Publication number
CN108241639A
Authority
CN
China
Prior art keywords
data
data block
block
database server
duplicate removal
Prior art date
Application number
CN201611207408.1A
Other languages
Chinese (zh)
Other versions
CN108241639B (en)
Inventor
王焰辉
李振钊
曾刚
Original Assignee
航天星图科技(北京)有限公司
Priority date
Filing date
Publication date
Application filed by 航天星图科技(北京)有限公司
Priority to CN201611207408.1A
Publication of CN108241639A
Application granted
Publication of CN108241639B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G06F16/24554 Unary operations; Data partitioning operations
    • G06F16/24556 Aggregation; Duplicate elimination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Abstract

The invention discloses a data deduplication method comprising the following: data blocks are classified by their last byte, and each class is assigned a database server that processes and stores the blocks of that class; an interface server sets a minimum data block length, and a data file to be deduplicated that is shorter than this minimum length is sent directly to its corresponding database server, while any other file is split into blocks using different ending bytes; among the six chunking schemes that yield the most blocks, the interface server selects the two with the largest amount of duplicate data and instructs the corresponding database servers to store them; for a duplicate data block, the database server stores only a pointer to the identical block already stored, and for a non-duplicate data block it stores the entire block together with its hash value.

Description

A data deduplication method

【Technical field】

The invention belongs to the field of computers and databases and, in particular, relates to a data deduplication method.

【Background】

In recent years, the concept of big data has emerged in response to the need to handle very large amounts of information. Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within a tolerable time; it is a massive, fast-growing, and diverse information asset that requires new processing models to deliver stronger decision-making power, insight discovery, and process optimization.

Because of the sheer volume of such data, people can hardly analyze it unaided. However, against the backdrop of technological innovation represented by cloud computing, data that was once hard to collect and use has become easy to exploit, and through continuous innovation across industries, big data is gradually creating more value.

Yet although more and more computers are used for big data analysis and their performance keeps improving, they still struggle with massive data. The first step of big data analysis is therefore to detect and eliminate duplicate data. Deduplication reduces the storage space and network bandwidth consumed on the one hand, and the amount of data to be analyzed on the other.

A common deduplication method in the prior art detects duplicate data by comparing the hash values of entire data files. This detection method is too simple, and its detection rate is low.
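The limitation of whole-file hashing can be illustrated with a minimal sketch. The helper name `file_digest` and the sample data are assumptions made for illustration, and SHA-256 is used only as one possible hash; the point is that a file differing from a stored file by a single byte produces a different digest, so none of the shared content is recognized as duplicate.

```python
import hashlib


def file_digest(data: bytes) -> str:
    """Hash of the entire file, as compared in the prior-art approach."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical sample data: two files that share all but one byte.
original = b"record-0001;" * 1000
edited = original + b"!"

# An exact copy is detected, because the digests match.
same_file_detected = file_digest(original) == file_digest(original)

# A near copy is missed, because any change alters the whole-file digest.
near_copy_detected = file_digest(original) == file_digest(edited)
```

Splitting files into blocks, as the method described here does, is what allows the shared portion of such near copies to be found.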

【Summary of the invention】

To solve the above problem of the prior art, the invention proposes a new data deduplication method. The technical solution is as follows:

A data deduplication method comprising the following steps:

Step 100: Classify data blocks by their last byte, and assign to each class of data block a database server that processes and stores it;

Step 200: The interface server sets a minimum data block length. If a data file to be deduplicated is shorter than this minimum length, it is sent directly to the database server corresponding to the data block; otherwise the data file is split into blocks using different ending bytes, on the following principle: except for the last block, every block is no shorter than the minimum length, and all blocks have the same ending byte.

Step 300: Among the six chunking schemes that yield the most blocks, the interface server selects the two with the largest amount of duplicate data and instructs the corresponding database servers to store them;

Step 400: For a duplicate data block, the database server stores only a pointer to the identical block already stored; for a non-duplicate data block, it stores the entire block and its hash value.

Further, the database server determines whether a data block is a duplicate data block based on the hash value of the data block.

Further, the hash value is calculated using the MD5 algorithm.

Further, the hash value is calculated using the SHA-1 algorithm.

Further, the hash value is calculated using the SHA-256 algorithm.

The technical effect of the invention is that it improves the detection rate of duplicate data and reduces both the amount of data to be analyzed in big data analysis and the storage space occupied.

【Description of the drawings】

The drawing described here is provided for further understanding of the invention and forms part of this application, but does not unduly limit the invention. In the drawing:

Fig. 1 is the basic flow chart of the method of the invention.

【Detailed description】

The invention is described in detail below with reference to the drawing and specific embodiments. The illustrative examples and explanations are intended only to explain the invention and do not unduly limit it.

The system to which the method is applied comprises an interface server and multiple database servers. The interface server is responsible for ingesting data files into storage, and the database servers actually store the data. To store massive data, the preferred embodiment of the invention uses 256 database servers. This, of course, targets large data storage systems; a small business user wishing to reduce cost may merge several of these servers into one, thereby reducing the number of database servers.

On the basis of the above system structure, the basic steps of the method are as follows:

Step 100: Classify data blocks by their last byte, and assign to each class of data block a database server that processes and stores it;

Step 200: The interface server sets a minimum data block length. If a data file to be deduplicated is shorter than this minimum length, it is sent directly to the database server corresponding to the data block; otherwise the data file is split into blocks using different ending bytes, on the following principle: except for the last block, every block is no shorter than the minimum length, and all blocks have the same ending byte.

Step 300: Among the six chunking schemes that yield the most blocks, the interface server selects the two with the largest amount of duplicate data and instructs the corresponding database servers to store them;

Step 400: For a duplicate data block, the database server stores only a pointer to the identical block already stored; for a non-duplicate data block, it stores the entire block and its hash value.

Based on the above basic steps, the method is carried out as follows:

(1) The interface server receives a data file that needs to be stored.

As the interface between the whole system and the outside world, the interface server receives data files sent from outside and is responsible for storing them in the database servers using the subsequent steps. A typical example is a Web server on the Internet, which acts as the interface server by receiving and storing data files uploaded by users. The invention may also have multiple interface servers; it places no limit on their number.

(2) The interface server checks the length L of the data file. If L is less than the predefined minimum data block length MinBlockLength, it extracts the last byte B of the data file and goes to step 3. If L ≥ MinBlockLength, it goes to step 5.

All lengths above are in bytes. Since a byte has 8 bits, necessarily 0 ≤ B ≤ 255. The minimum data block length is the minimum block length used when the invention splits a file into blocks; its value can be set by an administrator as circumstances require. In one preferred case, MinBlockLength = 1024 bytes.

(3) The 256 database servers in the system are numbered in advance and designated Server(i), where 0 ≤ i ≤ 255. The interface server sends the data file to Server(B) and saves the relevant information about the data file.

The invention classifies data blocks by their ending byte. Since a byte can take 256 values, data blocks are divided into 256 classes and assigned correspondingly to the 256 database servers; the number of each database server equals the class (i.e., the ending byte value) of the data blocks it is responsible for.

Using 256 database servers is the preferred embodiment of the invention; its implementation cost is high, and it suits large-scale data storage systems. If cost must be reduced, database servers can be shared, i.e., multiple data block classes share one database server, which then carries multiple numbers. This does not affect how the method is carried out.
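The routing rule above can be sketched in a few lines. The function name `server_index` and the modulo mapping used when fewer than 256 physical servers are available are assumptions made for illustration; the text only states that several block classes may share one server, not how classes are grouped.

```python
def server_index(block: bytes, num_servers: int = 256) -> int:
    """Return the number of the database server responsible for a block.

    With 256 servers the result is simply the ending byte, i.e. Server(B).
    With fewer servers, classes are folded together with a modulo; this
    grouping is an illustrative assumption, not taken from the source.
    """
    if not block:
        raise ValueError("an empty block has no ending byte")
    if not 1 <= num_servers <= 256:
        raise ValueError("num_servers must be between 1 and 256")
    ending_byte = block[-1]  # an int in the range 0..255
    return ending_byte % num_servers
```

For instance, a block ending in byte 0x07 goes to Server(7) in the 256-server layout, and a block ending in 0xFF goes to server 15 when only 16 servers are deployed under this assumed mapping.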

The relevant information about the data file includes its name, its size, the number of the database server, and so on, so that the interface server can later look up the data file.

(4) Server(B) calculates the hash value of the data file and uses it to determine whether the data file is already stored on the server. If it is, the data file is duplicate data, and only a pointer to the stored data is kept for it; if it is not, Server(B) stores the data file and its hash value. The method then ends.

Because the database server keeps the hash value of every data file or data block it stores, comparing the hash value of the data file with the existing hash values determines whether the data file is duplicate data; if it is, the entire data file need not be stored again.

The hash algorithm used by the invention can be any hash algorithm known in the art, including but not limited to MD4, MD5, SHA-1, and SHA-256.
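The storage behavior of a single database server in step 4 might look like the following minimal in-memory sketch. The class name `BlockStore`, its methods, and the use of the digest string as the "pointer" are assumptions made for illustration; SHA-256 is chosen from the algorithms listed above, and a real server would persist data rather than hold it in a dictionary.

```python
import hashlib


class BlockStore:
    """Hypothetical in-memory stand-in for one database server."""

    def __init__(self) -> None:
        # Maps hash value -> stored bytes; each distinct block is kept once.
        self.blocks: dict[str, bytes] = {}

    def is_duplicate(self, block: bytes) -> bool:
        """Check a block against the hash values already stored."""
        return hashlib.sha256(block).hexdigest() in self.blocks

    def put(self, block: bytes) -> tuple[str, bool]:
        """Store a block and return (pointer, was_duplicate).

        For a duplicate, nothing new is stored and the returned pointer
        refers to the identical block that is already present.
        """
        digest = hashlib.sha256(block).hexdigest()
        duplicate = digest in self.blocks
        if not duplicate:
            self.blocks[digest] = block
        return digest, duplicate

    def get(self, pointer: str) -> bytes:
        """Follow a pointer back to the stored bytes."""
        return self.blocks[pointer]
```

Storing the same content twice through `put` returns the same pointer both times and leaves only one copy in `blocks`.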

(5) The interface server prepares to split the data file into blocks, first setting the initial chunking vector V = 0. (The chunking vector is the byte value at which blocks are made to end.)

(6) The interface server scans forward from the MinBlockLength-th byte of the data file. When it reaches a byte equal to the chunking vector, it records the position P1 of that byte, then resumes scanning from position P1 + MinBlockLength to find and record the next byte equal to the chunking vector, and so on until the end of the data file. In other words, each scan starts at a distance of MinBlockLength from the last recorded position, until the end of the data file is reached.

(7) Using each position recorded in step 6 as the end of a data block, the data file is split into one or more data blocks; let the number of data blocks obtained be K_V.

The data blocks obtained in step 7 may be of two kinds: the first kind ends with the chunking vector V, and the second kind is the last data block of the data file, i.e., the block ending with B. There can be only 1 or 0 blocks of the second kind, and the number of blocks of the first kind may also be 0, depending on the content of the data file.
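Steps 6 and 7 for a single chunking vector can be sketched as follows. The function name `chunk_by_end_byte` and the zero-based indexing are choices made for illustration: the MinBlockLength-th byte of a block starting at index `start` is at index `start + min_len - 1`, so that is where each scan begins.

```python
def chunk_by_end_byte(data: bytes, v: int, min_len: int) -> list[bytes]:
    """Split data into blocks that end with byte value v.

    Every block except possibly the last is at least min_len bytes long
    and ends with v; the last block is whatever remains.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    marker = bytes([v])
    blocks = []
    start = 0
    while True:
        # Earliest index at which the current block may end.
        cut = data.find(marker, start + min_len - 1)
        if cut == -1:
            break
        blocks.append(data[start:cut + 1])
        start = cut + 1
    if start < len(data):
        blocks.append(data[start:])  # trailing block ending with B
    return blocks


example = chunk_by_end_byte(b"abcXdefXgX", ord("X"), 3)
# example == [b"abcX", b"defX", b"gX"]
```

In the example, the final `X` does not end a first-kind block because it lies fewer than three bytes after the previous cut, so `b"gX"` is the trailing block. Concatenating the blocks always reproduces the original file.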

(8) The chunking vector V is increased by 1. If V ≤ 255, return to step 6; otherwise continue with step 9.

Steps 6 to 8 form a loop that scans the file for each chunking vector value from 0 to 255, yielding multiple ways of splitting the file into blocks, with block counts K_0 to K_255. The method is written as a loop only for ease of description; those skilled in the art will appreciate that, in practice, a single scan of the data file can accomplish all 256 iterations and thereby improve efficiency.
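The source does not spell out how the single scan is done; one possible way, sketched here under that caveat, is to keep a separate "current block start" for each of the 256 byte values and update only the entry that matches the byte being read. The function name `count_blocks_all_vectors` is hypothetical.

```python
def count_blocks_all_vectors(data: bytes, min_len: int) -> list[int]:
    """Return [K_0, ..., K_255] using one pass over the data.

    starts[v] is the index where the current block begins under chunking
    vector v; a byte b ends a block for vector b only if that block has
    reached min_len bytes.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    starts = [0] * 256
    cuts = [0] * 256
    for i, b in enumerate(data):
        if i - starts[b] + 1 >= min_len:
            cuts[b] += 1
            starts[b] = i + 1
    n = len(data)
    # Add the trailing block (second kind) when bytes remain after the last cut.
    return [c + (1 if s < n else 0) for c, s in zip(cuts, starts)]
```

This sketch only counts blocks; recording the cut positions per vector in the same loop would also give the block boundaries needed later, at the cost of more memory.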

(9) Sort K_0 to K_255 in descending order (if several K_V are equal, the one with the larger subscript comes first) and take the first (i.e., largest) six values, denoted K_V1, K_V2, K_V3, K_V4, K_V5, K_V6.
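The ordering in step 9, including its tie-breaking rule, can be expressed as a sort on the pair (block count, subscript) in descending order. The function name `top_vectors` is an illustrative choice.

```python
def top_vectors(counts: list[int], n: int = 6) -> list[int]:
    """Return the n chunking vectors with the largest block counts.

    Sorting on (count, vector) in reverse puts larger counts first and,
    among equal counts, the larger subscript first.
    """
    order = sorted(range(len(counts)), key=lambda v: (counts[v], v), reverse=True)
    return order[:n]


# Hypothetical block counts: most vectors give 1 block, a few give more.
sample_counts = [1] * 256
sample_counts[10] = 5
sample_counts[20] = 5
sample_counts[3] = 4
selected = top_vectors(sample_counts)
# selected == [20, 10, 3, 255, 254, 253]
```

Vectors 20 and 10 tie at five blocks, so 20 is listed first; the remaining places are filled by the highest-numbered vectors among those tied at one block.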

(10) Using the chunking results for the data file under the six chunking vectors V1 to V6, each data block is sent, according to its ending byte, to the corresponding database server for duplicate detection.

As described above, if the ending byte of a data block is X, the data block is sent to Server(X).

(11) Each database server calculates the hash value of every data block it receives, uses the hash value to determine whether the block is duplicate data (i.e., identical to a block already stored), and sends the result to the interface server.

(12) Based on the results received in step 11, the interface server selects, from the six chunking schemes, the two with the most duplicate data (choosing at random among schemes with equal amounts). It then notifies the corresponding database servers to store the blocks of these two schemes. The interface server itself saves the relevant information about the data file, including its two chunking schemes and the corresponding database servers.
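The selection in step 12 might be sketched as below. Two points are assumptions made for illustration: the "amount of duplicate data" is measured in bytes (the text does not say whether bytes or block counts are meant), and ties are resolved by input order rather than at random, so that the example is reproducible. The function name and the shape of `results` are hypothetical.

```python
def select_two_schemes(results: dict[int, list[tuple[int, bool]]]) -> list[int]:
    """Pick the two chunking vectors whose blocks contain the most duplicate data.

    results maps each candidate vector to a list of (block_length, is_duplicate)
    pairs, as reported back by the database servers in step 11.
    """
    duplicate_bytes = {
        vector: sum(length for length, is_dup in blocks if is_dup)
        for vector, blocks in results.items()
    }
    ranked = sorted(duplicate_bytes, key=duplicate_bytes.get, reverse=True)
    return ranked[:2]


# Hypothetical reports for four candidate vectors.
reports = {
    1: [(10, True), (5, False)],   # 10 duplicate bytes
    2: [(20, True)],               # 20 duplicate bytes
    3: [(3, True), (4, True)],     # 7 duplicate bytes
    4: [],                         # nothing reported
}
chosen = select_two_schemes(reports)
# chosen == [2, 1]
```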

The database servers store data in the same way as in step 4: for a duplicate data block, only a pointer and the corresponding hash value are kept; for a non-duplicate data block, the block and its hash value are saved.

Two chunking schemes are retained for redundancy: if a database server used by one scheme fails, the original data file can still be reassembled from the other scheme.
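The redundancy argument can be made concrete with a small sketch of reassembly. The record layout (a list of (server number, pointer) pairs per scheme), the `fetch` callback, and the use of `LookupError` to signal an unavailable server or block are all assumptions made for illustration; the source does not describe the retrieval interface.

```python
from typing import Callable

BlockRef = tuple[int, str]  # (database server number, pointer to the block)


def reassemble(schemes: list[list[BlockRef]],
               fetch: Callable[[int, str], bytes]) -> bytes:
    """Rebuild a file from the first retained scheme that can be fully read."""
    for refs in schemes:
        try:
            return b"".join(fetch(server, pointer) for server, pointer in refs)
        except LookupError:
            continue  # a server or block of this scheme is unavailable
    raise LookupError("no retained chunking scheme could be fully read")


# Hypothetical cluster state: Server(99) has failed and is absent.
stores = {
    88: {"k1": b"abcX", "k2": b"defX", "k3": b"gX", "k4": b"XdefXgX"},
}
file_record = [
    [(99, "k0"), (88, "k4")],              # scheme for ending byte "c"
    [(88, "k1"), (88, "k2"), (88, "k3")],  # scheme for ending byte "X"
]
restored = reassemble(file_record, lambda server, pointer: stores[server][pointer])
# restored == b"abcXdefXgX"
```

Because the first scheme depends on the failed server, the sketch falls back to the second scheme and still recovers the file. As the example also shows, two schemes can share a server, so the protection is against a failure that affects only one of them.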

Furthermore, selecting two of six chunking schemes in the above steps is only a preferred embodiment of the invention; those skilled in the art may choose other numbers as circumstances require, for example 2 out of 3 or 3 out of 5.

The above is only a preferred embodiment of the invention; all equivalent changes or modifications made according to the structures, features, and principles described within the scope of this patent application are therefore included within its scope.

Claims (5)

1. A data deduplication method, characterized in that the method comprises the following steps:
Step 100: Classify data blocks by their last byte, and assign to each class of data block a database server that processes and stores it;
Step 200: The interface server sets a minimum data block length. If a data file to be deduplicated is shorter than this minimum length, it is sent directly to the database server corresponding to the data block; otherwise the data file is split into blocks using different ending bytes, on the following principle: except for the last block, every block is no shorter than the minimum length, and all blocks have the same ending byte;
Step 300: Among the six chunking schemes that yield the most blocks, the interface server selects the two with the largest amount of duplicate data and instructs the corresponding database servers to store them;
Step 400: For a duplicate data block, the database server stores only a pointer to the identical block already stored; for a non-duplicate data block, it stores the entire block and its hash value.
2. The data deduplication method according to claim 1, characterized in that the database server determines whether a data block is a duplicate data block based on the hash value of the data block.
3. The data deduplication method according to claims 1-2, characterized in that the hash value is calculated using the MD5 algorithm.
4. The data deduplication method according to claims 1-2, characterized in that the hash value is calculated using the SHA-1 algorithm.
5. The data deduplication method according to claims 1-2, characterized in that the hash value is calculated using the SHA-256 algorithm.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611207408.1A CN108241639B (en) 2016-12-23 2016-12-23 A data deduplication method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611207408.1A CN108241639B (en) 2016-12-23 2016-12-23 A data deduplication method

Publications (2)

Publication Number Publication Date
CN108241639A (en) 2018-07-03
CN108241639B (en) 2019-07-23

Family

ID=62704061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611207408.1A CN108241639B (en) 2016-12-23 2016-12-23 A data deduplication method

Country Status (1)

Country Link
CN (1) CN108241639B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101706825A (en) * 2009-12-10 2010-05-12 华中科技大学 Replicated data deleting method based on file content types
CN103023970A (en) * 2012-11-15 2013-04-03 中国科学院计算机网络信息中心 Method and system for storing mass data of Internet of Things (IoT)
CN103049263A (en) * 2012-12-12 2013-04-17 华中科技大学 Document classification method based on similarity
CN103873506A (en) * 2012-12-12 2014-06-18 鸿富锦精密工业(深圳)有限公司 Data block duplication removing system in storage cluster and method thereof
CN104978151A (en) * 2015-06-19 2015-10-14 浪潮电子信息产业股份有限公司 Application awareness based data reconstruction method in repeated data deletion and storage system

Also Published As

Publication number Publication date
CN108241639B (en) 2019-07-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 101399 No. 2 East Airport Road, Shunyi Airport Economic Core Area, Beijing (1st, 5th and 7th floors of Industrial Park 1A-4)

Applicant after: Zhongke Star Map Co., Ltd.

Address before: 101399 Building 1A-4, National Geographic Information Technology Industrial Park, Guomen Business District, Shunyi District, Beijing

Applicant before: Space Star Technology (Beijing) Co., Ltd.

GR01 Patent grant