CN108241639B - A kind of data duplicate removal method - Google Patents

A kind of data duplicate removal method

Info

Publication number
CN108241639B
Authority
CN
China
Prior art keywords
data
data block
block
server
data file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611207408.1A
Other languages
Chinese (zh)
Other versions
CN108241639A (en)
Inventor
王焰辉
李振钊
曾刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Star Map Co Ltd
Original Assignee
Zhongke Star Map Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongke Star Map Co Ltd
Priority to CN201611207408.1A
Publication of CN108241639A
Application granted
Publication of CN108241639B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G06F16/24554 Unary operations; Data partitioning operations
    • G06F16/24556 Aggregation; Duplicate elimination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Storage Device Security (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data deduplication method comprising the following: data blocks are classified by their last byte, and each class of data block is assigned a database server that processes and stores blocks of that class; an interface server sets a minimum data block length, and a data file to be deduplicated that is shorter than the minimum length is sent directly to the database server corresponding to it; otherwise the data file is partitioned using different ending bytes; from the six partitioning schemes that yield the most blocks, the interface server selects the two with the largest amount of duplicate data and instructs the corresponding database servers to store them; for a duplicate data block, the database server stores only a pointer to the identical block already stored; for a non-duplicate data block, it stores the whole block together with its hash value.

Description

A data deduplication method
[Technical field]
The invention belongs to the field of computers and databases and, in particular, relates to a data deduplication method.
[Background art]
In recent years the concept of big data has emerged to handle large amounts of information. Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within an acceptable time frame. They are massive, fast-growing, and diverse information assets that require new processing models to yield stronger decision-making power, insight discovery, and process optimization.
Because of the sheer volume of the data, people cannot analyze it unaided. However, against the backdrop of the technological innovation represented by cloud computing, data that was once difficult to collect and use is becoming easier to exploit, and through continuous innovation across industries big data is gradually creating more value.
Although more and more computers are used for big data analysis and their performance keeps improving, they still struggle with massive data. The first step of big data analysis is therefore to detect and eliminate duplicate data. Deduplication reduces the use of storage space and network bandwidth on the one hand, and the amount of data to be analyzed on the other.
The common deduplication method in the prior art detects duplicate data by comparing the hash values of entire data files. This detection method is too simple, and its detection rate is low.
[Summary of the invention]
To solve the above problem in the prior art, the invention proposes a new data deduplication method with the following technical solution:
A data deduplication method comprising the following steps:
Step 100: data blocks are classified by their last byte, and each class of data block is assigned a database server that processes and stores blocks of that class;
Step 200: the interface server sets a minimum data block length. A data file to be deduplicated that is shorter than the minimum length is sent directly to the database server corresponding to it; otherwise the data file is partitioned using different ending bytes, on the following principle: except for the last block, every block is at least the minimum length long, and all such blocks end with the same byte.
Step 300: from the six partitioning schemes that yield the most blocks, the interface server selects the two with the largest amount of duplicate data and instructs the corresponding database servers to store them;
Step 400: for a duplicate data block, the database server stores only a pointer to the identical block already stored; for a non-duplicate data block, it stores the whole block and its hash value.
Further, the database server determines whether a data block is a duplicate based on the hash value of the data block.
Further, the hash value is computed with the MD5 algorithm.
Further, the hash value is computed with the SHA-1 algorithm.
Further, the hash value is computed with the SHA-256 algorithm.
The technical effect of the invention is that the detection rate of duplicate data is improved, and both the amount of data to be analyzed and the storage space occupied in big data analysis are reduced.
[Description of the drawings]
The drawings described here provide a further understanding of the invention and form part of this application, but they do not unduly limit the invention. In the drawings:
Fig. 1 is the basic flow chart of the method of the invention.
[Detailed description]
The invention is described in detail below with reference to the drawing and specific embodiments. The illustrative examples and explanations are intended only to explain the invention and are not to be taken as unduly limiting it.
The system to which the method is applied comprises an interface server and multiple database servers. The interface server manages the storage of incoming data files, and the database servers actually store the data. To store massive data, the preferred embodiment of the invention uses 256 database servers; this is, of course, intended for large data storage systems. A small business that wants to reduce cost may merge several of these servers into one to reduce the number of database servers.
On the basis of the above system structure, the basic steps of the method are as follows:
Step 100: data blocks are classified by their last byte, and each class of data block is assigned a database server that processes and stores blocks of that class;
Step 200: the interface server sets a minimum data block length. A data file to be deduplicated that is shorter than the minimum length is sent directly to the database server corresponding to it; otherwise the data file is partitioned using different ending bytes, on the following principle: except for the last block, every block is at least the minimum length long, and all such blocks end with the same byte.
Step 300: from the six partitioning schemes that yield the most blocks, the interface server selects the two with the largest amount of duplicate data and instructs the corresponding database servers to store them;
Step 400: for a duplicate data block, the database server stores only a pointer to the identical block already stored; for a non-duplicate data block, it stores the whole block and its hash value.
Based on the above basic steps, the specific steps of the method are as follows:
(1) The interface server receives a data file that needs to be stored.
The interface server is the interface between the whole system and the outside world: it receives data files sent from outside and is responsible for storing them in the database servers using the subsequent steps. A typical example is a Web server on the Internet acting as the interface server, receiving and storing data files uploaded by users. There may also be more than one interface server; the invention places no limit on their number.
(2) The interface server checks the length L of the data file. If L is less than the predefined minimum data block length MinBlockLength, it extracts the last byte B of the data file and goes to step 3. If L ≥ MinBlockLength, it goes to step 5.
All lengths above are in bytes. Since a byte has 8 bits, necessarily 0 ≤ B ≤ 255. The minimum data block length is the minimum block length used when the invention partitions a file; its value can be set by the administrator as circumstances require. In one preferred case, MinBlockLength = 1024 bytes.
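The length check in step 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name classify_file and the returned tuples are hypothetical, and rejecting an empty file is an assumption, since the description does not say how a file of length 0 is handled.

```python
MIN_BLOCK_LENGTH = 1024  # preferred value from the description, in bytes


def classify_file(data: bytes, min_len: int = MIN_BLOCK_LENGTH):
    """Decide how the interface server handles a file (hypothetical helper).

    Returns ("direct", B) when the file is shorter than min_len, where B is
    the file's last byte (0..255), or ("partition", None) otherwise.
    """
    if not data:
        # Assumption: an empty file has no last byte, so it is rejected.
        raise ValueError("empty file has no last byte")
    if len(data) < min_len:
        return ("direct", data[-1])   # go to step 3 with last byte B
    return ("partition", None)        # go to step 5
```

Indexing a bytes object in Python yields an integer from 0 to 255, which matches the range of B given above.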
(3) The 256 database servers in the system are numbered in advance and named Server(i), where 0 ≤ i ≤ 255. The interface server sends the data file to Server(B) and saves the related information of the data file.
The invention classifies data blocks by their ending byte. Since a byte can take 256 values, data blocks fall into 256 classes, which are assigned one to one to the 256 database servers; the number of each database server equals the class (that is, the ending-byte value) of the data blocks it is responsible for.
Using 256 database servers is the preferred embodiment of the invention. It is relatively costly to implement and suits large-scale data storage systems. If cost must be reduced, database servers may be shared: several data block classes share one database server, which then carries several numbers. This does not affect how the method is carried out.
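One way to share servers among several data block classes is sketched below. The description only states that several classes may share one server; the modulo rule and the names server_for_byte and classes_of_server are assumptions chosen for illustration.

```python
def server_for_byte(end_byte: int, num_servers: int = 256) -> int:
    """Map an ending-byte class (0..255) to a physical server index.

    Assumption: classes are spread over servers with a simple modulo rule.
    With num_servers == 256 this reduces to Server(B) for ending byte B.
    """
    if not 0 <= end_byte <= 255:
        raise ValueError("end_byte must be in 0..255")
    if not 1 <= num_servers <= 256:
        raise ValueError("num_servers must be in 1..256")
    return end_byte % num_servers


def classes_of_server(server: int, num_servers: int) -> list:
    """List the class numbers that one physical server carries."""
    return [b for b in range(256) if b % num_servers == server]
```

With 64 physical servers, for example, each server carries four class numbers, which corresponds to a server "provided with multiple numbers" in the sense described above.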
The related information of the data file includes its name, its size, the number of the database server, and so on, so that the interface server can look up the data file.
(4) Server(B) computes the hash value of the data file and uses it to determine whether the data file is already stored on the server. If it is, the data file is duplicate data, and only a pointer to the stored data is kept for it. If it is not, Server(B) stores the data file and its hash value. The method then ends.
Because the database server keeps the hash value of every stored data file or data block, comparing the hash value of the data file with the existing hash values is enough to determine whether the data file is duplicate data; if it is, the whole data file need not be stored again.
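The storage behaviour of a database server in step 4 can be sketched with an in-memory index. The class DatabaseServer, its attributes, and the choice of SHA-256 are assumptions for illustration; like the description, the sketch treats equal hash values as equal content and does not handle hash collisions.

```python
import hashlib


class DatabaseServer:
    """In-memory stand-in for one database server (illustrative only)."""

    def __init__(self):
        self.blocks = {}    # hash value -> stored bytes (kept once)
        self.pointers = []  # one entry per stored item: the hash it points to

    def is_duplicate(self, data: bytes) -> bool:
        """Check whether identical content is already stored."""
        return hashlib.sha256(data).hexdigest() in self.blocks

    def store(self, data: bytes) -> bool:
        """Store data and return True if it was a duplicate.

        A duplicate adds only a pointer; new content adds the bytes,
        their hash value, and a pointer.
        """
        digest = hashlib.sha256(data).hexdigest()
        duplicate = digest in self.blocks
        if not duplicate:
            self.blocks[digest] = data
        self.pointers.append(digest)
        return duplicate
```

Storing the same content twice leaves one copy in blocks and two entries in pointers.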
The hash algorithm used by the invention may be any hash algorithm in the art, including but not limited to MD4, MD5, SHA-1, and SHA-256.
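A selectable hash function can be sketched with the Python standard library. The helper name block_hash and the SUPPORTED tuple are assumptions; MD4 is left out of the sketch because hashlib does not guarantee it on every platform.

```python
import hashlib

# Algorithms named in the description that hashlib always provides.
SUPPORTED = ("md5", "sha1", "sha256")


def block_hash(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of data under the chosen algorithm."""
    if algorithm not in SUPPORTED:
        raise ValueError("unsupported algorithm: " + algorithm)
    return hashlib.new(algorithm, data).hexdigest()
```

MD5 and SHA-1 have known collision attacks, so a deployment that must rule out deliberately crafted collisions would probably prefer SHA-256.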
(5) The interface server prepares to partition the data file into blocks, first setting the initial partitioning vector V = 0.
(6) The interface server scans from the MinBlockLength-th byte of the data file toward the end of the file. When a scanned byte equals the partitioning vector, it records the position P1 of that byte, then resumes scanning from position P1 + MinBlockLength to find and record the next byte position equal to the partitioning vector, and so on until the end of the data file. In other words, each scan starts at a distance of MinBlockLength from the last recorded position, until the end of the data file is reached.
(7) Using each position recorded in step 6 as the end of a data block, the data file is partitioned, yielding one or more data blocks; let the number of data blocks obtained be K_V.
Step 7 may yield two kinds of data block. The first kind ends with the partitioning vector V. The second kind is the last data block of the data file, which ends with B. There can be only 1 or 0 blocks of the second kind, and the number of blocks of the first kind may also be 0, depending on the content of the data file.
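Steps 6 and 7 for a single partitioning vector can be sketched as follows. The function name split_by_end_byte is hypothetical, positions are 0-based (so the MinBlockLength-th byte has index min_len - 1), and min_len is assumed to be at least 1.

```python
def split_by_end_byte(data: bytes, v: int, min_len: int) -> list:
    """Partition data so that every block except possibly the last
    ends with byte value v and is at least min_len bytes long."""
    blocks = []
    start = 0            # first index of the block being built
    i = min_len - 1      # 0-based index of the MinBlockLength-th byte
    while i < len(data):
        if data[i] == v:
            blocks.append(data[start:i + 1])  # block ends at recorded position
            start = i + 1
            i += min_len                      # resume at P + MinBlockLength
        else:
            i += 1
    if start < len(data):
        blocks.append(data[start:])  # trailing block ending with the last byte B
    return blocks
```

For the 10-byte input b"abcaXbcaYa" with v = ord("a") and min_len = 3, the recorded positions are 3 and 7, giving the blocks b"abca", b"Xbca", and the trailing block b"Ya", so the block count for this vector is 3.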
(8) The partitioning vector V is increased by 1. If V ≤ 255, return to step 6; otherwise continue with step 9.
Steps 6 to 8 form a loop that scans the file once for each partitioning vector value from 0 to 255, producing multiple ways of dividing the file into data blocks, with block counts K_0 to K_255. The loop form is used here only for convenience of description. In practice, as those skilled in the art will appreciate, a single scan of the data file is enough to carry out all 256 iterations, which improves execution efficiency.
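One possible single-scan variant is sketched below. The description does not say how the single scan is organised, so the approach here (keeping, for each of the 256 byte values, the earliest position at which it may next end a block) is an assumption, as is the function name block_counts_single_pass. Positions are 0-based.

```python
def block_counts_single_pass(data: bytes, min_len: int) -> list:
    """Return a list K where K[v] is the number of blocks obtained when
    the file is partitioned with partitioning vector v (0..255)."""
    next_ok = [min_len - 1] * 256  # earliest index where value v may end a block
    last_cut = [-1] * 256          # last recorded position for each value
    cuts = [0] * 256               # number of recorded positions for each value
    for i, b in enumerate(data):
        if i >= next_ok[b]:
            cuts[b] += 1
            last_cut[b] = i
            next_ok[b] = i + min_len
    last = len(data) - 1
    # A trailing block exists unless the last recorded position is the
    # final byte of the file.
    return [cuts[v] + (1 if last_cut[v] != last else 0) for v in range(256)]
```

Each byte of the file updates only the state of its own value, so the 256 scans of the loop form proceed independently within one pass over the data.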
(9) Sort K_0 to K_255 in descending order (if some K_V are equal, the one with the larger subscript comes first) and take the first 6 (that is, the largest) values, denoted K_V1, K_V2, K_V3, K_V4, K_V5, K_V6.
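The ordering rule of step 9 can be sketched with a composite sort key. The function name top_vectors is hypothetical; counts is assumed to be a list of 256 block counts indexed by partitioning vector.

```python
def top_vectors(counts, n=6):
    """Return the n partitioning vectors with the largest block counts.

    Ties are broken in favour of the larger subscript, as in step 9:
    sorting on (count, subscript) in descending order does both at once.
    """
    order = sorted(range(len(counts)), key=lambda v: (counts[v], v), reverse=True)
    return order[:n]
```

For example, if vectors 10 and 20 both give 5 blocks and vector 3 gives 9, the order begins 3, 20, 10.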
(10) Taking the results of partitioning the data file with the six partitioning vectors V1 to V6, each data block is sent, according to its ending byte, to the corresponding database server for duplicate detection.
As stated above, if the ending byte of a data block is X, the data block is sent to Server(X).
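The routing rule can be sketched by grouping blocks by their ending byte. The function name route_blocks and the dictionary return value are assumptions; key X stands for Server(X), and no network transfer is modelled.

```python
from collections import defaultdict


def route_blocks(blocks):
    """Group data blocks by ending byte; key X stands for Server(X)."""
    by_server = defaultdict(list)
    for block in blocks:
        if block:  # an empty block has no ending byte and is skipped
            by_server[block[-1]].append(block)
    return dict(by_server)
```

All blocks of the first kind produced with partitioning vector V end with V and therefore go to Server(V); only the trailing block may go elsewhere.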
(11) Each database server computes the hash value of every data block it receives, uses the hash value to determine whether the data block is duplicate data (that is, identical to a data block already stored), and sends the result to the interface server.
(12) Based on the results received in step 11, the interface server selects from the six partitioning schemes the two with the largest amount of duplicate data (choosing at random if the amounts are equal). It notifies the corresponding database servers to store the blocks of these two schemes. The interface server itself saves the related information of the data file, including its two partitioning schemes and the corresponding database servers.
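The selection in step 12 can be sketched as follows. Measuring the amount of duplicate data in bytes is an assumption (the description does not say whether bytes or blocks are counted), and the names duplicate_bytes and pick_schemes are hypothetical.

```python
import random


def duplicate_bytes(blocks, is_duplicate):
    """Total size of the blocks that the database servers reported as duplicates."""
    return sum(len(b) for b, dup in zip(blocks, is_duplicate) if dup)


def pick_schemes(dup_amount: dict, n: int = 2, rng=None) -> list:
    """Pick the n partitioning vectors with the most duplicate data.

    dup_amount maps a partitioning vector to its duplicate amount.
    Shuffling before a stable sort makes the choice among equal
    amounts random, as step 12 requires.
    """
    rng = rng or random.Random()
    candidates = list(dup_amount)
    rng.shuffle(candidates)
    candidates.sort(key=lambda v: dup_amount[v], reverse=True)
    return candidates[:n]
```

Passing a seeded random.Random instance as rng makes the tie-breaking reproducible in tests.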
The database servers store data as in step 4: for a duplicate data block, only a pointer and the corresponding hash value are kept; for a non-duplicate data block, the data block and its hash value are saved.
Two partitioning schemes are kept for redundancy: if a database server used by one scheme fails, the original data file can still be reassembled from the other scheme.
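Reassembly from either of two stored partitioning schemes can be sketched as follows. The description does not define the format of the saved file information, so the recipe (an ordered list of server number and hash value pairs) and the names store_blocks and restore are assumptions; servers are modelled as in-memory dictionaries.

```python
import hashlib


def store_blocks(blocks, servers):
    """Store blocks on in-memory servers and return the ordered recipe.

    servers maps a server number to a dict of hash value -> block bytes.
    The recipe is a list of (server number, hash value) pairs.
    """
    recipe = []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        server = block[-1]  # Server(X) for ending byte X
        servers.setdefault(server, {})[digest] = block
        recipe.append((server, digest))
    return recipe


def restore(recipes, servers):
    """Rebuild the file from the first recipe whose blocks are all reachable."""
    for recipe in recipes:
        try:
            return b"".join(servers[s][h] for s, h in recipe)
        except KeyError:  # a server or block is unavailable; try the next scheme
            continue
    raise LookupError("no stored partitioning scheme can be fully restored")
```

If the blocks of one scheme sit partly on a failed server, restore falls back to the other recipe, provided that all of its servers are still available.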
In addition, selecting two out of six partitioning schemes in the above steps is only a preferred embodiment of the invention; those skilled in the art may choose other numbers as circumstances require, for example 2 out of 3, or 3 out of 5.
The above is only a preferred embodiment of the invention; all equivalent changes or modifications made to the configurations, features, and principles described within the scope of this patent application are included in that scope.

Claims (1)

1. A data deduplication method, applied to a system comprising an interface server and multiple database servers, wherein the interface server manages the storage of incoming data files and the database servers actually store the data, characterized in that the method comprises the following steps:
Step 100: data blocks are classified by their last byte, and each class of data block is assigned a database server that processes and stores blocks of that class;
Step 200: the interface server sets a minimum data block length; a data file to be deduplicated that is shorter than the minimum length is sent directly to the database server corresponding to it; otherwise the data file is partitioned using different ending bytes, on the following principle: except for the last block, every block is at least the minimum length long, and all such blocks end with the same byte;
Step 300: from the six partitioning schemes that yield the most blocks, the interface server selects the two with the largest amount of duplicate data and instructs the corresponding database servers to store them;
Step 400: for a duplicate data block, the database server stores only a pointer to the identical block already stored; for a non-duplicate data block, it stores the whole block and its hash value;
(1) the interface server receives a data file that needs to be stored;
the interface server is the interface between the whole system and the outside world, receives data files sent from outside, and is responsible for storing them in the database servers using the subsequent steps;
(2) the interface server checks the length L of the data file; if L is less than the predefined minimum data block length MinBlockLength, it extracts the last byte B of the data file and goes to step 3; if L ≥ MinBlockLength, it goes to step 5;
(3) the 256 database servers in the system are numbered in advance and named Server(i), where 0 ≤ i ≤ 255; the interface server sends the data file to Server(B) and saves the related information of the data file;
data blocks are classified by their ending byte; since a byte can take 256 values, data blocks fall into 256 classes, which are assigned one to one to the 256 database servers, and the number of each database server equals the ending-byte value of the class of data blocks it is responsible for;
the related information of the data file includes its name, its size, and the number of the database server, so that the interface server can look up the data file;
(4) Server(B) computes the hash value of the data file and uses it to determine whether the data file is already stored on the server; if it is, the data file is duplicate data, and only a pointer to the stored data is kept for it; if it is not, Server(B) stores the data file and its hash value; the method then ends;
because the database server keeps the hash value of every stored data file or data block, comparing the hash value of the data file with the existing hash values is enough to determine whether the data file is duplicate data, and if it is, the whole data file need not be stored again;
the hash algorithm includes but is not limited to MD4, MD5, SHA-1, and SHA-256;
(5) the interface server prepares to partition the data file into blocks, first setting the initial partitioning vector V = 0;
(6) the interface server scans from the MinBlockLength-th byte of the data file toward the end of the file; when a scanned byte equals the partitioning vector, it records the position P1 of that byte, then resumes scanning from position P1 + MinBlockLength to find and record the next byte position equal to the partitioning vector, and so on until the end of the data file;
(7) using each position recorded in step 6 as the end of a data block, the data file is partitioned, yielding one or more data blocks; let the number of data blocks obtained be K_V;
step 7 may yield two kinds of data block: the first kind ends with the partitioning vector V, and the second kind is the last data block of the data file, which ends with B; there can be only 1 or 0 blocks of the second kind, and the number of blocks of the first kind may also be 0, depending on the content of the data file;
(8) the partitioning vector V is increased by 1; if V ≤ 255, return to step 6; otherwise continue with step 9;
steps 6 to 8 form a loop that scans the file once for each partitioning vector value from 0 to 255, producing multiple ways of dividing the file into data blocks, with block counts K_0 to K_255;
(9) K_0 to K_255 are sorted in descending order, and where K_V values are equal the one with the larger subscript comes first; the first 6 values are taken, denoted K_V1, K_V2, K_V3, K_V4, K_V5, K_V6;
(10) taking the results of partitioning the data file with the six partitioning vectors V1 to V6, each data block is sent, according to its ending byte, to the corresponding database server for duplicate detection;
if the ending byte of a data block is X, the data block is sent to Server(X);
(11) each database server computes the hash value of every data block it receives, uses the hash value to determine whether the data block is duplicate data, and sends the result to the interface server;
(12) based on the results received in step 11, the interface server selects from the six partitioning schemes the two with the largest amount of duplicate data, choosing at random when the amounts are equal; it notifies the corresponding database servers to store the blocks of these two schemes; the interface server itself saves the related information of the data file, including its two partitioning schemes and the corresponding database servers;
the database servers store data as in step 4: for a duplicate data block, only a pointer and the corresponding hash value are kept; for a non-duplicate data block, the data block and its hash value are saved.
CN201611207408.1A 2016-12-23 2016-12-23 A kind of data duplicate removal method Active CN108241639B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611207408.1A CN108241639B (en) 2016-12-23 2016-12-23 A kind of data duplicate removal method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611207408.1A CN108241639B (en) 2016-12-23 2016-12-23 A kind of data duplicate removal method

Publications (2)

Publication Number Publication Date
CN108241639A CN108241639A (en) 2018-07-03
CN108241639B true CN108241639B (en) 2019-07-23

Family

ID=62704061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611207408.1A Active CN108241639B (en) 2016-12-23 2016-12-23 A kind of data duplicate removal method

Country Status (1)

Country Link
CN (1) CN108241639B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110968575B (en) * 2018-09-30 2023-06-06 南京工程学院 Deduplication method of big data processing system
CN112162973A (en) * 2020-09-17 2021-01-01 华中科技大学 Fingerprint collision avoidance, deduplication and recovery method, storage medium and deduplication system
CN112988684A (en) * 2021-03-15 2021-06-18 浪潮云信息技术股份公司 Method and system for extracting and de-duplicating electronic official document data based on Hash algorithm

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101706825A (en) * 2009-12-10 2010-05-12 华中科技大学 Replicated data deleting method based on file content types
CN103023970A (en) * 2012-11-15 2013-04-03 中国科学院计算机网络信息中心 Method and system for storing mass data of Internet of Things (IoT)
CN103049263A (en) * 2012-12-12 2013-04-17 华中科技大学 Document classification method based on similarity
CN103873506A (en) * 2012-12-12 2014-06-18 鸿富锦精密工业(深圳)有限公司 Data block duplication removing system in storage cluster and method thereof
CN104978151A (en) * 2015-06-19 2015-10-14 浪潮电子信息产业股份有限公司 Application awareness based data reconstruction method in repeated data deletion and storage system

Also Published As

Publication number Publication date
CN108241639A (en) 2018-07-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 101399 No. 2 East Airport Road, Shunyi Airport Economic Core Area, Beijing (1st, 5th and 7th floors of Industrial Park 1A-4)

Applicant after: Zhongke Star Map Co., Ltd.

Address before: 101399 Building 1A-4, National Geographic Information Technology Industrial Park, Guomen Business District, Shunyi District, Beijing

Applicant before: Space Star Technology (Beijing) Co., Ltd.

GR01 Patent grant