CN108241639B - A kind of data duplicate removal method - Google Patents
- Publication number
- CN108241639B (application CN201611207408.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- data block
- block
- server
- data file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
- G06F16/24554—Unary operations; Data partitioning operations
- G06F16/24556—Aggregation; Duplicate elimination
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Computational Linguistics (AREA)
- Storage Device Security (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a data deduplication method. The method comprises: classifying data blocks by their last byte, and assigning to each class a database server that processes and stores the blocks of that class; an interface server sets a minimum data block length, and a data file to be deduplicated that is shorter than the minimum length is sent directly to the database server corresponding to its class; otherwise the data file is partitioned into blocks using different trailing bytes; among the six partitioning schemes that yield the most blocks, the interface server selects the two with the largest amount of duplicate data and instructs the corresponding database servers to store the blocks; for a duplicate data block, the database server stores only a pointer to the identical block already stored; for a non-duplicate data block, it stores the entire block and its hash value.
Description
[Technical field]
The invention belongs to the field of computers and databases and, in particular, relates to a data deduplication method.
[Background]
In recent years, the concept of big data has emerged as a way to handle large amounts of information. Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within an acceptable time. It is a massive, fast-growing, and diverse information asset that requires new processing models in order to deliver stronger decision-making power, insight discovery, and process optimization.
Because of the sheer volume of the data, people find it difficult to analyze it by manual effort alone. Against the backdrop of the technological innovation represented by cloud computing, however, data that was once hard to collect and use is becoming easy to exploit, and through continual innovation across industries big data is gradually creating more value.
Although more and more computers are used for big data analysis and their performance keeps improving, they still struggle with massive data. The first step of big data analysis is therefore to detect and eliminate duplicate data. Deduplication reduces the use of storage space and network bandwidth on the one hand, and reduces the amount of data to be analyzed on the other.
A common deduplication method in the prior art detects duplicate data by comparing the hash values of entire data files. This detection method is too simple, and its detection rate is low.
[Summary of the invention]
To solve the above problem in the prior art, the invention proposes a new data deduplication method. The technical solution is as follows:
A data deduplication method, comprising the following steps:
Step 100: classify data blocks by their last byte, and assign to each class a database server that processes and stores the blocks of that class;
Step 200: the interface server sets a minimum data block length. A data file to be deduplicated that is shorter than the minimum length is sent directly to the database server corresponding to its class. Otherwise the data file is partitioned using different trailing bytes, on the following principle: except for the last block, every block is no shorter than the minimum length, and all of these blocks end with the same byte;
Step 300: among the six partitioning schemes with the largest block counts, the interface server selects the two with the largest amount of duplicate data and instructs the corresponding database servers to store the blocks;
Step 400: for a duplicate data block, the database server stores only a pointer to the identical block already stored; for a non-duplicate data block, it stores the entire block and its hash value.
Further, the database server determines whether a data block is a duplicate based on the hash value of the block.
Further, the hash value is computed with the MD5 algorithm.
Further, the hash value is computed with the SHA-1 algorithm.
Further, the hash value is computed with the SHA-256 algorithm.
The technical effect of the invention is to improve the detection rate of duplicate data and to reduce both the amount of data to be analyzed and the storage space occupied in big data analysis.
[Brief description of the drawings]
The drawing described here is provided for further understanding of the invention and forms part of this application, but does not unduly limit the invention. In the drawings:
Fig. 1 is the basic flow chart of the method of the invention.
[Detailed description of embodiments]
The invention is described in detail below with reference to the drawing and specific embodiments. The illustrative embodiments and descriptions are intended only to explain the invention and do not unduly limit it.
The system to which the method is applied comprises an interface server and multiple database servers. The interface server manages the ingestion and storage of data files, and the database servers actually store the data. To store massive data, the preferred embodiment uses 256 database servers; this is of course intended for large data storage systems. A small business that wants to reduce cost may merge several of these servers into one, thereby reducing the number of database servers.
On the basis of this system structure, the basic steps of the method are as follows:
Step 100: classify data blocks by their last byte, and assign to each class a database server that processes and stores the blocks of that class;
Step 200: the interface server sets a minimum data block length. A data file to be deduplicated that is shorter than the minimum length is sent directly to the database server corresponding to its class. Otherwise the data file is partitioned using different trailing bytes, on the following principle: except for the last block, every block is no shorter than the minimum length, and all of these blocks end with the same byte;
Step 300: among the six partitioning schemes with the largest block counts, the interface server selects the two with the largest amount of duplicate data and instructs the corresponding database servers to store the blocks;
Step 400: for a duplicate data block, the database server stores only a pointer to the identical block already stored; for a non-duplicate data block, it stores the entire block and its hash value.
Based on the above basic steps, the specific steps of the method are as follows:
(1) The interface server receives a data file that needs to be stored.
The interface server serves as the interface between the whole system and the outside world: it receives data files sent from outside and is responsible for storing them in the database servers through the subsequent steps. A typical example is a Web server on the Internet, which plays the role of the interface server by receiving and storing the data files that users upload. There may also be multiple interface servers; the invention does not limit their number.
(2) The interface server checks the length L of the data file. If L is less than the predefined minimum data block length MinBlockLength, it extracts the last byte B of the data file and goes to step 3. If L ≥ MinBlockLength, it goes to step 5.
All of the above lengths are measured in bytes. Since a byte has 8 bits, necessarily 0 ≤ B ≤ 255. The minimum data block length is the minimum block length used when the invention partitions a file; its value can be set by the administrator as circumstances require. In one preferred case, MinBlockLength = 1024 bytes.
(3) The 256 database servers in the system are numbered in advance and named Server(i), where 0 ≤ i ≤ 255. The interface server sends the data file to Server(B) and saves the relevant information of the data file.
The invention classifies data blocks by their trailing byte. Since a byte has 256 possible values, data blocks are divided into 256 classes and assigned correspondingly to the 256 database servers; the number of each database server equals the class (i.e. the trailing-byte value) of the data blocks it is responsible for.
Using 256 database servers is the preferred embodiment. Its implementation cost is relatively high, and it suits large-scale data storage systems. If cost must be reduced, database servers can be shared: several data block classes share one database server, which then carries several numbers. This does not affect how the method is carried out.
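Sharing servers between classes can be sketched as a lookup from the logical number Server(i) to a physical machine. The text does not say how classes are grouped when servers are merged; the modulo grouping below, and the names `server_numbers` and `physical_server_for`, are assumptions made for illustration.

```python
def server_numbers(physical_servers: int) -> dict:
    """Map each physical server to the list of Server(i) numbers it carries."""
    if not 1 <= physical_servers <= 256:
        raise ValueError("need between 1 and 256 physical servers")
    table = {p: [] for p in range(physical_servers)}
    for i in range(256):
        # Assumption: classes are spread round-robin over the machines.
        table[i % physical_servers].append(i)
    return table

def physical_server_for(block: bytes, physical_servers: int) -> int:
    """Pick the physical server for a block from its trailing byte."""
    return block[-1] % physical_servers

table = server_numbers(4)
print(table[0][:4])                             # first few numbers carried by machine 0
print(physical_server_for(b"\x00\xff", 256))    # one class per server
print(physical_server_for(b"\x00\xff", 4))      # merged deployment
```

With 256 physical servers the mapping is the identity, which matches the preferred embodiment; with fewer servers each machine carries several numbers.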
The relevant information of the data file includes its name, its size, the number of the database server, and so on, so that the interface server can later look up the data file.
(4) Server(B) computes the hash value of the data file and uses it to determine whether the file is already stored on that server. If it is, the file is duplicate data, and only a pointer to the stored data is kept for it. If it is not, Server(B) stores the data file and its hash value. The method then ends.
Because the database server keeps the hash value of every stored data file or data block, comparing the hash value of the data file with the existing hash values is enough to determine whether the file is duplicate data; if it is, the whole file need not be stored again.
The hash algorithm used by the invention may be any hash algorithm in the field, including but not limited to MD4, MD5, SHA-1, and SHA-256.
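The storage rule of step 4 can be sketched with an in-memory index keyed by hash value. The class name `DatabaseServer`, its attributes, and the return values are hypothetical; SHA-256 is used here as one of the algorithms the text lists, and a real server would persist this index rather than hold it in memory.

```python
import hashlib

class DatabaseServer:
    """In-memory stand-in for one database server (illustrative only)."""

    def __init__(self):
        self.blocks = {}     # hex digest -> stored bytes (non-duplicate data)
        self.pointers = []   # one digest per duplicate submission

    def is_duplicate(self, data: bytes) -> bool:
        """Check whether identical data is already stored, by hash value."""
        return hashlib.sha256(data).hexdigest() in self.blocks

    def store(self, data: bytes):
        """Store data, or only a pointer if identical data already exists."""
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.blocks:
            self.pointers.append(digest)   # pointer to the stored copy
            return ("pointer", digest)
        self.blocks[digest] = data         # entire data plus its hash value
        return ("stored", digest)

server = DatabaseServer()
print(server.store(b"hello")[0])   # first copy is stored in full
print(server.store(b"hello")[0])   # second copy becomes a pointer
```

This sketch treats equal hash values as equal data. The text does the same; handling of hash collisions is outside what it describes.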
(5) The interface server prepares to partition the data file into blocks, first setting the initial partitioning vector V = 0.
(6) The interface server scans toward the end of the file, starting from the MinBlockLength-th byte of the data file. When a scanned byte equals the partitioning vector, it records the position P1 of that byte, then resumes scanning from position P1 + MinBlockLength, finds and records the next byte position equal to the partitioning vector, and so on until the end of the data file. In other words, each scan starts at a distance of MinBlockLength from the last recorded position, until the end of the data file is reached.
(7) Using each position recorded in step 6 as the end of a data block, the data file is partitioned, which yields one or more data blocks. Let the number of data blocks obtained be K_V.
The data blocks obtained in step 7 fall into two classes. The first class consists of the data blocks that end with the partitioning vector V. The second class is the last data block of the data file, i.e. the data block that ends with B. The number of second-class blocks can only be 1 or 0, and the number of first-class blocks may also be 0, depending on the content of the data file.
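Steps 6 and 7 for a single partitioning vector can be sketched as one function. The name `partition` and the use of 0-based indices are choices made for this illustration; the text counts byte positions from 1, so its "MinBlockLength-th byte" is index `min_block_length - 1` here.

```python
def partition(data: bytes, v: int, min_block_length: int) -> list:
    """Split `data` into blocks that end with byte value `v`.

    Every block except possibly the last is at least `min_block_length`
    bytes long and ends with `v`; the last block is whatever remains.
    """
    blocks = []
    start = 0                              # first byte of the current block
    i = start + min_block_length - 1       # earliest index where it may end
    while i < len(data):
        if data[i] == v:
            blocks.append(data[start:i + 1])    # recorded position ends a block
            start = i + 1
            i = start + min_block_length - 1    # skip ahead by MinBlockLength
        else:
            i += 1
    if start < len(data):
        blocks.append(data[start:])        # second-class block, ends with B
    return blocks

blocks = partition(b"ab0cd0e0fg", ord("0"), 3)
print(blocks)        # blocks for V = ord("0") and MinBlockLength = 3
print(len(blocks))   # this is K_V
```

Concatenating the blocks in order always reproduces the original file, which is what later allows a file to be reassembled from a stored partitioning scheme.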
(8) The partitioning vector V is incremented by 1. If V ≤ 255, return to step 6; otherwise continue to step 9.
Steps 6 to 8 form a loop that scans the file once for each partitioning vector value from 0 to 255, producing multiple partitionings of the file whose block counts are K_0 to K_255. The loop form is used only for ease of description. In practice, as those skilled in the art will appreciate, a single scan of the data file is enough to accomplish all 256 iterations, which improves execution efficiency.
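The text states that one scan can replace the 256 iterations but does not say how. One possible way, sketched here as an assumption, is to keep for every byte value the earliest index at which it may next end a block; each byte of the file then only touches the state of its own value. The name `block_counts` is hypothetical, and indices are 0-based.

```python
def block_counts(data: bytes, min_block_length: int) -> list:
    """Return [K_0, ..., K_255] for `data` in a single pass over the file."""
    n = len(data)
    next_ok = [min_block_length - 1] * 256   # earliest index where V may cut
    cuts = [0] * 256                         # blocks ending with V so far
    last_cut = [-1] * 256                    # index of the latest cut for V
    for i, b in enumerate(data):
        if i >= next_ok[b]:
            cuts[b] += 1
            last_cut[b] = i
            next_ok[b] = i + min_block_length
    # Add the trailing block unless the file ends exactly on a cut for V.
    return [cuts[v] + (1 if last_cut[v] != n - 1 else 0) for v in range(256)]

counts = block_counts(b"ab0cd0e0fg", 3)
print(counts[ord("0")], counts[ord("c")], counts[ord("g")])
```

This sketch only produces the counts K_V, which is all that step 9 needs; the block boundaries for the six selected vectors can be recomputed afterwards, or the cut positions can be recorded per value during the same pass.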
(9) Sort K_0 to K_255 in descending order (if several K_V are equal, the one with the larger subscript comes first) and take the first (i.e. largest) 6 values, denoted K_V1, K_V2, K_V3, K_V4, K_V5, K_V6.
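The ordering rule of step 9, including its tie-break, can be sketched by sorting on the pair (count, subscript) in descending order. The function name `top_vectors` is hypothetical.

```python
def top_vectors(counts, how_many=6):
    """Return the partitioning vectors V1..V6 with the largest block counts.

    `counts[v]` is K_v. Ties are broken in favour of the larger subscript,
    as the text specifies.
    """
    order = sorted(range(len(counts)),
                   key=lambda v: (counts[v], v),
                   reverse=True)
    return order[:how_many]

counts = [1] * 256
counts[10] = 5
counts[20] = 5
counts[30] = 7
print(top_vectors(counts))
```

In this example vector 30 comes first, vector 20 precedes vector 10 because of the tie-break, and the remaining places go to the largest subscripts among the vectors with a count of 1.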
(10) Based on the partitioning results of the data file under the six partitioning vectors V1 to V6, each data block is sent, according to its trailing byte, to the corresponding database server for duplicate detection.
As stated above, if the trailing byte of a data block is X, the data block is sent to Server(X).
(11) Each database server computes the hash value of every data block it receives, uses the hash value to determine whether the block is duplicate data (i.e. whether it is identical to a data block already stored), and sends the result to the interface server.
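Steps 10 and 11 can be sketched together: each block is routed by its trailing byte X, and the hash value is looked up in what Server(X) already holds. The function name `detect_duplicates`, the shape of `stored_hashes`, and the use of SHA-256 are assumptions for illustration. Blocks are only compared against data stored earlier; whether repeats inside the same file also count is not stated in the text and is not modelled here.

```python
import hashlib

def detect_duplicates(blocks, stored_hashes):
    """Report, per block, the target server and whether it is a duplicate.

    `stored_hashes` maps a server number to the set of hex digests that
    Server(number) already holds (hypothetical stand-in for the servers).
    Returns a list of (server_number, digest, is_duplicate) tuples.
    """
    results = []
    for block in blocks:
        x = block[-1]                                # trailing byte selects Server(X)
        digest = hashlib.sha256(block).hexdigest()
        known = stored_hashes.get(x, set())
        results.append((x, digest, digest in known))
    return results

stored = {ord("0"): {hashlib.sha256(b"ab0").hexdigest()}}
for server, digest, dup in detect_duplicates([b"ab0", b"cd0", b"e0fg"], stored):
    print(server, digest[:8], dup)
```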
(12) Based on the results received in step 11, the interface server selects, from the six partitioning schemes, the two with the largest amount of duplicate data (choosing randomly among schemes with the same amount). It then notifies the corresponding database servers to store the blocks according to these two schemes. The interface server itself saves the relevant information of the data file, including its two partitioning schemes and the corresponding database servers.
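The selection in step 12 can be sketched as a sort with random tie-breaking. The text does not define the unit of "amount of duplicate data"; measuring it in bytes is an assumption, as are the function name `choose_schemes` and the input shape.

```python
import random

def choose_schemes(duplicate_bytes, keep=2, rng=random):
    """Pick the partitioning vectors with the most duplicate data.

    `duplicate_bytes` maps a partitioning vector to the number of bytes the
    database servers reported as duplicate under that scheme (assumed unit).
    """
    vectors = list(duplicate_bytes)
    rng.shuffle(vectors)    # random order first, so ties fall out randomly
    # list.sort is stable, also with reverse=True, so equal amounts keep
    # the shuffled order.
    vectors.sort(key=lambda v: duplicate_bytes[v], reverse=True)
    return vectors[:keep]

reported = {3: 100, 7: 900, 9: 500, 12: 50, 200: 0, 255: 10}
print(choose_schemes(reported))
```

Passing a seeded `random.Random` instance as `rng` makes the tie-break reproducible, which can help when testing.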
The database servers store blocks in the same way as in step 4: for a duplicate data block, only a pointer and the corresponding hash value are kept; for a non-duplicate data block, the block and its hash value are saved.
Two partitioning schemes are retained for the sake of redundant backup: if a database server used by one scheme fails, the original data file can still be reassembled from the other scheme.
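Reassembly with fallback between the two retained schemes can be sketched as follows. The text does not describe the metadata layout or the retrieval interface; the list of `(server, digest)` references per scheme, the `fetch` callback, and the exceptions it raises are all hypothetical.

```python
def reassemble(schemes, fetch):
    """Rebuild a file from the first scheme whose blocks can all be read.

    `schemes` is a list of block-reference lists, one per retained
    partitioning scheme, each reference being (server_number, digest).
    `fetch(server_number, digest)` returns the block bytes, or raises
    KeyError / ConnectionError if the block or server is unavailable.
    """
    for refs in schemes:
        try:
            return b"".join(fetch(server, digest) for server, digest in refs)
        except (KeyError, ConnectionError):
            continue    # this scheme is incomplete, try the next one
    raise RuntimeError("no partitioning scheme could be read completely")

# Toy deployment: Server(48) is down, so the first scheme cannot be read.
store = {
    (48, "h1"): b"ab0", (48, "h2"): b"cd0", (103, "h3"): b"e0fg",
    (100, "h4"): b"ab0cd", (103, "h5"): b"0e0fg",
}
down = {48}

def fetch(server, digest):
    if server in down:
        raise ConnectionError(f"Server({server}) unavailable")
    return store[(server, digest)]

schemes = [
    [(48, "h1"), (48, "h2"), (103, "h3")],
    [(100, "h4"), (103, "h5")],
]
print(reassemble(schemes, fetch))
```

Note that the last block of every scheme ends with the file's final byte B, so both schemes place their final block on Server(B); a failure of that server affects both schemes in this sketch.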
In addition, selecting two of six partitioning schemes in the above steps is only one preferred embodiment of the invention. Those skilled in the art may choose other numbers as circumstances require, for example selecting 2 schemes out of 3, or 3 out of 5.
The above is only a preferred embodiment of the invention. All equivalent changes or modifications made according to the structure, features, and principles described in the scope of this patent application fall within the scope of this patent application.
Claims (1)
1. A data deduplication method, applied to a system comprising an interface server and multiple database servers, wherein the interface server manages the ingestion and storage of data files and the database servers actually store the data, characterized in that the method comprises the following steps:
Step 100: classify data blocks by their last byte, and assign to each class a database server that processes and stores the blocks of that class;
Step 200: the interface server sets a minimum data block length; a data file to be deduplicated that is shorter than the minimum length is sent directly to the database server corresponding to its class; otherwise the data file is partitioned using different trailing bytes, on the following principle: except for the last block, every block is no shorter than the minimum length, and all of these blocks end with the same byte;
Step 300: among the six partitioning schemes with the largest block counts, the interface server selects the two with the largest amount of duplicate data and instructs the corresponding database servers to store the blocks;
Step 400: for a duplicate data block, the database server stores only a pointer to the identical block already stored; for a non-duplicate data block, it stores the entire block and its hash value;
(1) the interface server receives a data file that needs to be stored;
the interface server serves as the interface between the whole system and the outside world, receives data files sent from outside, and is responsible for storing them in the database servers through the subsequent steps;
(2) the interface server checks the length L of the data file; if L is less than the predefined minimum data block length MinBlockLength, it extracts the last byte B of the data file and goes to step 3; if L ≥ MinBlockLength, it goes to step 5;
(3) the 256 database servers in the system are numbered in advance and named Server(i), where 0 ≤ i ≤ 255; the interface server sends the data file to Server(B) and saves the relevant information of the data file;
data blocks are classified by their trailing byte; since a byte has 256 possible values, data blocks are divided into 256 classes and assigned correspondingly to the 256 database servers, the number of each database server being equal to the class, i.e. the trailing-byte value, of the data blocks it is responsible for;
the relevant information of the data file includes the name and size of the data file and the number of the database server, so that the interface server can later look up the data file;
(4) Server(B) computes the hash value of the data file and uses it to determine whether the data file is already stored on that server; if it is, the data file is duplicate data, and only a pointer to the stored data is kept for it; if it is not, Server(B) stores the data file and its hash value; the method then ends;
because the database server keeps the hash value of every stored data file or data block, comparing the hash value of the data file with the existing hash values is enough to determine whether the data file is duplicate data; if it is, the whole data file need not be stored again;
the hash algorithm includes but is not limited to MD4, MD5, SHA-1, and SHA-256;
(5) the interface server prepares to partition the data file into blocks, first setting the initial partitioning vector V = 0;
(6) the interface server scans toward the end of the file, starting from the MinBlockLength-th byte of the data file; when a scanned byte equals the partitioning vector, it records the position P1 of that byte, then resumes scanning from position P1 + MinBlockLength, finds and records the next byte position equal to the partitioning vector, and so on until the end of the data file;
(7) using each position recorded in step 6 as the end of a data block, the data file is partitioned, which yields one or more data blocks; let the number of data blocks obtained be K_V;
the data blocks obtained in step 7 fall into two classes: the first class consists of the data blocks that end with the partitioning vector V, and the second class is the last data block of the data file, which ends with B; the number of second-class blocks can only be 1 or 0, and the number of first-class blocks may also be 0, depending on the content of the data file;
(8) the partitioning vector V is incremented by 1; if V ≤ 255, return to step 6; otherwise continue to step 9;
steps 6 to 8 form a loop that scans the file once for each partitioning vector value from 0 to 255, producing multiple partitionings of the file whose block counts are K_0 to K_255;
(9) K_0 to K_255 are sorted in descending order; where several K_V are equal, the one with the larger subscript comes first; the first 6 values are taken, denoted K_V1, K_V2, K_V3, K_V4, K_V5, K_V6;
(10) based on the partitioning results of the data file under the six partitioning vectors V1 to V6, each data block is sent, according to its trailing byte, to the corresponding database server for duplicate detection;
if the trailing byte of a data block is X, the data block is sent to Server(X);
(11) each database server computes the hash value of every data block it receives, uses the hash value to determine whether the data block is duplicate data, and sends the result to the interface server;
(12) based on the results received in step 11, the interface server selects, from the six partitioning schemes, the two with the largest amount of duplicate data, choosing randomly among schemes with the same amount; it notifies the corresponding database servers to store the blocks according to these two schemes; the interface server itself saves the relevant information of the data file, including its two partitioning schemes and the corresponding database servers;
the database servers store blocks in the same way as in step 4: for a duplicate data block, only a pointer and the corresponding hash value are kept; for a non-duplicate data block, the block and its hash value are saved.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611207408.1A CN108241639B (en) | 2016-12-23 | 2016-12-23 | A kind of data duplicate removal method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611207408.1A CN108241639B (en) | 2016-12-23 | 2016-12-23 | A kind of data duplicate removal method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108241639A (en) | 2018-07-03 |
CN108241639B (en) | 2019-07-23 |
Family
ID=62704061
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201611207408.1A | Active | CN108241639B (en) | 2016-12-23 | 2016-12-23 | A kind of data duplicate removal method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108241639B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110968575B (en) * | 2018-09-30 | 2023-06-06 | 南京工程学院 | Deduplication method of big data processing system |
CN112162973A (en) * | 2020-09-17 | 2021-01-01 | 华中科技大学 | Fingerprint collision avoidance, deduplication and recovery method, storage medium and deduplication system |
CN112988684A (en) * | 2021-03-15 | 2021-06-18 | 浪潮云信息技术股份公司 | Method and system for extracting and de-duplicating electronic official document data based on Hash algorithm |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101706825A (en) * | 2009-12-10 | 2010-05-12 | 华中科技大学 | Replicated data deleting method based on file content types |
CN103023970A (en) * | 2012-11-15 | 2013-04-03 | 中国科学院计算机网络信息中心 | Method and system for storing mass data of Internet of Things (IoT) |
CN103049263A (en) * | 2012-12-12 | 2013-04-17 | 华中科技大学 | Document classification method based on similarity |
CN103873506A (en) * | 2012-12-12 | 2014-06-18 | 鸿富锦精密工业(深圳)有限公司 | Data block duplication removing system in storage cluster and method thereof |
CN104978151A (en) * | 2015-06-19 | 2015-10-14 | 浪潮电子信息产业股份有限公司 | Application awareness based data reconstruction method in repeated data deletion and storage system |
Also Published As
Publication number | Publication date |
---|---|
CN108241639A (en) | 2018-07-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | CB02 | Change of applicant information | Address after: 101399 No. 2 East Airport Road, Shunyi Airport Economic Core Area, Beijing (1st, 5th and 7th floors of Industrial Park 1A-4). Applicant after: Zhongke Star Map Co., Ltd. Address before: 101399 Building 1A-4, National Geographic Information Technology Industrial Park, Guomen Business District, Shunyi District, Beijing. Applicant before: Space Star Technology (Beijing) Co., Ltd. |
 | GR01 | Patent grant | |