CN108241639A - A kind of data duplicate removal method - Google Patents
- Publication number
- CN108241639A (application number CN201611207408.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- data block
- block
- database server
- duplicate removal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
- G06F16/24554—Unary operations; Data partitioning operations
- G06F16/24556—Aggregation; Duplicate elimination
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Computational Linguistics (AREA)
- Storage Device Security (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a data deduplication method. Data blocks are classified by their last byte, and each class is assigned a database server that processes and stores the blocks of that class. The interface server sets a minimum data block length. A data file to be deduplicated that is shorter than the minimum length is sent directly to the database server corresponding to it; otherwise the file is partitioned into blocks using different ending bytes. Among the six partitioning schemes that yield the most blocks, the interface server selects the two with the largest amount of duplicate data and instructs the corresponding database servers to store the blocks. For a duplicate data block, the database server stores only a pointer to the identical block already stored; for a non-duplicate data block, it stores the entire block together with its hash value.
Description
【Technical field】
The present invention belongs to the field of computers and databases and, in particular, relates to a data deduplication method.
【Background】
In recent years, the concept of big data has emerged in order to handle large amounts of information. Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within a tolerable time. It is a massive, fast-growing, and diverse information asset that requires new processing models to deliver stronger decision-making power, insight discovery, and process optimization.
Because of the sheer volume of data, people can hardly analyze it by manual effort alone. With the technological innovation represented by cloud computing, however, data that used to be difficult to collect and use is becoming easy to exploit, and through continuous innovation across industries, big data is gradually creating more value for humanity.
Although more and more computers are used for big data analysis and their performance keeps improving, they still struggle with massive data. The first step of big data analysis is therefore to detect and eliminate duplicate data. Deduplication reduces the consumption of storage space and network bandwidth on the one hand, and reduces the amount of data to be analyzed on the other.
A common data deduplication method in the prior art detects duplicate data by comparing the hash values of entire data files. This detection method is too simple, and its detection rate is low.
【Summary of the invention】
To solve the above problem of the prior art, the present invention proposes a new data deduplication method. The technical solution is as follows:
A data deduplication method, comprising the following steps:
Step 100: Classify data blocks by their last byte, and assign to each class of data blocks a database server that processes and stores the blocks of that class;
Step 200: The interface server sets a minimum data block length. A data file to be deduplicated that is shorter than the minimum length is sent directly to the database server corresponding to it; otherwise the data file is partitioned into blocks using different ending bytes, on the following principle: except for the last block, every block is at least the minimum length long and all blocks end with the same byte;
Step 300: Among the six partitioning schemes that yield the most blocks, the interface server selects the two with the largest amount of duplicate data and instructs the corresponding database servers to store the blocks;
Step 400: For a duplicate data block, the database server stores only a pointer to the identical data block already stored; for a non-duplicate data block, it stores the entire data block and its hash value.
Further, the database server determines whether a data block is a duplicate based on the hash value of the data block.
Further, the hash value is computed using the MD5 algorithm.
Further, the hash value is computed using the SHA-1 algorithm.
Further, the hash value is computed using the SHA-256 algorithm.
The beneficial effects of the invention are that the detection rate of duplicate data is improved, and that both the amount of data to be analyzed in big data analysis and the storage space consumed are reduced.
【Description of the drawings】
The drawings described here are provided for further understanding of the invention and form part of this application, but they do not unduly limit the invention. In the drawings:
Fig. 1 is the basic flowchart of the method of the present invention.
【Detailed description】
The invention is described in detail below with reference to the drawings and specific embodiments. The illustrative examples and explanations serve only to explain the invention and are not to be taken as unduly limiting it.
The system to which the method is applied comprises an interface server and multiple database servers. The interface server is responsible for ingesting data files into storage, and the database servers actually store the data. To store massive data, the preferred embodiment uses 256 database servers. This is, of course, intended for large data storage systems; a small business user may, to reduce cost, merge several of these servers into one and thereby reduce the number of database servers.
On the basis of the above system structure, the method follows the basic steps 100 to 400 set out in the summary above. In detail, the method proceeds as follows:
(1) The interface server receives a data file that needs to be ingested and stored.
The interface server, as the interface between the whole system and the outside world, receives data files sent from outside and is responsible for storing them into the database servers through the subsequent steps. A typical example is a web server on the Internet acting as the interface server: it receives data files uploaded by users and ingests them. There may also be multiple interface servers; the invention places no limit on their number.
(2) The interface server checks the length L of the data file. If L is less than the predefined minimum data block length MinBlockLength, it extracts the last byte B of the data file and goes to step 3. If L ≥ MinBlockLength, it goes to step 5.
All lengths above are in bytes. Since a byte has 8 bits, necessarily 0 ≤ B ≤ 255. The minimum data block length is the minimum block length used when the invention partitions a file; its value can be set by the administrator as circumstances require. In one preferred case, MinBlockLength = 1024 bytes.
(3) The 256 database servers in the system are numbered in advance and denoted Server(i), where 0 ≤ i ≤ 255. The interface server sends the data file to Server(B) and saves the relevant information about the data file.
The invention classifies data blocks by their ending byte. Since a byte can take 256 values, data blocks are divided into 256 classes, which are assigned to the 256 database servers accordingly; the number of each database server equals the class (that is, the ending-byte value) of the data blocks it is responsible for.
Using 256 database servers is the preferred embodiment of the invention; its implementation cost is high, and it suits large-scale data storage systems. If cost must be reduced, database servers can be shared, that is, multiple data block classes share one database server, which then carries multiple numbers. This does not affect how the method is carried out.
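The classification and the server sharing described above might look like this in Python. The text only says that several block classes may share one database server; the modulo assignment used here is an assumption chosen for brevity, and both function names are hypothetical.

```python
def block_class(block: bytes) -> int:
    """Class of a block: the value of its last byte (0..255), as in the text."""
    return block[-1]


def server_for_class(cls: int, num_servers: int = 256) -> int:
    """Map a block class to a physical server number.

    With 256 servers this is the identity mapping Server(B). With fewer
    servers, classes are folded together by modulo (an assumed policy; the
    text does not prescribe how classes are grouped).
    """
    if not 1 <= num_servers <= 256:
        raise ValueError("num_servers must be between 1 and 256")
    if not 0 <= cls <= 255:
        raise ValueError("a block class is a byte value between 0 and 255")
    return cls % num_servers
```

With 16 physical servers, for example, each server would carry 16 of the 256 logical numbers.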
The relevant information about the data file includes its name, its size, the number of the database server, and so on, so that the interface server can later look up the data file.
(4) Server(B) computes the hash value of the data file and uses it to determine whether the data file is already stored on the server. If it is, the data file is duplicate data, and only a pointer to the stored data is kept for it; if it is not, Server(B) stores the data file and its hash value. The method then ends.
Because the database server keeps the hash value of every data file or data block it stores, comparing the hash value of the data file with the existing hash values is enough to determine whether the data file is duplicate data. If it is, the entire data file need not be stored again.
The hash algorithm used by the invention may be any hash algorithm in the art, including but not limited to MD4, MD5, SHA-1, and SHA-256.
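As a rough sketch of the server-side behaviour in step 4, the following Python class keeps one copy of each distinct block, indexed by its hash value. The class name `BlockStore`, the in-memory dictionaries, and the use of the hex digest as the "pointer" are all assumptions made for illustration; the description does not specify how pointers are represented.

```python
import hashlib


class BlockStore:
    """In-memory stand-in for one database server (hypothetical)."""

    def __init__(self, algorithm: str = "sha256"):
        self.algorithm = algorithm  # e.g. "md5", "sha1", "sha256"
        self.blocks = {}            # hash value -> stored block
        self.pointers = {}          # hash value -> number of pointers to the block

    def digest(self, block: bytes) -> str:
        return hashlib.new(self.algorithm, block).hexdigest()

    def is_duplicate(self, block: bytes) -> bool:
        """Duplicate detection by comparing hash values only."""
        return self.digest(block) in self.blocks

    def store(self, block: bytes):
        """Store a block; return (pointer, was_duplicate).

        A duplicate adds only a pointer; a new block is stored together
        with its hash value.
        """
        key = self.digest(block)
        duplicate = key in self.blocks
        if not duplicate:
            self.blocks[key] = block
        self.pointers[key] = self.pointers.get(key, 0) + 1
        return key, duplicate
```

Because equality is decided from hash values alone, a collision-resistant algorithm such as SHA-256 is the more cautious choice among those listed.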
(5) The interface server prepares to partition the data file into blocks, first setting the initial partition vector V = 0.
(6) The interface server scans forward from the MinBlockLength-th byte of the data file. When it reaches a byte equal to the partition vector, it records the position P_1 of that byte, then resumes scanning from position P_1 + MinBlockLength, finds and records the next byte position equal to the partition vector, and so on until the end of the data file. In other words, each scan starts at a distance of MinBlockLength from the previously recorded position, until the end of the data file is reached.
(7) Using each position recorded in step 6 as the end of a data block, the data file is partitioned, yielding one or more data blocks; suppose K_V data blocks are obtained.
The data blocks obtained in step 7 may be of two kinds. The first kind are blocks that end with the partition vector V; the second kind is the last data block of the data file, that is, the block ending with B. There can be only 1 or 0 blocks of the second kind, and the number of blocks of the first kind may also be 0, depending on the content of the data file.
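Steps 6 and 7 for a single partition vector can be sketched in Python as below. The function name is hypothetical, and positions are 0-based here, so "the MinBlockLength-th byte" becomes index `min_len - 1`; this is an interpretation of the description, not code from the patent.

```python
def split_by_ending_byte(data: bytes, v: int, min_len: int) -> list:
    """Split data into blocks that end with byte value v (steps 6 and 7).

    Every block except possibly the last is at least min_len bytes long and
    ends with v; whatever remains after the last cut forms the final block.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    blocks = []
    start = 0             # first byte of the block currently being built
    scan = min_len - 1    # earliest index at which a cut is allowed
    needle = bytes([v])
    while scan < len(data):
        hit = data.find(needle, scan)
        if hit == -1:
            break
        blocks.append(data[start:hit + 1])  # the block ends on the matching byte
        start = hit + 1
        scan = hit + min_len                # next scan starts min_len after the hit
    if start < len(data):
        blocks.append(data[start:])         # block of the second kind, ending with B
    return blocks
```

For `b"abcXdefXghXij"` with `v = ord("X")` and a minimum length of 3, this gives four blocks, so K_V = 4; joining the blocks always reproduces the original file.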
(8) The partition vector V is incremented by 1. If V ≤ 255, return to step 6; otherwise continue with step 9.
Steps 6 to 8 form a loop that scans the file once for each partition vector value from 0 to 255, producing multiple ways of splitting the data into blocks, with block counts K_0 to K_255. The loop form is used here only for ease of description. In practice, as those skilled in the art will appreciate, a single scan of the data file can accomplish all 256 iterations, which improves execution efficiency.
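The text states that one scan can replace the 256 loops without saying how. One possible way, sketched here in Python as an assumption, is to keep for every byte value the earliest index at which a cut is allowed and update it whenever that value is seen. The function name and the 0-based indexing are illustrative choices.

```python
def count_blocks_all_vectors(data: bytes, min_len: int) -> list:
    """Return [K_0, ..., K_255], the block count for every partition vector,
    using a single pass over the data."""
    n = len(data)
    next_scan = [min_len - 1] * 256  # per value: earliest index where a cut may occur
    last_cut = [0] * 256             # per value: index just past the last cut
    counts = [0] * 256
    for i, b in enumerate(data):
        if i >= next_scan[b]:
            counts[b] += 1               # a block ending with byte value b closes here
            last_cut[b] = i + 1
            next_scan[b] = i + min_len   # same restart rule as in step 6
    for v in range(256):
        if last_cut[v] < n:
            counts[v] += 1               # trailing remainder forms the final block
    return counts
```

The state is three arrays of 256 integers, so memory use does not depend on the file size, and the counts are the same as those obtained by scanning separately for each V as in step 6.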
(9) Sort K_0 to K_255 in descending order (among equal K_V, the one with the larger subscript comes first) and take the first six (that is, the largest) values, denoted K_V1, K_V2, K_V3, K_V4, K_V5, K_V6.
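The ranking rule of step 9, including its tie-break, fits in a few lines of Python. The function name and the `how_many` parameter are hypothetical; the input is assumed to be a list of the 256 block counts K_0 to K_255.

```python
def top_partition_vectors(counts: list, how_many: int = 6) -> list:
    """Return the partition vectors V1..V6 with the largest block counts.

    Sorting on the pair (K_V, V) in descending order puts larger counts
    first and, among equal counts, the larger subscript first.
    """
    ranked = sorted(range(len(counts)), key=lambda v: (counts[v], v), reverse=True)
    return ranked[:how_many]
```

If every count is 1 except K_10 = K_200 = 5 and K_3 = 4, the result is V = 200, 10, 3, 255, 254, 253.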
(10) For the partitioning results produced by the six partition vectors V1 to V6, each data block is sent, according to its ending byte, to the corresponding database server for duplicate detection.
As noted above, if the ending byte of a data block is X, the block is sent to Server(X).
(11) For each data block received, each database server computes its hash value, uses it to determine whether the block is duplicate data (that is, identical to a data block already stored), and sends the result to the interface server.
(12) Based on the results received in step 11, the interface server selects, from the six partitioning schemes, the two with the largest amount of duplicate data (choosing randomly among schemes with equal amounts). It notifies the corresponding database servers to store the blocks according to these two schemes. The interface server itself saves the relevant information about the data file, including its two partitioning schemes and the corresponding database servers.
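The selection in step 12 could be sketched as follows. Measuring the "amount of duplicate data" as the total number of bytes in the blocks reported as duplicates is an assumption, since the text does not define the measure; the function names and the optional `rng` parameter are likewise hypothetical.

```python
import random


def duplicate_amount(blocks: list, is_duplicate: list) -> int:
    """Total size in bytes of the blocks that were reported as duplicates."""
    return sum(len(block) for block, dup in zip(blocks, is_duplicate) if dup)


def pick_schemes(dup_amounts: dict, how_many: int = 2, rng=None) -> list:
    """Choose the partition vectors with the largest duplicate amount.

    dup_amounts maps each candidate partition vector to its duplicate
    amount. Ties are broken randomly: the candidates are shuffled first,
    and the stable sort keeps that shuffled order among equal amounts.
    """
    rng = rng or random.Random()
    vectors = list(dup_amounts)
    rng.shuffle(vectors)
    vectors.sort(key=lambda v: dup_amounts[v], reverse=True)
    return vectors[:how_many]
```

Passing a seeded `random.Random` makes the tie-break reproducible, which is convenient when testing.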
The database servers store data in the same way as in step 4: for a duplicate data block, only a pointer and the corresponding hash value are kept; for a non-duplicate data block, the block and its hash value are saved.
Two partitioning schemes are retained for redundancy: if a database server used by one scheme fails, the original data file can still be reassembled from the other scheme.
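A possible reading of this redundancy on the retrieval side is sketched below in Python. The description does not cover retrieval, so everything here is assumed: a scheme is represented as an ordered list of (server number, pointer) pairs, and `fetch` is a hypothetical callback that raises `ConnectionError` when a server is unavailable.

```python
def restore(schemes: list, fetch) -> bytes:
    """Reassemble a file from the first partitioning scheme that can be read in full.

    schemes: list of schemes, each an ordered list of (server, pointer) pairs.
    fetch:   callable (server, pointer) -> bytes; assumed to raise
             ConnectionError if the server has failed.
    """
    for blocks in schemes:
        try:
            return b"".join(fetch(server, pointer) for server, pointer in blocks)
        except ConnectionError:
            continue  # this scheme touches a failed server; try the next one
    raise RuntimeError("no partitioning scheme could be read completely")
```

The fallback only helps when the second scheme avoids the failed server. Both schemes place the final block of the file on the same server, Server(B), whenever such a block exists, so a failure of that server affects both.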
Furthermore, selecting two out of six partitioning schemes in the steps above is only one preferred implementation; those skilled in the art may choose other numbers as circumstances require, for example 2 out of 3 or 3 out of 5.
The above is only a preferred embodiment of the invention; all equivalent changes or modifications made according to the structures, features, and principles described within the scope of this patent application fall within its scope.
Claims (5)
1. A data deduplication method, characterized in that the method comprises the following steps:
Step 100: Classify data blocks by their last byte, and assign to each class of data blocks a database server that processes and stores the blocks of that class;
Step 200: The interface server sets a minimum data block length. A data file to be deduplicated that is shorter than the minimum length is sent directly to the database server corresponding to it; otherwise the data file is partitioned into blocks using different ending bytes, on the following principle: except for the last block, every block is at least the minimum length long, and all blocks end with the same byte;
Step 300: Among the six partitioning schemes that yield the most blocks, the interface server selects the two with the largest amount of duplicate data and instructs the corresponding database servers to store the blocks;
Step 400: For a duplicate data block, the database server stores only a pointer to the identical data block already stored; for a non-duplicate data block, it stores the entire data block and its hash value.
2. The data deduplication method according to claim 1, characterized in that the database server determines whether a data block is a duplicate data block based on the hash value of the data block.
3. The data deduplication method according to claim 1 or 2, characterized in that the hash value is computed using the MD5 algorithm.
4. The data deduplication method according to claim 1 or 2, characterized in that the hash value is computed using the SHA-1 algorithm.
5. The data deduplication method according to claim 1 or 2, characterized in that the hash value is computed using the SHA-256 algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611207408.1A CN108241639B (en) | 2016-12-23 | 2016-12-23 | A kind of data duplicate removal method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108241639A true CN108241639A (en) | 2018-07-03 |
CN108241639B CN108241639B (en) | 2019-07-23 |
Family
ID=62704061
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611207408.1A Active CN108241639B (en) | 2016-12-23 | 2016-12-23 | A kind of data duplicate removal method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108241639B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101706825A (en) * | 2009-12-10 | 2010-05-12 | 华中科技大学 | Replicated data deleting method based on file content types |
CN103023970A (en) * | 2012-11-15 | 2013-04-03 | 中国科学院计算机网络信息中心 | Method and system for storing mass data of Internet of Things (IoT) |
CN103049263A (en) * | 2012-12-12 | 2013-04-17 | 华中科技大学 | Document classification method based on similarity |
CN103873506A (en) * | 2012-12-12 | 2014-06-18 | 鸿富锦精密工业(深圳)有限公司 | Data block duplication removing system in storage cluster and method thereof |
CN104978151A (en) * | 2015-06-19 | 2015-10-14 | 浪潮电子信息产业股份有限公司 | Application awareness based data reconstruction method in repeated data deletion and storage system |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110968575A (en) * | 2018-09-30 | 2020-04-07 | 南京工程学院 | Duplication eliminating method for big data processing system |
CN110968575B (en) * | 2018-09-30 | 2023-06-06 | 南京工程学院 | Deduplication method of big data processing system |
CN112162973A (en) * | 2020-09-17 | 2021-01-01 | 华中科技大学 | Fingerprint collision avoidance, deduplication and recovery method, storage medium and deduplication system |
CN112988684A (en) * | 2021-03-15 | 2021-06-18 | 浪潮云信息技术股份公司 | Method and system for extracting and de-duplicating electronic official document data based on Hash algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN108241639B (en) | 2019-07-23 |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: 101399 No. 2 East Airport Road, Shunyi Airport Economic Core Area, Beijing (1st, 5th and 7th floors of Industrial Park 1A-4); Applicant after: Zhongke Star Map Co., Ltd.; Address before: 101399 Building 1A-4, National Geographic Information Technology Industrial Park, Guomen Business District, Shunyi District, Beijing; Applicant before: Space Star Technology (Beijing) Co., Ltd. |
GR01 | Patent grant | |