CN105787008A - Data deduplication cleaning method for large data volume - Google Patents
- Publication number
- CN105787008A (application CN201610098006.6A)
- Authority
- CN
- China
- Prior art keywords
- data
- data block
- row
- deduplication
- duplicate removal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a data deduplication cleaning method for large data volumes, whose concrete implementation process is as follows: the data is decomposed into several data blocks; the deduplication of each data block is initialized as a task and loaded into a task pool for parallel execution; after a data block's task finishes, the block is compared against another data block that has finished deduplication, duplicates between them are removed, and the two are merged into a new data block; this process is repeated until all data blocks have been merged into a single block, at which point the deduplication is complete. Compared with the prior art, the method greatly improves the efficiency of data cleaning through data decomposition, parallel execution, and MD5-based duplicate detection.
Description
Technical field
The present invention relates to the field of computer technology, and specifically to a data deduplication cleaning method for large data volumes.
Background technology
Enterprise application systems such as ERP and CRM contain a great deal of redundant data, which not only increases the cost of data management but also severely degrades the quality and efficiency of data query and analysis. An efficient deduplication method that supports big data is therefore needed. Traditional deduplication methods generally search for duplicate data by looping row by row and comparing cell by cell, which is very inefficient.
On this basis, a data deduplication cleaning method for large data volumes is provided here.
Summary of the invention
The technical task of the present invention is to address the above shortcomings by providing a data deduplication cleaning method for large data volumes.
The implementation process of the data deduplication cleaning method for large data volumes is as follows:
The data is decomposed into several data blocks; the deduplication of each data block is initialized as a task and loaded into a task pool for parallel execution;
After a data block's task finishes, the block is compared and deduplicated against another data block that has finished deduplication, and the two are merged into a new data block; this process is repeated until all data blocks have finally merged into a single block, at which point the deduplication is complete.
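The iterative flow above (clean every block in parallel, then repeatedly merge pairs of cleaned blocks until one remains) can be sketched in Python. This is an illustrative sketch, not the patent's implementation: rows are simplified to hashable values, the in-block and merge steps are stand-ins for the MD5-based procedures, and blocks are paired round by round, whereas the patent merges blocks opportunistically as their tasks finish.

```python
from concurrent.futures import ThreadPoolExecutor

def clean_block(rows):
    # In-block deduplication (stand-in for the MD5-based steps;
    # keeps the first occurrence of each row).
    return list(dict.fromkeys(rows))

def merge_pair(former, latter):
    # Drop rows of the latter block already present in the former,
    # then concatenate the two blocks.
    seen = set(former)
    return former + [r for r in latter if r not in seen]

def dedup_all(blocks):
    """Clean all blocks as parallel tasks, then merge pairs of
    cleaned blocks until a single block remains."""
    with ThreadPoolExecutor() as pool:
        cleaned = list(pool.map(clean_block, blocks))
        while len(cleaned) > 1:
            pairs = list(zip(cleaned[0::2], cleaned[1::2]))
            odd_one_out = cleaned[2 * len(pairs):]  # unpaired last block
            futures = [pool.submit(merge_pair, a, b) for a, b in pairs]
            cleaned = [f.result() for f in futures] + odd_one_out
        return cleaned[0]
```

Each merge halves the number of blocks, so a full run performs O(log n) merge rounds over n blocks.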
The data is decomposed into data blocks according to a decomposition strategy, which comprises a data splitting strategy and a duplicate criterion (the columns by which two rows are judged to be duplicates).
Deduplication within a data block comprises the following steps:
Add a computed column, which is used to detect duplicates and holds an MD5 code generated from the values of the columns in the duplicate criterion;
Compute the value of the MD5 column for every row;
Sort the rows by the value of the MD5 column;
Remove the rows whose MD5 value repeats.
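A minimal Python sketch of the four in-block steps, under the assumption that a data row is a dict and the duplicate criterion is a list of column names (both representations are illustrative, not specified by the patent):

```python
import hashlib

def dedup_block(rows, key_columns):
    """Remove duplicate rows from one data block using the steps above:
    add an MD5 column, fill it for every row, sort by it, drop repeats."""
    # Steps 1-2: compute each row's MD5 code from the duplicate
    # criterion's columns (values joined with a separator).
    keyed = [
        (hashlib.md5(
            "|".join(str(row[c]) for c in key_columns).encode("utf-8")
         ).hexdigest(), row)
        for row in rows
    ]
    # Step 3: sort by the MD5 column so duplicate rows become adjacent.
    keyed.sort(key=lambda pair: pair[0])
    # Step 4: keep only the first row of each run of equal MD5 codes.
    result, last = [], None
    for digest, row in keyed:
        if digest != last:
            result.append(row)
            last = digest
    return result
```

Sorting by the digest is what removes the need for pairwise row comparison: after the sort, duplicates can be eliminated in a single linear pass.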
The process by which a data block that has finished deduplication is compared and merged with another deduplicated block is as follows: after each data block finishes its deduplication, its current task is cancelled, another data block that has finished deduplication is located, and the compare-and-merge deduplication of the two blocks is initialized as a new task and loaded into the task pool.
The detailed process of the merge deduplication is:
Compare, row by row, the MD5 value of each row of the former data block with those of the latter data block;
Delete from the latter data block the rows whose MD5 value matches a row in the former data block;
Merge the two data blocks into a new data block.
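The merge step can be sketched as follows. For illustration, a hash set of the former block's MD5 codes replaces the patent's row-by-row comparison; the result is the same, but each lookup becomes constant time. The dict-row representation and column-name parameter are assumptions, as before.

```python
import hashlib

def row_md5(row, key_columns):
    """MD5 code of a row over the duplicate criterion's columns."""
    joined = "|".join(str(row[c]) for c in key_columns)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def merge_blocks(former, latter, key_columns):
    """Delete from the latter block every row whose MD5 code already
    appears in the former block, then concatenate the two blocks."""
    seen = {row_md5(r, key_columns) for r in former}
    survivors = [r for r in latter if row_md5(r, key_columns) not in seen]
    return former + survivors
```

Because both input blocks have already been deduplicated internally, the merged block is itself duplicate-free and can immediately take part in the next merge round.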
Compared with the prior art, the data deduplication cleaning method for large data volumes of the present invention has the following advantages:
The present invention is a deduplication method based on iterative parallel computation and the MD5 data-digest technique, which greatly reduces the number of loop comparisons. By adopting multithreaded parallel computing it can support deduplication cleaning of large data volumes while greatly improving cleaning efficiency. Through data decomposition and parallel execution it achieves high execution efficiency and high reliability, is practical, and is easy to popularize.
Brief description of the drawings
Figure 1 is a schematic diagram of the deduplication principle.
Figure 2 is a schematic diagram of the overall deduplication flow.
Figure 3 is a schematic diagram of the deduplication flow within a data block.
Figure 4 is a schematic diagram of the merge-deduplication flow between data blocks.
Detailed description of the invention
The invention is further described below with reference to the drawings and specific embodiments.
The implementation process of the data deduplication cleaning method for large data volumes of the present invention is:
S10: the data is split into several data blocks according to the deduplication rule;
S11: the deduplication of each data block is initialized as a task, and the tasks are loaded into a task pool for parallel execution;
S12: after a data block finishes deduplication, it is compared, deduplicated and merged with another data block that has finished deduplication; step S12 is repeated until all data blocks have finally merged into a single block, completing the deduplication.
In step S10, the data is split into several data blocks according to the deduplication rule. To elaborate, the rule comprises a data splitting strategy and a duplicate criterion, and the data is decomposed into blocks according to the splitting strategy.
In step S11, the deduplication of each data block is initialized as a task, and the tasks are loaded into the task pool for parallel execution. To elaborate, the processing within a data block comprises the following steps:
Add a computed column, which is used to detect duplicates and holds an MD5 code generated from the values of the columns in the duplicate criterion;
Compute the value of the MD5 column for every row;
Sort the rows by the value of the MD5 column;
Remove the rows whose MD5 value repeats.
In step S12, after a data block finishes deduplication, its current task is cancelled and the block is compared, deduplicated and merged with another data block that has finished deduplication. To elaborate, the compare-and-merge deduplication of the two blocks is initialized as a new task and loaded into the task pool; the processing of this task comprises the following steps:
Compare, row by row, the MD5 value of each row of the former data block with those of the latter data block;
Delete from the latter data block the rows whose MD5 value matches a row in the former data block;
Merge the two data blocks into a new data block;
Repeat step S12.
Embodiment: in an ERP system, the duplicate data in the product table needs to be cleaned. A duplicate-processing rule, comprising a data splitting strategy and a duplicate criterion, is configured.
As shown in Figure 1, the data is decomposed into 5 data blocks according to the decomposition strategy; the deduplication of each block is initialized as a task and loaded into the task pool for parallel execution. After a block's task finishes, the block is compared, deduplicated and merged with another block that has finished deduplication into a new data block; the new block repeats this process until a single data block finally remains.
The implementation flow is shown in Figure 2.
In step S10, the data to be cleaned is decomposed into 5 data blocks according to the configured splitting strategy.
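The patent leaves the splitting strategy configurable; an even contiguous-chunk split is one simple possibility, sketched here as an assumption rather than the patent's prescribed strategy:

```python
def split_into_blocks(rows, n_blocks):
    """Split the source rows into at most n_blocks contiguous chunks
    of near-equal size (one possible splitting strategy)."""
    # Ceiling division; guard against a zero chunk size for empty input.
    size = max(1, (len(rows) + n_blocks - 1) // n_blocks)
    return [rows[i:i + size] for i in range(0, len(rows), size)]
```

Any strategy works as long as every source row lands in exactly one block; rows duplicated across blocks are caught later by the merge step.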
In step S11, the deduplication of each data block is initialized as a task and loaded into the task pool for parallel execution. As shown in Figure 3, the execution flow of each task comprises the following steps:
1) Add a computed column, used to generate an MD5 code from the values of the columns included in the duplicate criterion;
2) Use the MD5 algorithm to generate the value of the computed column for every row;
3) Sort the data by the value of the MD5 column;
4) Remove the rows whose MD5 value repeats.
In step S12, after a data block finishes deduplication, its current task is cancelled and the block is compared, deduplicated and merged with another block that has finished deduplication. To elaborate, the compare-and-merge deduplication of the two blocks is initialized as a new task and loaded into the task pool; as shown in Figure 4, the processing of this task comprises the following steps:
Compare, row by row, the MD5 value of each row of the former data block with those of the latter data block;
Delete from the latter data block the rows whose MD5 value matches a row in the former data block;
Merge the two data blocks into a new data block;
Repeat step S12.
Through the above detailed description, those skilled in the art can readily implement the present invention. It should be understood, however, that the invention is not limited to the embodiments described above; on the basis of the disclosed embodiments, those skilled in the art may combine different technical features arbitrarily to obtain different technical schemes.
Technical features not described in this specification are known to those skilled in the art.
Claims (5)
1. A data deduplication cleaning method for large data volumes, characterized in that its implementation process is:
the data is decomposed into several data blocks; the deduplication of each data block is initialized as a task and loaded into a task pool for parallel execution;
after a data block's task finishes, the block is compared and deduplicated against another data block that has finished deduplication, and the two are merged into a new data block; this process is repeated until all data blocks have finally merged into a single block, at which point the deduplication is complete.
2. The data deduplication cleaning method for large data volumes according to claim 1, characterized in that the data is decomposed into data blocks according to a decomposition strategy, which comprises a data splitting strategy and a duplicate criterion.
3. The data deduplication cleaning method for large data volumes according to claim 1, characterized in that deduplication within a data block comprises the following steps:
add a computed column, which is used to detect duplicates and holds an MD5 code generated from the values of the columns in the duplicate criterion;
compute the value of the MD5 column for every row;
sort the rows by the value of the MD5 column;
remove the rows whose MD5 value repeats.
4. The data deduplication cleaning method for large data volumes according to claim 3, characterized in that the process by which a data block that has finished deduplication is compared and merged with another deduplicated block is: after each data block finishes deduplication, its current task is cancelled, another data block that has finished deduplication is located, and the compare-and-merge deduplication of the two blocks is initialized as a new task and loaded into the task pool.
5. The data deduplication cleaning method for large data volumes according to claim 4, characterized in that the detailed process of the merge deduplication is:
compare, row by row, the MD5 value of each row of the former data block with those of the latter data block;
delete from the latter data block the rows whose MD5 value matches a row in the former data block;
merge the two data blocks into a new data block.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610098006.6A CN105787008A (en) | 2016-02-23 | 2016-02-23 | Data deduplication cleaning method for large data volume |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105787008A true CN105787008A (en) | 2016-07-20 |
Family
ID=56402716
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610098006.6A Pending CN105787008A (en) | 2016-02-23 | 2016-02-23 | Data deduplication cleaning method for large data volume |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105787008A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130339297A1 (en) * | 2012-06-18 | 2013-12-19 | Actifio, Inc. | System and method for efficient database record replication using different replication strategies based on the database records |
CN103699441A (en) * | 2013-12-05 | 2014-04-02 | 深圳先进技术研究院 | MapReduce report task execution method based on task granularity |
CN103914522A (en) * | 2014-03-20 | 2014-07-09 | 电子科技大学 | Data block merging method applied to deleting duplicated data in cloud storage |
CN105320773A (en) * | 2015-11-03 | 2016-02-10 | 中国人民解放军理工大学 | Distributed duplicated data deleting system and method based on Hadoop platform |
- 2016-02-23: application CN201610098006.6A filed in CN; published as CN105787008A (status: Pending)
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106776951A (en) * | 2016-12-02 | 2017-05-31 | 航天星图科技(北京)有限公司 | One kind cleaning contrast storage method |
CN106776951B (en) * | 2016-12-02 | 2019-04-26 | 中科星图股份有限公司 | A kind of cleaning comparison storage method |
CN108319624A (en) * | 2017-01-18 | 2018-07-24 | 腾讯科技(深圳)有限公司 | Data load method and device |
WO2018184418A1 (en) * | 2017-04-06 | 2018-10-11 | 平安科技(深圳)有限公司 | Data cleaning method, terminal and computer readable storage medium |
CN110955637A (en) * | 2019-11-27 | 2020-04-03 | 集奥聚合(北京)人工智能科技有限公司 | Method for realizing ordering of oversized files based on low memory |
CN112256685A (en) * | 2020-10-30 | 2021-01-22 | 深圳物讯科技有限公司 | Spreadsheet-based segmentation de-duplication import method and related product |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105787008A (en) | Data deduplication cleaning method for large data volume | |
Liu et al. | An effective differential evolution algorithm for permutation flow shop scheduling problem | |
CN107301504A (en) | Leapfroged based on mixing-the production and transport coordinated dispatching method of path relinking and system | |
CN103309975A (en) | Duplicated data deleting method and apparatus | |
CN105373517A (en) | Spark-based distributed matrix inversion parallel operation method | |
CN110060740A (en) | A kind of nonredundancy gene set clustering method, system and electronic equipment | |
CN105550825B (en) | Flexible factory job scheduling method based on MapReduce parallelization in cloud computing environment | |
CN105488692A (en) | Method and device for computing number of online users | |
CN107016110B (en) | OWLHorst rule distributed parallel reasoning algorithm combined with Spark platform | |
CN106354552B (en) | Parallel computation method for allocating tasks and device | |
CN104090995A (en) | Automatic generating method of rebar unit grids in ABAQUS tire model | |
Huang et al. | Tabu search algorithm combined with global perturbation for packing arbitrary sized circles into a circular container | |
CN116595918B (en) | Method, device, equipment and storage medium for verifying quick logical equivalence | |
CN107291843A (en) | Hierarchical clustering improved method based on Distributed Computing Platform | |
CN103226466A (en) | Efficient incremental data capturing method | |
CN107038260A (en) | A kind of efficient parallel loading method for keeping titan Real-time Data Uniforms | |
CN104392124A (en) | Three-stage flexible flow workshop scheduling method based on ST heuristic algorithm | |
CN111045920A (en) | Workload-aware multi-branch software change-level defect prediction method | |
Smits et al. | Scalable symbolic regression by continuous evolution with very small populations | |
CN113127461B (en) | Data cleaning method and device, electronic equipment and storage medium | |
CN104050079A (en) | Real-time system testing method based on time automata | |
CN103268384B (en) | A kind of method of orderly extraction structure outline | |
CN105893145A (en) | Task scheduling method and device based on genetic algorithm | |
CN106777262B (en) | High-throughput sequencing data quality filtering method and filtering device | |
Sun et al. | Using sampling methods to improve binding site predictions |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20160720 |