CN105787008A - Data deduplication cleaning method for large data volume - Google Patents

Info

Publication number
CN105787008A
Authority
CN
China
Prior art keywords
data
data block
row
deduplication
duplicate removal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610098006.6A
Other languages
Chinese (zh)
Inventor
岳现国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur General Software Co Ltd
Original Assignee
Inspur General Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur General Software Co Ltd filed Critical Inspur General Software Co Ltd
Priority to CN201610098006.6A priority Critical patent/CN105787008A/en
Publication of CN105787008A publication Critical patent/CN105787008A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data deduplication cleaning method for large data volumes, implemented as follows: the data are decomposed into several data blocks; the deduplication of each data block is initialized as a task and loaded into a task pool for parallel execution; when a data block's task finishes, the block is compared with another data block that has also finished deduplication, duplicates are removed, and the two are merged into a new data block; this process is repeated until all data blocks have been merged into a single data block, at which point deduplication is complete. Compared with the prior art, the method greatly improves the efficiency of data cleaning by decomposing the data, executing in parallel, and using MD5 digests.

Description

A data deduplication cleaning method for large data volumes
Technical field
The present invention relates to the field of computer technology, and specifically to a data deduplication cleaning method for large data volumes.
Background art
Enterprise application systems such as ERP and CRM contain a great deal of redundant data, which not only increases the cost of data management but also seriously degrades the quality and efficiency of data query and analysis. An efficient deduplication method that supports large data volumes is therefore needed. Traditional deduplication methods generally find duplicates by looping over the data row by row and comparing cell by cell, which is very inefficient.
On this basis, a data deduplication cleaning method for large data volumes is now provided.
Summary of the invention
The technical task of the present invention is to address the above shortcomings by providing a data deduplication cleaning method for large data volumes.
A data deduplication cleaning method for large data volumes is implemented as follows:
The data are decomposed into several data blocks; the deduplication of the data in each data block is initialized as a task and loaded into a task pool for parallel execution;
After each data block's task finishes, the block is compared with another data block that has finished deduplication, duplicates are removed, and the two are merged into a new data block; this process is repeated until all data blocks are finally merged into one data block, which completes the deduplication.
The data are decomposed into several data blocks according to a decomposition strategy, which includes a data split strategy and a duplicate criterion (the basis on which rows are judged to be duplicates).
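The patent does not say how the split strategy divides the data, so the following is only a minimal sketch under assumptions: rows are split into contiguous blocks of near-equal size, and the strategy is a plain dictionary. The names `split_into_blocks`, `block_count`, and `duplicate_columns`, and the sample product rows, are hypothetical.

```python
from typing import Dict, List, Sequence

Row = Dict[str, str]


def split_into_blocks(rows: Sequence[Row], block_count: int) -> List[List[Row]]:
    """Split rows into at most block_count contiguous blocks of near-equal size."""
    if block_count < 1:
        raise ValueError("block_count must be at least 1")
    size, extra = divmod(len(rows), block_count)
    blocks: List[List[Row]] = []
    start = 0
    for i in range(block_count):
        # The first `extra` blocks take one additional row each.
        end = start + size + (1 if i < extra else 0)
        if end > start:
            blocks.append(list(rows[start:end]))
        start = end
    return blocks


# Hypothetical decomposition strategy: a split rule plus the columns
# whose values decide whether two rows are duplicates.
strategy = {"block_count": 5, "duplicate_columns": ["code", "name"]}
products = [{"code": str(i % 7), "name": "item"} for i in range(12)]
blocks = split_into_blocks(products, strategy["block_count"])
```

Contiguous splitting is the simplest choice; because duplicates may land in different blocks, the later merge between blocks is still required.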
Deduplication within a data block comprises the following steps:
Add a computed column, which is used to detect duplicates and holds an MD5 code generated from the column values;
Compute the value of the MD5 column for every row;
Sort the rows by the value of the MD5 column;
Remove the data rows whose MD5 column value is repeated.
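The four steps above can be sketched as follows. This is an illustration, not the patent's implementation: rows are modelled as dictionaries of strings, the computed column is named `_md5`, and column values are joined with a unit-separator character before hashing. All of these are assumptions.

```python
import hashlib
from typing import Dict, List, Sequence

Row = Dict[str, str]


def row_md5(row: Row, duplicate_columns: Sequence[str]) -> str:
    # Join with a separator so ("ab", "c") and ("a", "bc") hash differently.
    joined = "\x1f".join(row[c] for c in duplicate_columns)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def dedup_block(rows: Sequence[Row], duplicate_columns: Sequence[str]) -> List[Row]:
    # Steps 1-2: add the computed column and fill it for every row.
    keyed = [dict(row, _md5=row_md5(row, duplicate_columns)) for row in rows]
    # Step 3: sort by the MD5 column so equal digests become adjacent.
    keyed.sort(key=lambda r: r["_md5"])
    # Step 4: keep only the first row of each run of equal digests.
    result: List[Row] = []
    for row in keyed:
        if not result or result[-1]["_md5"] != row["_md5"]:
            result.append(row)
    return result
```

Sorting first means each row is compared only with its neighbour, instead of with every other row cell by cell.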
After a data block has finished deduplication, it is compared, deduplicated, and merged with another data block that has finished deduplication, as follows: after each data block finishes deduplication, the current task is deregistered and another data block that has finished deduplication is sought; the compare, deduplicate, and merge operation between the two data blocks is initialized as a new task and loaded into the task pool.
The merge proceeds as follows:
The MD5 values of the former data block and the latter data block are compared row by row;
Data rows in the latter data block whose MD5 value matches one in the former data block are deleted;
The two data blocks are merged into a new data block.
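One way to carry out this merge is a single pass over both blocks, assuming each block is already sorted by its MD5 column and free of internal duplicates. The patent only says the MD5 values are compared row by row, so the sorted two-pointer walk and the `_md5` field name are assumptions of this sketch.

```python
from typing import Dict, List

Row = Dict[str, str]


def merge_blocks(former: List[Row], latter: List[Row]) -> List[Row]:
    """Merge two blocks, each sorted by '_md5' and without internal duplicates."""
    merged: List[Row] = []
    i, j = 0, 0
    while i < len(former) and j < len(latter):
        a, b = former[i]["_md5"], latter[j]["_md5"]
        if a < b:
            merged.append(former[i])
            i += 1
        elif a > b:
            merged.append(latter[j])
            j += 1
        else:
            # Same digest in both blocks: keep the former's row, drop the latter's.
            merged.append(former[i])
            i += 1
            j += 1
    # At most one of these tails is non-empty.
    merged.extend(former[i:])
    merged.extend(latter[j:])
    return merged
```

The result is itself sorted and duplicate-free, so it can serve directly as an input to the next merge.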
Compared with the prior art, the data deduplication cleaning method for large data volumes of the present invention has the following advantages:
The invention is a deduplication method based on iterative parallel computation that uses the MD5 message digest technique, which greatly reduces the number of loop comparisons. It uses multithreaded parallel computing, so it can support deduplication cleaning of large data volumes while greatly improving the efficiency of data cleaning. By decomposing the data and executing in parallel, it offers high execution efficiency and high reliability; it is practical and easy to promote.
Brief description of the drawings
Figure 1 is a schematic diagram of the principle of data deduplication.
Figure 2 is a schematic diagram of the overall data deduplication flow.
Figure 3 is a schematic diagram of the deduplication flow within a data block.
Figure 4 is a schematic diagram of the deduplication and merge flow between data blocks.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and specific embodiments.
The data deduplication cleaning method for large data volumes of the present invention is implemented as follows:
S10: split the data into several data blocks according to the data deduplication rule;
S11: initialize the deduplication of each data block as a task, and load the tasks into the task pool for parallel execution;
S12: after each data block has finished deduplication, compare, deduplicate, and merge it with another data block that has finished deduplication; repeat step S12 until all data blocks are finally merged into one data block, which completes the deduplication.
In step S10, the data are split into several data blocks according to the data deduplication rule. More specifically, the data deduplication rule includes a data split strategy.
The data are decomposed into several data blocks according to a decomposition strategy, which includes a data split strategy and a duplicate criterion.
In step S11, the deduplication of each data block is initialized as a task and loaded into the task pool for parallel execution. More specifically, the processing of the data in a data block comprises the following steps:
Add a computed column, which is used to detect duplicates and holds an MD5 code generated from the column values;
Compute the value of the MD5 column for every row;
Sort the rows by the value of the MD5 column;
Remove the data rows whose MD5 column value is repeated.
In step S12, after each data block has finished deduplication, the current task is deregistered, and the block is compared, deduplicated, and merged with another data block that has finished deduplication. More specifically, the compare, deduplicate, and merge operation between the two data blocks is initialized as a new task and loaded into the task pool; the task is processed in the following steps:
The MD5 values of the former data block and the latter data block are compared row by row;
Data rows in the latter data block whose MD5 value matches one in the former data block are deleted;
The two data blocks are merged into a new data block;
Step S12 is repeated.
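The interplay of steps S11 and S12 with the task pool can be sketched with Python's standard `concurrent.futures` module. To keep the orchestration in focus, this sketch simplifies the data model: a row is a plain string, and a block is a dictionary keyed by MD5 digest, which stands in for the sorted MD5 column. The function names and the `workers` parameter are hypothetical.

```python
import hashlib
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Dict, List

Block = Dict[str, str]  # MD5 digest -> row text


def dedup_block(rows: List[str]) -> Block:
    block: Block = {}
    for row in rows:
        block.setdefault(hashlib.md5(row.encode("utf-8")).hexdigest(), row)
    return block


def merge_blocks(former: Block, latter: Block) -> Block:
    merged = dict(former)
    for digest, row in latter.items():
        merged.setdefault(digest, row)  # rows already in the former block are dropped
    return merged


def deduplicate(blocks: List[List[str]], workers: int = 4) -> List[str]:
    if not blocks:
        return []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # S11: one deduplication task per data block.
        pending = {pool.submit(dedup_block, rows) for rows in blocks}
        waiting = None  # a finished block that has no merge partner yet
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                block = future.result()
                if waiting is None:
                    waiting = block
                else:
                    # S12: pair two finished blocks and submit the merge as a new task.
                    pending.add(pool.submit(merge_blocks, waiting, block))
                    waiting = None
    return list(waiting.values())
```

Each merge turns two finished blocks into one, so the loop ends with exactly one block left. In CPython, threads mostly interleave CPU-bound work rather than run it in parallel, so a process pool may be a closer fit for real workloads.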
Embodiment: in an ERP system, duplicate data in the product data table need to be cleaned. A duplicate-handling rule is configured, including a data split strategy and a duplicate criterion.
As shown in Figure 1, the data are decomposed into 5 data blocks according to the decomposition strategy; the deduplication of each data block is initialized as a task and loaded into the task pool for parallel execution. After each data block's task finishes, the block is compared with another data block that has finished deduplication, duplicates are removed, and the two are merged into a new data block; the new data block repeats this process until a single data block remains.
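To make the five-block example concrete, the short simulation below lists one possible sequence of merges. It assumes tasks finish in queue order, which the patent does not require; in practice the pairing depends on which tasks finish first. The block names `B1` to `B5` and the function `merge_schedule` are hypothetical.

```python
from collections import deque
from typing import List, Sequence, Tuple


def merge_schedule(block_names: Sequence[str]) -> Tuple[List[Tuple[str, str]], str]:
    """Return the list of (former, latter) merges and the name of the final block."""
    if not block_names:
        raise ValueError("at least one block is required")
    queue = deque(block_names)
    schedule: List[Tuple[str, str]] = []
    while len(queue) > 1:
        former = queue.popleft()
        latter = queue.popleft()
        schedule.append((former, latter))
        # The merged block goes back into the queue and waits for a partner.
        queue.append(f"({former}+{latter})")
    return schedule, queue[0]


schedule, final_block = merge_schedule(["B1", "B2", "B3", "B4", "B5"])
```

Whatever the pairing order, five blocks always need four merge tasks, because each merge reduces the number of blocks by one.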
The implementation flow is shown in Figure 2.
Step S10: the data to be cleaned are decomposed into 5 data blocks according to the configured data split strategy.
Step S11: the deduplication of each data block is initialized as a task and loaded into the task pool for parallel execution. The execution flow of each task, shown in Figure 3, comprises the following steps:
1) add a computed column to hold an MD5 code generated from the values of the columns included in the duplicate criterion;
2) use the MD5 algorithm to generate the value of the computed column for every row;
3) sort the data by the value of the MD5 column;
4) remove the data rows whose MD5 column value is repeated.
Step S12: after each data block has finished deduplication, the current task is deregistered, and the block is compared, deduplicated, and merged with another data block that has finished deduplication. More specifically, the compare, deduplicate, and merge operation between the two data blocks is initialized as a new task and loaded into the task pool. As shown in Figure 4, the task is processed in the following steps:
The MD5 values of the former data block and the latter data block are compared row by row;
Data rows in the latter data block whose MD5 value matches one in the former data block are deleted;
The two data blocks are merged into a new data block;
Step S12 is repeated.
From the detailed description above, those skilled in the art can readily implement the invention. It should be understood, however, that the invention is not limited to the embodiments described above. On the basis of the disclosed embodiments, those skilled in the art can combine different technical features in any way to realize different technical solutions.
Everything other than the technical features described in the specification is known to those skilled in the art.

Claims (5)

1. A data deduplication cleaning method for large data volumes, characterized in that it is implemented as follows:
The data are decomposed into several data blocks; the deduplication of the data in each data block is initialized as a task and loaded into a task pool for parallel execution;
After each data block's task finishes, the block is compared with another data block that has finished deduplication, duplicates are removed, and the two are merged into a new data block; this process is repeated until all data blocks are finally merged into one data block, which completes the deduplication.
2. The data deduplication cleaning method for large data volumes according to claim 1, characterized in that the data are decomposed into several data blocks according to a decomposition strategy, which includes a data split strategy and a duplicate criterion.
3. The data deduplication cleaning method for large data volumes according to claim 1, characterized in that deduplication within a data block comprises the following steps:
Add a computed column, which is used to detect duplicates and holds an MD5 code generated from the column values;
Compute the value of the MD5 column for every row;
Sort the rows by the value of the MD5 column;
Remove the data rows whose MD5 column value is repeated.
4. The data deduplication cleaning method for large data volumes according to claim 3, characterized in that, after a data block has finished deduplication, it is compared, deduplicated, and merged with another data block that has finished deduplication, as follows: after each data block finishes deduplication, the current task is deregistered and another data block that has finished deduplication is sought; the compare, deduplicate, and merge operation between the two data blocks is initialized as a new task and loaded into the task pool.
5. The data deduplication cleaning method for large data volumes according to claim 4, characterized in that the merge proceeds as follows:
The MD5 values of the former data block and the latter data block are compared row by row;
Data rows in the latter data block whose MD5 value matches one in the former data block are deleted;
The two data blocks are merged into a new data block.
CN201610098006.6A 2016-02-23 2016-02-23 Data deduplication cleaning method for large data volume Pending CN105787008A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610098006.6A CN105787008A (en) 2016-02-23 2016-02-23 Data deduplication cleaning method for large data volume

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610098006.6A CN105787008A (en) 2016-02-23 2016-02-23 Data deduplication cleaning method for large data volume

Publications (1)

Publication Number Publication Date
CN105787008A true CN105787008A (en) 2016-07-20

Family

ID=56402716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610098006.6A Pending CN105787008A (en) 2016-02-23 2016-02-23 Data deduplication cleaning method for large data volume

Country Status (1)

Country Link
CN (1) CN105787008A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130339297A1 (en) * 2012-06-18 2013-12-19 Actifio, Inc. System and method for efficient database record replication using different replication strategies based on the database records
CN103699441A (en) * 2013-12-05 2014-04-02 深圳先进技术研究院 MapReduce report task execution method based on task granularity
CN103914522A (en) * 2014-03-20 2014-07-09 电子科技大学 Data block merging method applied to deleting duplicated data in cloud storage
CN105320773A (en) * 2015-11-03 2016-02-10 中国人民解放军理工大学 Distributed duplicated data deleting system and method based on Hadoop platform

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776951A (en) * 2016-12-02 2017-05-31 航天星图科技(北京)有限公司 One kind cleaning contrast storage method
CN106776951B (en) * 2016-12-02 2019-04-26 中科星图股份有限公司 A kind of cleaning comparison storage method
CN108319624A (en) * 2017-01-18 2018-07-24 腾讯科技(深圳)有限公司 Data load method and device
WO2018184418A1 (en) * 2017-04-06 2018-10-11 平安科技(深圳)有限公司 Data cleaning method, terminal and computer readable storage medium
CN110955637A (en) * 2019-11-27 2020-04-03 集奥聚合(北京)人工智能科技有限公司 Method for realizing ordering of oversized files based on low memory
CN112256685A (en) * 2020-10-30 2021-01-22 深圳物讯科技有限公司 Spreadsheet-based segmentation de-duplication import method and related product

Similar Documents

Publication Publication Date Title
CN105787008A (en) Data deduplication cleaning method for large data volume
Liu et al. An effective differential evolution algorithm for permutation flow shop scheduling problem
CN107301504A (en) Leapfroged based on mixing-the production and transport coordinated dispatching method of path relinking and system
CN103309975A (en) Duplicated data deleting method and apparatus
CN105373517A (en) Spark-based distributed matrix inversion parallel operation method
CN110060740A (en) A kind of nonredundancy gene set clustering method, system and electronic equipment
CN105550825B (en) Flexible factory job scheduling method based on MapReduce parallelization in cloud computing environment
CN105488692A (en) Method and device for computing number of online users
CN107016110B (en) OWLHorst rule distributed parallel reasoning algorithm combined with Spark platform
CN106354552B (en) Parallel computation method for allocating tasks and device
CN104090995A (en) Automatic generating method of rebar unit grids in ABAQUS tire model
Huang et al. Tabu search algorithm combined with global perturbation for packing arbitrary sized circles into a circular container
CN116595918B (en) Method, device, equipment and storage medium for verifying quick logical equivalence
CN107291843A (en) Hierarchical clustering improved method based on Distributed Computing Platform
CN103226466A (en) Efficient incremental data capturing method
CN107038260A (en) A kind of efficient parallel loading method for keeping titan Real-time Data Uniforms
CN104392124A (en) Three-stage flexible flow workshop scheduling method based on ST heuristic algorithm
CN111045920A (en) Workload-aware multi-branch software change-level defect prediction method
Smits et al. Scalable symbolic regression by continuous evolution with very small populations
CN113127461B (en) Data cleaning method and device, electronic equipment and storage medium
CN104050079A (en) Real-time system testing method based on time automata
CN103268384B (en) A kind of method of orderly extraction structure outline
CN105893145A (en) Task scheduling method and device based on genetic algorithm
CN106777262B (en) High-throughput sequencing data quality filtering method and filtering device
Sun et al. Using sampling methods to improve binding site predictions

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160720