CN105787008A - Data deduplication cleaning method for large data volume - Google Patents
- Publication number
- CN105787008A (application CN201610098006.6A)
- Authority
- CN
- China
- Prior art keywords
- data
- data block
- row
- deduplication
- duplicate removal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a data deduplication cleaning method for large data volumes, whose concrete implementation process is as follows: the data is decomposed into several data blocks; the deduplication of each data block is initialized as a task and loaded into a task pool for parallel execution; after a data block's task finishes, the block is compared against another data block that has finished deduplication, duplicates between them are removed, and the two are merged into a new data block; this process is repeated until all data blocks have been merged into a single block, at which point the deduplication is complete. Compared with the prior art, the method greatly improves the efficiency of data cleaning through data decomposition, parallel execution, and MD5-based duplicate detection.
Description
Technical field
The present invention relates to the field of computer technology, and specifically to a data deduplication cleaning method for large data volumes.
Background technology
Enterprise application systems such as ERP and CRM contain a great deal of redundant data, which not only increases the cost of data management but also severely degrades the quality and efficiency of data query and analysis. An efficient deduplication method that supports big data is therefore needed. Traditional deduplication methods generally search for duplicate data by looping row by row and comparing cell by cell, which is very inefficient.
On this basis, a data deduplication cleaning method for large data volumes is provided here.
Summary of the invention
The technical task of the present invention is to address the above shortcomings by providing a data deduplication cleaning method for large data volumes.
The implementation process of the data deduplication cleaning method for large data volumes is as follows:
The data is decomposed into several data blocks; the deduplication of each data block is initialized as a task and loaded into a task pool for parallel execution;
After a data block's task finishes, the block is compared and deduplicated against another data block that has finished deduplication, and the two are merged into a new data block; this process is repeated until all data blocks have finally merged into a single block, at which point the deduplication is complete.
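The iterative flow above (clean every block in parallel, then repeatedly merge pairs of cleaned blocks until one remains) can be sketched in Python. This is an illustrative sketch, not the patent's implementation: rows are simplified to hashable values, the in-block and merge steps are stand-ins for the MD5-based procedures, and blocks are paired round by round, whereas the patent merges blocks opportunistically as their tasks finish.

```python
from concurrent.futures import ThreadPoolExecutor

def clean_block(rows):
    # In-block deduplication (stand-in for the MD5-based steps;
    # keeps the first occurrence of each row).
    return list(dict.fromkeys(rows))

def merge_pair(former, latter):
    # Drop rows of the latter block already present in the former,
    # then concatenate the two blocks.
    seen = set(former)
    return former + [r for r in latter if r not in seen]

def dedup_all(blocks):
    """Clean all blocks as parallel tasks, then merge pairs of
    cleaned blocks until a single block remains."""
    with ThreadPoolExecutor() as pool:
        cleaned = list(pool.map(clean_block, blocks))
        while len(cleaned) > 1:
            pairs = list(zip(cleaned[0::2], cleaned[1::2]))
            odd_one_out = cleaned[2 * len(pairs):]  # unpaired last block
            futures = [pool.submit(merge_pair, a, b) for a, b in pairs]
            cleaned = [f.result() for f in futures] + odd_one_out
        return cleaned[0]
```

Each merge halves the number of blocks, so a full run performs O(log n) merge rounds over n blocks.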
The data is decomposed into data blocks according to a decomposition strategy, which comprises a data splitting strategy and a duplicate criterion (the columns by which two rows are judged to be duplicates).
Deduplication within a data block comprises the following steps:
Add a computed column, which is used to detect duplicates and holds an MD5 code generated from the values of the columns in the duplicate criterion;
Compute the value of the MD5 column for every row;
Sort the rows by the value of the MD5 column;
Remove the rows whose MD5 value repeats.
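A minimal Python sketch of the four in-block steps, under the assumption that a data row is a dict and the duplicate criterion is a list of column names (both representations are illustrative, not specified by the patent):

```python
import hashlib

def dedup_block(rows, key_columns):
    """Remove duplicate rows from one data block using the steps above:
    add an MD5 column, fill it for every row, sort by it, drop repeats."""
    # Steps 1-2: compute each row's MD5 code from the duplicate
    # criterion's columns (values joined with a separator).
    keyed = [
        (hashlib.md5(
            "|".join(str(row[c]) for c in key_columns).encode("utf-8")
         ).hexdigest(), row)
        for row in rows
    ]
    # Step 3: sort by the MD5 column so duplicate rows become adjacent.
    keyed.sort(key=lambda pair: pair[0])
    # Step 4: keep only the first row of each run of equal MD5 codes.
    result, last = [], None
    for digest, row in keyed:
        if digest != last:
            result.append(row)
            last = digest
    return result
```

Sorting by the digest is what removes the need for pairwise row comparison: after the sort, duplicates can be eliminated in a single linear pass.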
The process by which a data block that has finished deduplication is compared and merged with another deduplicated block is as follows: after each data block finishes its deduplication, its current task is cancelled, another data block that has finished deduplication is located, and the compare-and-merge deduplication of the two blocks is initialized as a new task and loaded into the task pool.
The detailed process of the merge deduplication is:
Compare, row by row, the MD5 value of each row of the former data block with those of the latter data block;
Delete from the latter data block the rows whose MD5 value matches a row in the former data block;
Merge the two data blocks into a new data block.
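The merge step can be sketched as follows. For illustration, a hash set of the former block's MD5 codes replaces the patent's row-by-row comparison; the result is the same, but each lookup becomes constant time. The dict-row representation and column-name parameter are assumptions, as before.

```python
import hashlib

def row_md5(row, key_columns):
    """MD5 code of a row over the duplicate criterion's columns."""
    joined = "|".join(str(row[c]) for c in key_columns)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def merge_blocks(former, latter, key_columns):
    """Delete from the latter block every row whose MD5 code already
    appears in the former block, then concatenate the two blocks."""
    seen = {row_md5(r, key_columns) for r in former}
    survivors = [r for r in latter if row_md5(r, key_columns) not in seen]
    return former + survivors
```

Because both input blocks have already been deduplicated internally, the merged block is itself duplicate-free and can immediately take part in the next merge round.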
Compared with the prior art, the data deduplication cleaning method for large data volumes of the present invention has the following advantages:
The present invention is a deduplication method based on iterative parallel computation and the MD5 data-digest technique, which greatly reduces the number of loop comparisons. By adopting multithreaded parallel computing it can support deduplication cleaning of large data volumes while greatly improving cleaning efficiency. Through data decomposition and parallel execution it achieves high execution efficiency and high reliability, is practical, and is easy to popularize.
Brief description of the drawings
Figure 1 is a schematic diagram of the deduplication principle.
Figure 2 is a schematic diagram of the overall deduplication flow.
Figure 3 is a schematic diagram of the deduplication flow within a data block.
Figure 4 is a schematic diagram of the merge-deduplication flow between data blocks.
Detailed description of the invention
The invention is further described below with reference to the drawings and specific embodiments.
The implementation process of the data deduplication cleaning method for large data volumes of the present invention is:
S10: the data is split into several data blocks according to the deduplication rule;
S11: the deduplication of each data block is initialized as a task, and the tasks are loaded into a task pool for parallel execution;
S12: after a data block finishes deduplication, it is compared, deduplicated and merged with another data block that has finished deduplication; step S12 is repeated until all data blocks have finally merged into a single block, completing the deduplication.
In step S10, the data is split into several data blocks according to the deduplication rule. To elaborate, the rule comprises a data splitting strategy and a duplicate criterion, and the data is decomposed into blocks according to the splitting strategy.
In step S11, the deduplication of each data block is initialized as a task, and the tasks are loaded into the task pool for parallel execution. To elaborate, the processing within a data block comprises the following steps:
Add a computed column, which is used to detect duplicates and holds an MD5 code generated from the values of the columns in the duplicate criterion;
Compute the value of the MD5 column for every row;
Sort the rows by the value of the MD5 column;
Remove the rows whose MD5 value repeats.
In step S12, after a data block finishes deduplication, its current task is cancelled and the block is compared, deduplicated and merged with another data block that has finished deduplication. To elaborate, the compare-and-merge deduplication of the two blocks is initialized as a new task and loaded into the task pool; the processing of this task comprises the following steps:
Compare, row by row, the MD5 value of each row of the former data block with those of the latter data block;
Delete from the latter data block the rows whose MD5 value matches a row in the former data block;
Merge the two data blocks into a new data block;
Repeat step S12.
Embodiment: in an ERP system, the duplicate data in the product table needs to be cleaned. A duplicate-processing rule, comprising a data splitting strategy and a duplicate criterion, is configured.
As shown in Figure 1, the data is decomposed into 5 data blocks according to the decomposition strategy; the deduplication of each block is initialized as a task and loaded into the task pool for parallel execution. After a block's task finishes, the block is compared, deduplicated and merged with another block that has finished deduplication into a new data block; the new block repeats this process until a single data block finally remains.
The implementation flow is shown in Figure 2.
In step S10, the data to be cleaned is decomposed into 5 data blocks according to the configured splitting strategy.
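The patent leaves the splitting strategy configurable; an even contiguous-chunk split is one simple possibility, sketched here as an assumption rather than the patent's prescribed strategy:

```python
def split_into_blocks(rows, n_blocks):
    """Split the source rows into at most n_blocks contiguous chunks
    of near-equal size (one possible splitting strategy)."""
    # Ceiling division; guard against a zero chunk size for empty input.
    size = max(1, (len(rows) + n_blocks - 1) // n_blocks)
    return [rows[i:i + size] for i in range(0, len(rows), size)]
```

Any strategy works as long as every source row lands in exactly one block; rows duplicated across blocks are caught later by the merge step.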
In step S11, the deduplication of each data block is initialized as a task and loaded into the task pool for parallel execution. As shown in Figure 3, the execution flow of each task comprises the following steps:
1) Add a computed column, used to generate an MD5 code from the values of the columns included in the duplicate criterion;
2) Use the MD5 algorithm to generate the value of the computed column for every row;
3) Sort the data by the value of the MD5 column;
4) Remove the rows whose MD5 value repeats.
In step S12, after a data block finishes deduplication, its current task is cancelled and the block is compared, deduplicated and merged with another block that has finished deduplication. To elaborate, the compare-and-merge deduplication of the two blocks is initialized as a new task and loaded into the task pool; as shown in Figure 4, the processing of this task comprises the following steps:
Compare, row by row, the MD5 value of each row of the former data block with those of the latter data block;
Delete from the latter data block the rows whose MD5 value matches a row in the former data block;
Merge the two data blocks into a new data block;
Repeat step S12.
Through the above detailed description, those skilled in the art can readily implement the present invention. It should be understood, however, that the invention is not limited to the embodiments described above; on the basis of the disclosed embodiments, those skilled in the art may combine different technical features arbitrarily to obtain different technical schemes.
Technical features not described in this specification are known to those skilled in the art.
Claims (5)
1. A data deduplication cleaning method for large data volumes, characterized in that its implementation process is:
the data is decomposed into several data blocks; the deduplication of each data block is initialized as a task and loaded into a task pool for parallel execution;
after a data block's task finishes, the block is compared and deduplicated against another data block that has finished deduplication, and the two are merged into a new data block; this process is repeated until all data blocks have finally merged into a single block, at which point the deduplication is complete.
2. The data deduplication cleaning method for large data volumes according to claim 1, characterized in that the data is decomposed into data blocks according to a decomposition strategy, which comprises a data splitting strategy and a duplicate criterion.
3. The data deduplication cleaning method for large data volumes according to claim 1, characterized in that deduplication within a data block comprises the following steps:
add a computed column, which is used to detect duplicates and holds an MD5 code generated from the values of the columns in the duplicate criterion;
compute the value of the MD5 column for every row;
sort the rows by the value of the MD5 column;
remove the rows whose MD5 value repeats.
4. The data deduplication cleaning method for large data volumes according to claim 3, characterized in that the process by which a data block that has finished deduplication is compared and merged with another deduplicated block is: after each data block finishes deduplication, its current task is cancelled, another data block that has finished deduplication is located, and the compare-and-merge deduplication of the two blocks is initialized as a new task and loaded into the task pool.
5. The data deduplication cleaning method for large data volumes according to claim 4, characterized in that the detailed process of the merge deduplication is:
compare, row by row, the MD5 value of each row of the former data block with those of the latter data block;
delete from the latter data block the rows whose MD5 value matches a row in the former data block;
merge the two data blocks into a new data block.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610098006.6A CN105787008A (en) | 2016-02-23 | 2016-02-23 | Data deduplication cleaning method for large data volume |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105787008A true CN105787008A (en) | 2016-07-20 |
Family
ID=56402716
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610098006.6A Pending CN105787008A (en) | 2016-02-23 | 2016-02-23 | Data deduplication cleaning method for large data volume |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105787008A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130339297A1 (en) * | 2012-06-18 | 2013-12-19 | Actifio, Inc. | System and method for efficient database record replication using different replication strategies based on the database records |
CN103699441A (en) * | 2013-12-05 | 2014-04-02 | 深圳先进技术研究院 | MapReduce report task execution method based on task granularity |
CN103914522A (en) * | 2014-03-20 | 2014-07-09 | 电子科技大学 | Data block merging method applied to deleting duplicated data in cloud storage |
CN105320773A (en) * | 2015-11-03 | 2016-02-10 | 中国人民解放军理工大学 | Distributed duplicated data deleting system and method based on Hadoop platform |
- 2016-02-23: application CN201610098006.6A filed in CN; published as CN105787008A (status: Pending)
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106776951A (en) * | 2016-12-02 | 2017-05-31 | 航天星图科技(北京)有限公司 | One kind cleaning contrast storage method |
CN106776951B (en) * | 2016-12-02 | 2019-04-26 | 中科星图股份有限公司 | A kind of cleaning comparison storage method |
CN108319624A (en) * | 2017-01-18 | 2018-07-24 | 腾讯科技(深圳)有限公司 | Data load method and device |
WO2018184418A1 (en) * | 2017-04-06 | 2018-10-11 | 平安科技(深圳)有限公司 | Data cleaning method, terminal and computer readable storage medium |
CN110955637A (en) * | 2019-11-27 | 2020-04-03 | 集奥聚合(北京)人工智能科技有限公司 | Method for realizing ordering of oversized files based on low memory |
CN112256685A (en) * | 2020-10-30 | 2021-01-22 | 深圳物讯科技有限公司 | Spreadsheet-based segmentation de-duplication import method and related product |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105787008A (en) | Data deduplication cleaning method for large data volume | |
Liu et al. | An effective differential evolution algorithm for permutation flow shop scheduling problem | |
CN107301504A (en) | Leapfroged based on mixing-the production and transport coordinated dispatching method of path relinking and system | |
CN103309975A (en) | Duplicated data deleting method and apparatus | |
CN105373517A (en) | Spark-based distributed matrix inversion parallel operation method | |
CN110060740A (en) | A kind of nonredundancy gene set clustering method, system and electronic equipment | |
CN105550825B (en) | Flexible factory job scheduling method based on MapReduce parallelization in cloud computing environment | |
CN105488692A (en) | Method and device for computing number of online users | |
CN107016110B (en) | OWLHorst rule distributed parallel reasoning algorithm combined with Spark platform | |
CN106354552B (en) | Parallel computation method for allocating tasks and device | |
CN104090995A (en) | Automatic generating method of rebar unit grids in ABAQUS tire model | |
Huang et al. | Tabu search algorithm combined with global perturbation for packing arbitrary sized circles into a circular container | |
CN116595918B (en) | Method, device, equipment and storage medium for verifying quick logical equivalence | |
CN107291843A (en) | Hierarchical clustering improved method based on Distributed Computing Platform | |
CN103226466A (en) | Efficient incremental data capturing method | |
CN107038260A (en) | A kind of efficient parallel loading method for keeping titan Real-time Data Uniforms | |
CN104392124A (en) | Three-stage flexible flow workshop scheduling method based on ST heuristic algorithm | |
CN111045920A (en) | Workload-aware multi-branch software change-level defect prediction method | |
Smits et al. | Scalable symbolic regression by continuous evolution with very small populations | |
CN113127461B (en) | Data cleaning method and device, electronic equipment and storage medium | |
CN104050079A (en) | Real-time system testing method based on time automata | |
CN103268384B (en) | A kind of method of orderly extraction structure outline | |
CN105893145A (en) | Task scheduling method and device based on genetic algorithm | |
CN106777262B (en) | High-throughput sequencing data quality filtering method and filtering device | |
Sun et al. | Using sampling methods to improve binding site predictions |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20160720 |