CN104391894A - Method for checking and processing repeated data

Info

Publication number: CN104391894A
Application number: CN201410633391.0A
Authority: CN
Inventors: 李爱民; 陈承志; 龙庆麟; 梁国辉; 熊道勇
Original assignee: Guangzhou Ke Teng Information Technology Co ltd
Current assignee: Guangzhou Ke Teng Information Technology Co ltd
Priority date: 2014-11-11
Filing date: 2014-11-11
Publication date: 2015-03-04

Abstract

The invention discloses a method for checking and processing repeated data. The method comprises the following steps: A, acquiring data to be verified, and initializing the data structure of the data to be verified; B, calculating the hash code of each datum in the data to be verified; C, checking whether repeated data exist among the data or not according to the hash code of each datum, and updating a tag code of each datum according to a checking result; D, transmitting each datum of which the tag code is updated to each distributed calculating node in order to determine whether repeated data exist between each datum of which the tag code is updated and local data through each distributed calculating node; E, transmitting each datum compared by each distributed calculating node to a summarizing node. By adopting the method, the comparison time of massive data can be shortened, and the data lookup and cleaning efficiencies are increased.

Description

A method for checking and processing duplicate data

Technical field

The present invention relates to the field of computer technology, and in particular to a method for checking and processing duplicate data.

Background art

With the rapid development of the mobile internet, the Internet of Things, and enterprise IT adoption, enterprises generate tens of thousands of records every day, and data volumes keep growing massively, which places higher demands on enterprise data storage and processing. Finding and removing duplicate data is one way to reduce the volume of stored data and thereby lower those processing demands.

Data deduplication technology aims to remove redundant backup data and ensure that only unique data is stored on disk; it is a capacity-optimized protection technology. Its key is to retain only a single instance of each piece of data, which effectively addresses the problem of capacity growth. Because Chinese text has no spaces between words, searching and identifying Chinese data is difficult, so most existing research addresses the deduplication of English data.

Checking and cleaning duplicate Chinese data first requires searching the data to screen out duplicates, which are then removed or otherwise processed. The most intuitive way to find duplicates is to compare each record one by one with all other records in the database. This method is highly accurate, but it does not take into account characteristics of duplicate records such as fields of unequal length and the tendency of Chinese fields to carry their semantic emphasis toward the end. With massive data, the matching time is so long that results cannot be obtained in real time, which makes the method impractical.

An existing solution uses the sorted neighborhood method (SNM) to sort and match records. SNM effectively overcomes the shortcomings of the intuitive comparison method and greatly improves the efficiency of matching and cleaning duplicate records. However, SNM has the following defects: its matching result depends heavily on the choice of sort key, and the size of the sliding window is difficult to control. Because a record in SNM can only be compared with the records inside the window, matches are missed when the sliding window is too small or the sort key is poorly chosen, and many unnecessary comparisons are made when the sliding window is too large; a sliding window of suitable size is therefore hard to find. A solution for checking and processing duplicate data in massive structured data is thus urgently needed.
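The window-size sensitivity of SNM described above can be illustrated with a minimal sketch. The function name `snm_pairs`, the choice of sort key, and the sample records are illustrative assumptions and do not come from the patent; exact equality stands in for whatever record-matching rule a real SNM implementation would use.

```python
def snm_pairs(records, key, window):
    """Return index pairs of equal records found inside a sliding window.

    Records are sorted by `key`; each record is compared only with the
    `window - 1` records that follow it in sorted order.
    """
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    matches = []
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if records[i] == records[j]:
                matches.append((min(i, j), max(i, j)))
    return sorted(matches)


# Hypothetical sample: records 0 and 3 are duplicates, but a poorly chosen
# sort key (the first character only) leaves other records between them.
records = ["b-alpha", "b-beta", "b-gamma", "b-alpha"]
first_char = lambda r: r[0]

print(snm_pairs(records, first_char, window=2))  # [] : the small window misses the pair
print(snm_pairs(records, first_char, window=4))  # [(0, 3)] : found, at the cost of more comparisons
```

With a window of 2 only three comparisons are made and the duplicate is missed; with a window of 4 every pair is compared, which is exactly the one-by-one comparison SNM was meant to avoid.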

Summary of the invention

The embodiments of the present invention provide a method for checking and processing duplicate data that can shorten the comparison time for massive data and improve the efficiency of data search and cleaning.

An embodiment of the present invention provides a method for checking and processing duplicate data, comprising:

A. acquiring data to be verified, and initializing the data structure of the data to be verified;

B. calculating the hash code of each record in the data to be verified;

C. checking, according to the hash code of each record, whether duplicate data exists among the records, and updating the flag code of each record according to the check result;

D. sending the records with updated flag codes to each distributed computing node, so that each distributed computing node compares the records with updated flag codes against its local data to determine whether duplicate data exists between them;

E. sending the records compared by each distributed computing node to a summarizing node.

Further, step A is specifically:

acquiring the data to be verified, initializing the data structure of the data to be verified, and converting the data to be verified into data in a JSON structure;

each record in the JSON structure comprises: the field name of each field, the value corresponding to each field, a hash code, and a flag code.

Further, step B is specifically:

B1. taking the i-th record of the data to be verified, where the data to be verified comprises N records, the initial value of i is 1, and i and N are positive integers;

B2. forming a character string from each field name of the i-th record and the value corresponding to each field;

B3. applying the MD5 algorithm to the character string of the i-th record to calculate the hash code of the i-th record, and updating and saving the hash code of the i-th record;

B4. incrementing i by 1 and repeating steps B2 and B3 until the hash codes of all N records of the data to be verified have been updated and saved.

Further, step C is specifically:

C1. checking whether the hash code of the m-th record is identical to the hash code of the (m+n)-th record; if so, updating the flag code of the m-th record to 1 and proceeding directly to step C3; if not, updating the flag code of the m-th record to 0 and performing step C2;

C2. incrementing n by 1 and repeating step C1 until the value of m+n is greater than N;

C3. incrementing m by 1, resetting n to its initial value, and repeating step C1 until m = N; where the initial values of m and n are both 1, and m and n are positive integers.

Further, after step C and before step D, the method also comprises:

dividing the records into a first data set and a second data set according to their updated flag codes, where the flag code of every record in the first data set is 1 and the flag code of every record in the second data set is 0;

where the first data set and the second data set are both data in the JSON structure.

Further, step D is specifically:

sending the records with updated flag codes to each distributed computing node in turn, using distributed transmission.

Further, each distributed computing node comparing the records with updated flag codes against its local data to determine whether duplicate data exists is specifically:

taking the j-th record in the second data set and comparing its hash code one by one with the hash code of each record in the local data; if an identical hash code exists, removing the j-th record from the second data set and storing it in the first data set; if no comparison finds an identical hash code, incrementing j by 1 and repeating the comparison until all records in the second data set have been compared, where j is a positive integer with an initial value of 1.

Further, after step E, the method also comprises step F:

F. summarizing and merging the records sent by each distributed computing node to obtain a non-duplicate data set and a duplicate data set.

Further, after step F, the method also comprises step G:

G. storing the non-duplicate data set in the database of each computing node in the distributed cluster, and deleting the duplicate data set.

It can be seen that implementing the embodiments of the present invention has the following beneficial effects:

The method for checking and processing duplicate data provided by the embodiments of the present invention initializes the data to be verified into a unified structure and compresses the data, so that the content to be compared has a uniform, fixed character length, which reduces the time needed to compare the content of individual records. During data comparison, distributed processing is used so that multiple computing nodes perform the comparison simultaneously. Compared with the intuitive one-by-one comparison of the prior art, the technical solution of the present invention can greatly shorten the comparison time for massive data, and makes duplicate comparison of massive data simple to operate, efficient, real-time, and scalable.

Brief description of the drawings

Fig. 1 is a schematic flowchart of one embodiment of the method for checking and processing duplicate data provided by the present invention;

Fig. 2 is a schematic flowchart of another embodiment of the method for checking and processing duplicate data provided by the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Referring to Fig. 1, which is a schematic flowchart of one embodiment of the method for checking and processing duplicate data provided by the present invention, the method comprises the following steps:

Step A: acquire the data to be verified, and initialize the data structure of the data to be verified.

In this embodiment, the data to be verified is acquired from the source data, its data structure is initialized, and it is converted into data in a JSON structure. Each record in the JSON structure comprises the field name of each field, the value corresponding to each field, a hash code, and a flag code.

In this embodiment, the JSON structure consists of the following parts: Field represents each field name of a record; Value represents the value corresponding to each field of the record; Hash represents the hash code of the record and is initially empty; and IsExist is the flag code, which indicates whether the record has duplicate data and is initially empty.
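The structure listing itself does not survive in this text, so the following is only a sketch of what one initialized record might look like. The key names `Field`, `Value`, `Hash`, and `IsExist` follow the description above (with the field-name key spelled `Field`); the enclosing `Data` key, the nesting, the helper name `init_record`, and the sample values are assumptions made for illustration.

```python
import json


def init_record(fields):
    """Wrap one source row in a unified structure (the layout is an assumption)."""
    return {
        "Data": [{"Field": name, "Value": value} for name, value in fields.items()],
        "Hash": None,     # hash code, empty until step B fills it in
        "IsExist": None,  # flag code, empty until step C fills it in
    }


record = init_record({"name": "张三", "city": "广州"})
print(json.dumps(record, ensure_ascii=False, indent=2))
```

Keeping `Hash` and `IsExist` alongside the field data means later steps can update a record in place without consulting a separate lookup table.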

Step B: calculate the hash code of each record in the data to be verified.

In this embodiment, because the hash code of each record is empty when the data structure is initialized, the MD5 algorithm is used to calculate the hash code of each record. Step B is specifically:

B1. Take the i-th record of the data to be verified, where the data to be verified comprises N records, the initial value of i is 1, and i and N are positive integers.

B2. Form a character string from each field name of the i-th record and the value corresponding to each field.

B3. Apply the MD5 algorithm to the character string of the i-th record to calculate the hash code of the i-th record, and update and save the hash code of the i-th record.

B4. Increment i by 1 and repeat steps B2 and B3 until the hash codes of all N records of the data to be verified have been updated and saved.

In this embodiment, the format of the character string formed is as follows:

"Field_1":"value_1","Field_2":"value_2",……,"Field_M":"value_M"

In this example, the MD5 algorithm is used to calculate the hash code of this character string, which compresses the content to be compared and reduces the content comparison time between individual records.
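Steps B1 to B4 can be sketched with Python's standard `hashlib` module. The helper names `record_string` and `record_hash`, the UTF-8 encoding, the hexadecimal digest form, and the sample rows are assumptions; the patent specifies only the string format and the use of MD5.

```python
import hashlib


def record_string(fields):
    """Join field names and values as "name_1":"value_1","name_2":"value_2",..."""
    return ",".join(f'"{name}":"{value}"' for name, value in fields.items())


def record_hash(fields):
    """Return the 32-character hexadecimal MD5 digest of the record string."""
    # Assumption: the string is encoded as UTF-8 before hashing.
    return hashlib.md5(record_string(fields).encode("utf-8")).hexdigest()


rows = [
    {"name": "张三", "city": "广州"},
    {"name": "李四", "city": "深圳"},
    {"name": "张三", "city": "广州"},
]
hashes = [record_hash(row) for row in rows]  # steps B1 to B4 over all N records
print(hashes[0] == hashes[2], hashes[0] == hashes[1])
```

Whatever the length of the original fields, every record is reduced to a digest of the same fixed length, which is what makes the later comparisons uniform in cost.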

Step C: according to the hash code of each record, check whether duplicate data exists among the records, and update the flag code of each record according to the check result.

In this embodiment, step C is specifically:

C1. Check whether the hash code of the m-th record is identical to the hash code of the (m+n)-th record; if so, update the flag code of the m-th record to 1 and proceed directly to step C3; if not, update the flag code of the m-th record to 0 and perform step C2.

C2. Increment n by 1 and repeat step C1 until the value of m+n is greater than N.

C3. Increment m by 1, reset n to its initial value, and repeat step C1 until m = N, where the initial values of m and n are both 1, and m and n are positive integers.
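The pairwise check of step C can be sketched as a nested loop over the hash codes. The function name `flag_duplicates` and the sample hashes are hypothetical, and the handling of the last record is an assumption: it is never compared forward, so this sketch simply gives it flag code 0.

```python
def flag_duplicates(hashes):
    """Return flag codes: 1 if a later record has the same hash, otherwise 0."""
    count = len(hashes)
    flags = [0] * count  # assumption: the last record, never compared forward, gets 0
    for m in range(count):                 # C3: advance m and reset n
        for later in range(m + 1, count):  # C2: advance n until m+n exceeds N
            if hashes[m] == hashes[later]:  # C1: compare record m with record m+n
                flags[m] = 1
                break                       # proceed directly to C3
    return flags


print(flag_duplicates(["a1", "b2", "a1", "c3", "b2"]))  # [1, 1, 0, 0, 0]
```

Because each record is compared only with the records after it, the last occurrence of every repeated hash keeps flag code 0, so one copy of each record remains in the set that goes on to the distributed comparison.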

In this embodiment, after step C and before step D, the method also comprises: dividing the records into a first data set and a second data set according to their updated flag codes. The flag code of every record in the first data set is 1, and the flag code of every record in the second data set is 0. Both the first data set and the second data set are data in the JSON structure.
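A minimal sketch of this split, assuming each record is a dictionary carrying its flag code under the `IsExist` key; the function name `split_by_flag` and the sample records are hypothetical.

```python
def split_by_flag(records):
    """Split records into (first_set, second_set) by their IsExist flag code."""
    first_set = [r for r in records if r["IsExist"] == 1]   # already known duplicates
    second_set = [r for r in records if r["IsExist"] == 0]  # still to be checked against local data
    return first_set, second_set


records = [
    {"Hash": "a1", "IsExist": 1},
    {"Hash": "b2", "IsExist": 0},
    {"Hash": "a1", "IsExist": 0},
]
first_set, second_set = split_by_flag(records)
print(len(first_set), len(second_set))  # 1 2
```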

Step D: send the records with updated flag codes to each distributed computing node, so that each distributed computing node compares the records with updated flag codes against its local data to determine whether duplicate data exists between them.

In this embodiment, step D is specifically: using distributed transmission, the records with updated flag codes are sent to each distributed computing node in turn. Each distributed computing node has its own database, which stores one part of the data to be compared against; the local data of all distributed computing nodes together make up the full comparison data. For example, let the records with updated flag codes be denoted X. X is sent to the first distributed computing node, and the first node sends X to the second node; once this is complete, both the first and second nodes hold X. The first node then sends X to the third node while the second node sends X to the fourth node, and so on. In this way, distributed transmission delivers X to every distributed computing node in turn.
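The relay pattern in this example, where every node that already holds X forwards it to one new node, can be simulated as follows. The function name `relay_schedule` is hypothetical, and the pairing of senders with receivers beyond the first two rounds is an assumption extrapolated from the example.

```python
def relay_schedule(node_count):
    """Return a list of rounds; each round is a list of (sender, receiver) pairs.

    Node 1 is assumed to hold X already. In every round each node that holds X
    forwards it to one node that does not, so the number of holders doubles.
    """
    rounds = []
    holders = 1
    while holders < node_count:
        sends = []
        for sender in range(1, holders + 1):
            receiver = holders + sender
            if receiver <= node_count:
                sends.append((sender, receiver))
        rounds.append(sends)
        holders += len(sends)
    return rounds


for number, sends in enumerate(relay_schedule(6), start=1):
    print(number, sends)
# 1 [(1, 2)]
# 2 [(1, 3), (2, 4)]
# 3 [(1, 5), (2, 6)]
```

Under this assumed pairing the number of rounds grows with the logarithm of the node count rather than linearly, since the set of holders doubles each round.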

In this embodiment, after receiving the records with updated flag codes, a distributed computing node checks whether duplicate data exists between them and its local data. Because the data to be verified has already been divided into the first data set and the second data set, and the records in the first data set are already known to be duplicates, only the records in the second data set need to be checked against the local data. The check is specifically:

Take the j-th record in the second data set and compare its hash code one by one with the hash code of each record in the local data. If an identical hash code exists, remove the j-th record from the second data set and store it in the first data set. If no comparison finds an identical hash code, increment j by 1 and repeat the comparison until all records in the second data set have been compared, where j is a positive integer with an initial value of 1.

Step E: send the records compared by each distributed computing node to the summarizing node.

In this embodiment, the records that each distributed computing node has finished comparing are sent to the summarizing node, which summarizes and merges the data to complete the duplicate data check.

As an example of this embodiment, refer to Fig. 2, which is a schematic flowchart of another embodiment of the method for checking and processing duplicate data provided by the present invention. As shown in Fig. 2, the difference from Fig. 1 is that steps F and G follow step E.

Step F: summarize and merge the records sent by each distributed computing node to obtain a non-duplicate data set and a duplicate data set.

Step G: store the non-duplicate data set in the database of each computing node in the distributed cluster, and delete the duplicate data set.

In this embodiment, deleting the duplicate data set is only one possible way of handling it; the duplicate data set can be processed as business requirements dictate.
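The text does not spell out how the summarizing node merges the per-node results, so the following sketch rests on two assumptions: each record carries a hypothetical unique `Id` key, and a record counts as a duplicate if any node placed it in its first data set. The function name `merge_results` and the sample data are also hypothetical.

```python
def merge_results(node_results):
    """Merge per-node (first_set, second_set) pairs into (unique, duplicates).

    Assumption: a record is a duplicate if any node reported it in its first
    data set, and unique only if no node did. Records are matched by "Id".
    """
    duplicate_ids = set()
    records = {}
    for first_set, second_set in node_results:
        for record in first_set:
            duplicate_ids.add(record["Id"])
        for record in first_set + second_set:
            records.setdefault(record["Id"], record)
    unique = [r for i, r in records.items() if i not in duplicate_ids]
    duplicates = [r for i, r in records.items() if i in duplicate_ids]
    return unique, duplicates


node_a = ([{"Id": 1, "Hash": "a1"}],
          [{"Id": 2, "Hash": "b2"}, {"Id": 3, "Hash": "c3"}])
node_b = ([{"Id": 1, "Hash": "a1"}, {"Id": 3, "Hash": "c3"}],
          [{"Id": 2, "Hash": "b2"}])
unique, duplicates = merge_results([node_a, node_b])
print([r["Id"] for r in unique], [r["Id"] for r in duplicates])  # [2] [1, 3]
```

Matching by record identifier rather than by hash code keeps the surviving copy of a record that was duplicated only within the incoming batch, since that copy shares its hash with the flagged copies.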

Therefore, implementing the embodiments of the present invention achieves the beneficial effects described above: a shorter comparison time for massive data and more efficient data search and cleaning.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. The storage medium can be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

The above is the preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make improvements and refinements without departing from the principles of the present invention, and such improvements and refinements are also regarded as falling within the scope of protection of the present invention.

Claims

1. A method for checking and processing duplicate data, characterized by comprising:

2. The method for checking and processing duplicate data according to claim 1, characterized in that step A is specifically:

3. The method for checking and processing duplicate data according to claim 2, characterized in that step B is specifically:

4. The method for checking and processing duplicate data according to claim 3, characterized in that step C is specifically:

5. The method for checking and processing duplicate data according to claim 4, characterized in that, after step C and before step D, the method also comprises:

6. The method for checking and processing duplicate data according to claim 5, characterized in that step D is specifically:

7. The method for checking and processing duplicate data according to claim 6, characterized in that each distributed computing node comparing the records with updated flag codes against its local data to determine whether duplicate data exists is specifically:

8. The method for checking and processing duplicate data according to any one of claims 1 to 7, characterized in that, after step E, the method also comprises step F:

9. The method for checking and processing duplicate data according to claim 8, characterized in that, after step F, the method also comprises step G: