CN105488212A

CN105488212A - Data quality detection method and device of duplicated data

Info

Publication number: CN105488212A
Application number: CN201510925893.5A
Authority: CN
Inventors: 许飞月; 李青海; 简宋全; 侯大勇; 邹立斌
Original assignee: Guangzhou Jing Dian Computing Machine Science And Technology Ltd
Current assignee: Guangzhou Jing Dian Computing Machine Science And Technology Ltd
Priority date: 2015-12-11
Filing date: 2015-12-11
Publication date: 2016-04-13
Anticipated expiration: 2035-12-11
Also published as: CN105488212B

Abstract

The invention discloses a data quality detection method and device for duplicated data. The method comprises: step b, generating a model training set; step c, analyzing every combination pair in the model training set and marking it as record duplication or record non-duplication; step d, calculating the probability of record duplication and screening out the field combinations with relatively high probability as sample field combinations; step e, analyzing the values of the data to be detected; and step f, carrying out duplicate detection on the data and screening out the record combinations whose duplicated fields all satisfy the sample field combinations. The device comprises a training set generating unit, a sample record duplication marking unit, a sample combination screening unit, a detection data analyzing unit and a detection data screening unit, each corresponding to one of the steps. Because the duplication probability is calculated for field combinations, it is unnecessary to compare the duplication probability of every two records; the time is therefore shortened and the detection efficiency is improved, and cases in which two records are only partly the same can also be detected.

Description

Data quality detection method and device for duplicated data

Technical field

The present invention relates to the technical field of data quality monitoring, and in particular to a data quality detection method and device for duplicated data.

Background technology

The rapid development of information technology has gradually made data one of the most important resources for realizing business value. However, as data volumes keep growing, data quality problems follow. Missing, erroneous and inconsistent data hinder an enterprise's use of its data; in serious cases they can lead the enterprise to make wrong decisions, lose important value and ultimately suffer a crisis of trust.

Many data quality detection and cleaning schemes have emerged to deal with such dirty data. Duplicated data is one of the data quality problems that is comparatively hard to detect, because the duplication that enterprises face today includes not only fully duplicated data but also partially duplicated data. For example, a social networking site may have tens of millions of users, some of whom have registered more than once, and the repeated registrations may differ only slightly in some of their information. Identifying these duplicated user records is essential for maintaining the quality of the site.

Among the more representative duplicate detection schemes at present, some compute a unique hash code and check code from the content of each record and then judge whether records are duplicates according to whether the hash codes and check codes are identical. These schemes are accurate and efficient, but apply only to records that are fully duplicated. Other schemes train a duplicate detection model by machine learning. These schemes are flexible, and a single method is not limited to one duplicate detection scenario, but a duplication likelihood must be calculated for every two records, so efficiency is low and accuracy still needs improvement.

In view of the above defects, the inventors arrived at the present data quality detection method and device for duplicated data after long research and testing.

Summary of the invention

The object of the present invention is to provide a data quality detection method and device for duplicated data that overcome the above technical defects and solve the problem of how to detect partially duplicated data and fully duplicated data accurately and quickly.

To achieve the above object, the technical solution adopted by the present invention is as follows. First, a data quality detection method for duplicated data is provided, which comprises:

Step b, analyzing the data values of a training sample comprising multiple records, and generating a model training set;

Step c, analyzing each combination pair in the model training set, and marking the two records corresponding to the combination pair as record duplication or record non-duplication, manually or by an algorithm; then choosing whether to continue training: if so, redetermining the training sample and returning to step b; otherwise, proceeding to step d;

Step d, calculating, for one or more fields, the probability that records are duplicated given that the field or fields are duplicated, and screening out the field combinations with larger probability as sample field combinations;

Step e, analyzing the values of the data to be detected, and outputting the record numbers corresponding to each distinct value of each field;

Step f, carrying out duplicate detection on the analyzed data to be detected according to the sample field combinations, and screening out the record combinations whose duplicated fields all satisfy the sample field combinations.

Preferably, the data quality detection method further comprises:

Step a, extracting the training sample from the source of the data to be detected; step a precedes step b.

Preferably, the data quality detection method further comprises:

Step g, outputting the retained record combinations and the probability that each record combination is duplicated; step g follows step f.

Preferably, step b comprises:

Step b2, analyzing the data values of the training sample, and counting the record numbers corresponding to each distinct value of each field;

Step b3, processing the record numbers corresponding to each distinct value of each field, and generating the model training set.

Preferably, step b3 comprises:

Step b31, collecting the values of field one that correspond to exactly two records; the two records corresponding to each such value form one combination pair, which is recorded with a field-duplication mark added at field one;

Step b32, collecting the values of field one that correspond to three or more records; the records corresponding to each such value are combined two by two, each combination being one combination pair, which is recorded with the field-duplication mark added at field one;

Step b33, collecting the values of field two that correspond to two or more records; the records corresponding to each such value are combined two by two, each combination being one combination pair; if the combination pair is identical to a combination pair already recorded, the field-duplication mark is added at field two of the recorded combination pair; if it differs from every recorded combination pair, it is recorded with the field-duplication mark added at field two;

Step b34, processing the other fields as in step b33; all the combination pairs thus formed constitute the model training set.

Preferably, step d comprises:

Step d1, for a given field, taking the number of combination pairs that carry the field-duplication mark at that field as the divisor, taking the number of those combination pairs that are also marked as record duplication as the dividend, and taking the quotient as the probability that records are duplicated given that the field is duplicated;

Step d2, calculating, from the single-field probabilities, the probability that records are duplicated given that multiple fields are duplicated;

Step d3, setting a threshold, and screening out the field combinations whose record duplication probability is greater than or equal to the threshold as the sample field combinations.

Preferably, the probability that records are duplicated given that multiple fields are duplicated is calculated as:

$$p(1, 2, \ldots, k) = \sum_{i} p_{i} + (-1)^{1} \sum_{i_1 < i_2} p_{i_1} p_{i_2} + \cdots + (-1)^{k-1} \sum_{i_1 < i_2 < \cdots < i_k} p_{i_1} p_{i_2} \cdots p_{i_k}$$

where $p(1, 2, \ldots, k)$ is the probability that records are duplicated given that fields 1, 2, ..., k are duplicated; $p_{i}$, $p_{i_1}$, $p_{i_2}$ and $p_{i_k}$ are the probabilities that records are duplicated given that field $i$, $i_1$, $i_2$ or $i_k$, respectively, is duplicated; $i_1$ and $i_2$ denote the sequence numbers of any two of the fields 1, 2, ..., k; and $i_k$ denotes the sequence number of field k.

Preferably, step f comprises:

Step f1, determining the minimum value N of the number of fields in any combination among the sample field combinations;

Step f2, searching the data to be detected for two-record combinations in which at least N fields are identical, checking them against the sample field combinations, and retaining those that satisfy the sample field combinations;

Step f3, searching, on the basis of the (n-1)-record combinations known from the retained record combinations in which at least N fields are identical, for n-record combinations in which at least N fields are identical; if none is found, ending;

Step f4, checking the n-record combinations against the sample field combinations and retaining those that satisfy them, while deleting from the retained (n-1)-record combinations every (n-1)-record subset of a retained n-record combination; returning to step f3.

Preferably, in step f3, the search must satisfy the following conditions:

An n-record combination is formed by joining two (n-1)-record combinations, and the two (n-1)-record combinations have n-2 records in common;

Every (n-1)-record subset of the newly formed n-record combination is among the (n-1)-record combinations in which at least N fields are identical.
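Steps f1 to f4 and the two search conditions can be sketched as a level-wise search. The following Python is only an illustration under stated assumptions, not the patented implementation: the names `common_fields` and `detect_duplicates` are hypothetical, a record combination is assumed to "satisfy" the sample field combinations when the fields shared by all of its records cover at least one sample field combination, and the two-record search is a brute-force scan for brevity rather than a lookup in the per-value record numbers produced by step e.

```python
from itertools import combinations

def common_fields(records, rec_nos):
    """Fields on which every listed record holds the same value."""
    first = records[rec_nos[0]]
    return frozenset(
        field for field, value in first.items()
        if all(records[r].get(field) == value for r in rec_nos[1:])
    )

def detect_duplicates(records, sample_field_combinations):
    combos = [frozenset(c) for c in sample_field_combinations]

    def satisfies(fields):
        # Assumption: the shared fields must cover at least one sample field combination.
        return any(c <= fields for c in combos)

    # Step f1: smallest number of fields in any sample field combination.
    n_min = min(len(c) for c in combos)

    # Step f2: two-record combinations with at least N identical fields.
    level = {}
    for pair in combinations(sorted(records), 2):
        fields = common_fields(records, pair)
        if len(fields) >= n_min:
            level[pair] = fields
    retained = {ids: f for ids, f in level.items() if satisfies(f)}

    while level:
        size = len(next(iter(level)))
        # Step f3: join two (n-1)-record combinations that share n-2 records,
        # and require every (n-1)-record subset to be a known combination.
        candidates = set()
        for a, b in combinations(level, 2):
            merged = tuple(sorted(set(a) | set(b)))
            if len(merged) == size + 1 and all(
                sub in level for sub in combinations(merged, size)
            ):
                candidates.add(merged)
        next_level = {}
        for ids in candidates:
            fields = common_fields(records, ids)
            if len(fields) >= n_min:
                next_level[ids] = fields
        # Step f4: retain qualifying n-record combinations, drop their subsets.
        for ids, fields in next_level.items():
            if satisfies(fields):
                retained[ids] = fields
                for sub in combinations(ids, size):
                    retained.pop(sub, None)
        level = next_level
    return retained

records = {
    1: {"name": "Li Lei", "phone": "555-0101", "city": "Guangzhou"},
    2: {"name": "Li Lei", "phone": "555-0101", "city": "Shenzhen"},
    3: {"name": "Li Lei", "phone": "555-0101", "city": "Foshan"},
    4: {"name": "Han Mei", "phone": "555-0199", "city": "Guangzhou"},
}
result = detect_duplicates(records, [("name", "phone")])
```

In this made-up data, records 1, 2 and 3 agree on name and phone, so the three two-record combinations found in step f2 are replaced in step f4 by the single three-record combination.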

Second, a data quality detection device for duplicated data corresponding to the data quality detection method is provided, which comprises:

A training set generating unit, which analyzes the data values of a training sample comprising multiple records and generates a model training set;

A sample record duplication marking unit, which analyzes each combination pair in the model training set and marks the two records corresponding to the combination pair as record duplication or record non-duplication, manually or by an algorithm; it then chooses whether to continue training: if so, the training sample is redetermined and control returns to the training set generating unit; otherwise, control passes to the sample combination screening unit;

A sample combination screening unit, which calculates, for one or more fields, the probability that records are duplicated given that the field or fields are duplicated, and screens out the field combinations with larger probability as sample field combinations;

A detection data analyzing unit, which analyzes the values of the data to be detected and outputs the record numbers corresponding to each distinct value of each field;

A detection data screening unit, which carries out duplicate detection on the analyzed data to be detected according to the sample field combinations and screens out the record combinations whose duplicated fields all satisfy the sample field combinations.

Compared with the prior art, the beneficial effects of the present invention are as follows. A data quality detection method and device for duplicated data are provided. General duplicate detection methods must test every pair of records; the present method instead calculates the duplication likelihood of field combinations and converts detection between records into the detection of record combinations that are identical on the corresponding field combinations, so the duplication likelihood of every two records need not be compared, which shortens the time and improves detection efficiency. The method is also not limited to detecting two fully identical records: it can detect two records that are partly identical, calculating their duplication probability and deciding against a threshold whether they are duplicates. The data quality analyst can define the condition under which two records are judged identical. Through the choice of training sample the method automatically weights different fields, which provides a degree of flexibility. The formula allows the probability of record duplication given multi-field duplication to be calculated quickly, which speeds up judgement, saves time and improves data quality detection efficiency; the formula is also simple and saves system resources. Once the model training set has been generated, analysis of records is converted into analysis of the identical fields of records, which speeds up subsequent processing. Errors can be eliminated, improving the accuracy with which duplicated data is judged. Because the training sample is extracted from the source of the data to be detected and is therefore homologous with it, the accuracy of duplicate judgement is improved.

Brief description of the drawings

Fig. 1 is a flowchart of the data quality detection method for duplicated data of the present invention;

Fig. 2 is a flowchart of step b of the data quality detection method for duplicated data of the present invention;

Fig. 3 is a flowchart of step b3 of the data quality detection method for duplicated data of the present invention;

Fig. 4 is a flowchart of step d of the data quality detection method for duplicated data of the present invention;

Fig. 5 is probability calculation schematic diagram one of the data quality detection method for duplicated data of the present invention;

Fig. 6 is probability calculation schematic diagram two of the data quality detection method for duplicated data of the present invention;

Fig. 7 is a flowchart of step e of the data quality detection method for duplicated data of the present invention;

Fig. 8 is a flowchart of step f of the data quality detection method for duplicated data of the present invention;

Fig. 9 is a flowchart of embodiment one of the data quality detection method for duplicated data of the present invention;

Fig. 10 is a flowchart of embodiment two of the data quality detection method for duplicated data of the present invention;

Fig. 11 is a flowchart of embodiment three of the data quality detection method for duplicated data of the present invention;

Fig. 12 is a table of part of the data to be detected in the example of the data quality detection method for duplicated data of the present invention;

Fig. 13 is a table of the record numbers corresponding to some of the distinct values in the example of the data quality detection method for duplicated data of the present invention;

Fig. 14 is a table of the field-duplication marks of some of the combination pairs in the example of the data quality detection method for duplicated data of the present invention;

Fig. 15 is a table of the record-duplication marks of some of the combination pairs in the example of the data quality detection method for duplicated data of the present invention;

Fig. 16 shows the field combinations retained in the example of the data quality detection method for duplicated data of the present invention;

Fig. 17 is a structural diagram of the data quality detection device for duplicated data of the present invention;

Fig. 18 is a structural diagram of the training set generating unit of the data quality detection device for duplicated data of the present invention;

Fig. 19 is a structural diagram of the record number processing module of the data quality detection device for duplicated data of the present invention;

Fig. 20 is a structural diagram of the sample combination screening unit of the data quality detection device for duplicated data of the present invention;

Fig. 21 is a structural diagram of the detection data analyzing unit of the data quality detection device for duplicated data of the present invention;

Fig. 22 is a structural diagram of the detection data screening unit of the data quality detection device for duplicated data of the present invention;

Fig. 23 is a structural diagram of embodiment six of the data quality detection device for duplicated data of the present invention;

Fig. 24 is a structural diagram of embodiment seven of the data quality detection device for duplicated data of the present invention;

Fig. 25 is a structural diagram of embodiment eight of the data quality detection device for duplicated data of the present invention.

Detailed description of the embodiments

The above and other technical features and advantages of the present invention are described in more detail below with reference to the accompanying drawings.

Fig. 1 is a flowchart of the data quality detection method for duplicated data of the present invention. The method comprises:

Step b, analyzing the data values of the training sample, and generating the model training set;

The training sample contains multiple records. Each record has a corresponding number, the record number; record numbers are arranged in order and increase successively. Every record is divided into multiple fields: field 1, field 2, field 3, field 4, and so on. Each field therefore has one value in every record, so a field has as many values as there are records (some of these values are identical and some differ), and the numbering of a field's values corresponds to the numbering of the records. For example, the first value of field 1 is field 1 of the first record, so the two are naturally the same value.

The training sample may be written by the data analyst as the situation requires, or may be extracted from the data to be detected.

Fig. 2 is a flowchart of step b of the data quality detection method for duplicated data of the present invention. Step b comprises:

Step b2, analyzing the data values of the training sample, and counting the record numbers corresponding to each distinct value of each field;

A field has multiple values, some identical and some different. Identical values are merged, and the record numbers of the merged values are attached, so that the field ends up with multiple distinct values, each followed by at least one record number.

All fields are processed in this way, giving the record numbers corresponding to each distinct value of each field.
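The statistics of step b2 amount to building, for every field, a map from each distinct value to the record numbers that hold it. A minimal Python sketch follows; the function name `index_values`, the dictionary layout and the sample records are assumptions made for illustration and do not come from the patent.

```python
from collections import defaultdict

def index_values(records):
    """Map each field to {distinct value: [record numbers holding that value]}."""
    index = defaultdict(lambda: defaultdict(list))
    for rec_no in sorted(records):
        for field, value in records[rec_no].items():
            # Identical values are merged; their record numbers are collected.
            index[field][value].append(rec_no)
    return {field: dict(values) for field, values in index.items()}

# Hypothetical training sample: record number -> {field: value}.
sample = {
    1: {"name": "Li Lei", "phone": "555-0101"},
    2: {"name": "Li Lei", "phone": "555-0102"},
    3: {"name": "Han Mei", "phone": "555-0101"},
}
value_index = index_values(sample)
```

Every distinct value ends up with at least one record number; values with two or more record numbers are the ones that later produce combination pairs.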

Step b3, processing the record numbers corresponding to each distinct value of each field, and generating the model training set;

The model training set consists of every pair of records that has duplicated fields, together with the marks of those duplicated fields. Using the record numbers counted above for each distinct value of each field: if a value corresponds to two record numbers, the two records form one combination pair, and a field-duplication mark is added at that field of the combination pair; if a value corresponds to three or more record numbers, the record numbers are combined two by two, each combination forming one combination pair, and a field-duplication mark is added at the corresponding field of each pair. Identical combination pairs are merged, the field-duplication marks of the merged pair being the sum (union) of the marks before merging, and the model training set is finally generated.

If a value corresponds to two record numbers, one combination pair is obtained; if it corresponds to three, combining the three records two by two gives three combination pairs; if it corresponds to four, six combination pairs; and if it corresponds to N record numbers, combining the N records two by two gives $C_N^2 = N(N-1)/2$ combination pairs.

Once the model training set has been generated, analysis of records is converted into analysis of the identical fields of records, which speeds up subsequent processing.

Fig. 3 is a flowchart of step b3 of the data quality detection method for duplicated data of the present invention. Step b3 may specifically be carried out as follows:

Step b31, collecting the values of field one that correspond to exactly two records; the two records corresponding to each such value form one combination pair, which is recorded with a field-duplication mark added at field one;

Step b32, collecting the values of field one that correspond to three or more records; the records corresponding to each such value are combined two by two, each combination being one combination pair, which is recorded with a field-duplication mark added at field one;

Step b33, collecting the values of field two that correspond to two or more records; the records corresponding to each such value are combined two by two, each combination being one combination pair; if the combination pair is identical to a combination pair already recorded, a field-duplication mark is added at field two of the recorded combination pair; if it differs from every recorded combination pair, it is recorded with a field-duplication mark added at field two;

Step b34, processing the other fields as in step b33; all the combination pairs thus formed constitute the model training set.

Steps b31 to b34 are only one way of generating the model training set; this way generates the model training set quickly while avoiding omitting or repeating any combination pair.
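Steps b31 to b34 can be sketched in a few lines of Python. This is an illustrative reading rather than the patented code: the name `build_training_set` and the representation of a field-duplication mark as membership in a set of field names are assumptions, and the input is a hypothetical per-field map from distinct values to record numbers of the kind produced by step b2.

```python
from itertools import combinations

def build_training_set(value_index):
    """Return {(rec_a, rec_b): set of fields on which the two records share a value}."""
    pairs = {}
    for field, values in value_index.items():
        for rec_nos in values.values():
            # A value held by N records yields N*(N-1)/2 pairs; a single record yields none.
            for pair in combinations(sorted(rec_nos), 2):
                # An already recorded pair only gains a mark, so no pair is repeated.
                pairs.setdefault(pair, set()).add(field)
    return pairs

# Hypothetical output of step b2: field -> {distinct value: record numbers}.
value_index = {
    "name": {"Li Lei": [1, 2, 4], "Han Mei": [3]},
    "phone": {"555-0101": [1, 3], "555-0102": [2, 4]},
}
training_set = build_training_set(value_index)
```

Keying the dictionary on the sorted pair of record numbers is what merges identical combination pairs, so records 2 and 4 appear once with marks on both fields.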

Step c, analyzing each combination pair in the model training set, and marking the two records corresponding to the combination pair as record duplication or record non-duplication, manually or by an algorithm; then choosing whether to continue training: if so, redetermining the training sample and returning to step b; otherwise, proceeding to step d;

Each combination pair in the model training set corresponds to two records. The combination pair is output and the actual data of the two records are compared to confirm whether they are the same; if they are, the pair is marked as record duplication, and if not, as record non-duplication. Whether the two records of a combination pair are duplicates may be judged by the quality analyst by inspecting the concrete data of the two records, or may be determined by calculating the similarity of the two records with an algorithm.

Whether training needs to continue or be repeated can then be decided from the comparison results of the output combination pairs. If it does, the training sample is redetermined and the method returns to step b, after which it is determined whether the two records of each new combination pair are duplicates; subsequent analysis combines the results of the several rounds of training to improve the accuracy of judgement. If it does not, step d is carried out.

Fig. 4 is a flowchart of step d of the data quality detection method for duplicated data of the present invention. Step d comprises:

Step d1, for a given field, taking the number of combination pairs that carry the field-duplication mark at that field as the divisor, taking the number of those combination pairs that are also marked as record duplication as the dividend, and taking the quotient as the probability that records are duplicated given that the field is duplicated;

First, for each field, the number x of combination pairs marked with field duplication at that field is counted, and among them the number y of combination pairs marked as record duplication; the value y/x is then calculated for each field and is interpreted as the probability that records identical in this field are duplicates.

Each combination pair in the model training set has multiple fields (field one, field two, field three, field four, and so on), and at least one field of each combination pair is duplicated. Likewise, each field has multiple combination pairs, and is duplicated in at least one of them.

Each combination pair corresponds to two records and carries either a record-duplication mark or a record-non-duplication mark. Each combination pair therefore has at least one duplicated field and, at the same time, one record-duplication or record-non-duplication mark.

Thus each field has a number of combination pairs marked with field duplication at that field, and some of these pairs are marked as record duplication. Dividing the latter number by the former gives the likelihood (probability) that records are duplicated given that the field is duplicated.

For example, if x of all the combination pairs are duplicated in field one, and y of those x pairs are marked as record duplication, then the probability that records are duplicated given that field one is duplicated is y/x.
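The y/x calculation of step d1 can be illustrated as follows. The function name `single_field_probabilities`, the data layout and the labels are hypothetical; the sketch assumes each combination pair carries a set of field-duplication marks and a Boolean record-duplication mark from step c.

```python
def single_field_probabilities(training_set, record_labels):
    """For each field, y/x: pairs marked record-duplicate over pairs duplicated in that field."""
    marked = {}      # x: pairs carrying the field-duplication mark at the field
    duplicated = {}  # y: those of the x pairs that are marked as record duplication
    for pair, fields in training_set.items():
        for field in fields:
            marked[field] = marked.get(field, 0) + 1
            if record_labels[pair]:
                duplicated[field] = duplicated.get(field, 0) + 1
    return {field: duplicated.get(field, 0) / x for field, x in marked.items()}

# Hypothetical training set and step c labels (True = record duplication).
training_set = {
    (1, 2): {"name"},
    (1, 4): {"name"},
    (2, 4): {"name", "phone"},
    (1, 3): {"phone"},
}
record_labels = {(1, 2): True, (1, 4): False, (2, 4): True, (1, 3): False}
probs = single_field_probabilities(training_set, record_labels)
```

Here three pairs are duplicated in the name field and two of them are record duplicates, so the name probability is 2/3; the phone field gives 1/2.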

Step d2, calculating, from the single-field probabilities, the probability that records are duplicated given that multiple fields are duplicated;

The probability that records are duplicated given that multiple fields are duplicated is calculated as:

$$p(1, 2, \ldots, k) = \sum_{i} p_{i} + (-1)^{1} \sum_{i_1 < i_2} p_{i_1} p_{i_2} + \cdots + (-1)^{k-1} \sum_{i_1 < i_2 < \cdots < i_k} p_{i_1} p_{i_2} \cdots p_{i_k}$$

where $p(1, 2, \ldots, k)$ is the probability that records are duplicated given that fields 1, 2, ..., k are duplicated; that is, if fields 1, 2, ..., k of two records are duplicated, the likelihood that the two records are duplicates is $p(1, 2, \ldots, k)$. $p_{i}$, $p_{i_1}$, $p_{i_2}$ and $p_{i_k}$ are the probabilities that records are duplicated given that field $i$, $i_1$, $i_2$ or $i_k$, respectively, is duplicated; $i_1$ and $i_2$ denote the sequence numbers of any two of the fields 1, 2, ..., k; and $i_k$ denotes the sequence number of the k-th of the fields 1, 2, ..., k.

The idea behind the formula is as follows. From the k fields whose probability is to be calculated, take one field: there are k ways of doing so, and the value for each way is the single probability $p_{i}$. Take two fields: there are $C_k^2$ ways, and the value for each way is the product of two probabilities, $p_{i_1} p_{i_2}$. Take k fields: there are $C_k^k$ ways, and the value for each way is the product of k probabilities, $p_{i_1} p_{i_2} \cdots p_{i_k}$. The coefficient of the sum of the values for each number of fields taken is determined by that number: +1 if an odd number of fields is taken, and -1 if an even number. Adding up these sums with their coefficients gives the final probability that records are duplicated given that the k fields are duplicated.

When k is 2,

$$p(1,2) = p_1 + p_2 - p_1 p_2$$

Fig. 5 is probability calculation schematic diagram one of the data quality detection method for duplicated data of the present invention. In it, $p_1 p_2$ is the overlapping region of circles $p_1$ and $p_2$, which must be subtracted to obtain the total area $p(1,2)$.

When k is 3,

$$p(1,2,3) = p_1 + p_2 + p_3 - p_1 p_2 - p_1 p_3 - p_2 p_3 + p_1 p_2 p_3$$

Fig. 6 is probability calculation schematic diagram two of the data quality detection method for duplicated data of the present invention. In it, $p_1 p_2$ is the overlapping region of circles $p_1$ and $p_2$, $p_1 p_3$ is the overlapping region of circles $p_1$ and $p_3$, and $p_2 p_3$ is the overlapping region of circles $p_2$ and $p_3$; these must be subtracted. $p_1 p_2 p_3$ is the region where circles $p_1$, $p_2$ and $p_3$ all overlap; it has been subtracted repeatedly and must be added back to obtain the total area $p(1,2,3)$.
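The alternating sum can be checked numerically. The Python sketch below evaluates the formula term by term and compares it with the closed form 1 - (1 - p_1)(1 - p_2)...(1 - p_k), which is algebraically identical to the alternating sum (expanding the product gives exactly these terms). The function names and the probability values are illustrative assumptions.

```python
from itertools import combinations
from math import prod

def multi_field_probability(probs):
    """Alternating sum over all non-empty subsets of the single-field probabilities."""
    total = 0.0
    for m in range(1, len(probs) + 1):
        sign = 1 if m % 2 == 1 else -1  # +1 for an odd number of fields, -1 for an even number
        total += sign * sum(prod(subset) for subset in combinations(probs, m))
    return total

def multi_field_probability_closed(probs):
    """Equivalent closed form: one minus the product of the complements."""
    return 1.0 - prod(1.0 - p for p in probs)

p2 = multi_field_probability([0.5, 0.4])       # 0.5 + 0.4 - 0.5*0.4
p3 = multi_field_probability([0.5, 0.4, 0.2])  # the k = 3 expansion
```

The closed form needs k multiplications instead of a sum over every subset of the k fields, which matters when the number of fields is large.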

Beneficial effect: the formula allows the probability of record duplication given multi-field duplication to be calculated quickly, which speeds up judgement, saves time and improves data quality detection efficiency; the formula is also simple and saves system resources.

Step d3: the multi-field probabilities calculated in step d2 must be screened against a threshold. The threshold may be determined manually according to the actual situation, or may be determined by a computing device through rigorous calculation or derived from statistical comparison of a large amount of data.

The size of the threshold affects the accuracy of the data quality detection of duplicated data: the larger the threshold, the higher the accuracy of the detection.

Suppose the data to be detected has n fields; then 1 < k ≤ n. After the threshold is set, the field combinations whose duplication likelihood is greater than or equal to the threshold are retained. The retained field combinations serve as the sample field combinations for the subsequent duplicate detection.

By means of the formula, judging whether different records are duplicates is converted into calculating a duplication probability, which avoids judging every pair of records separately; only the legitimate combinations need a probability calculation, which greatly improves the efficiency of judgement.
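The screening of step d3 can be sketched by enumerating every field combination with 1 < k ≤ n fields and keeping those whose probability reaches the threshold. In the Python below, `screen_field_combinations`, the field names and the probability values are hypothetical, and the multi-field probability is computed with the closed form 1 - (1 - p_1)...(1 - p_k), which equals the alternating-sum formula of step d2.

```python
from itertools import combinations
from math import prod

def screen_field_combinations(single_probs, threshold):
    """Keep field combinations (k >= 2) whose record duplication probability >= threshold."""
    fields = sorted(single_probs)
    kept = {}
    for k in range(2, len(fields) + 1):
        for combo in combinations(fields, k):
            p = 1.0 - prod(1.0 - single_probs[f] for f in combo)
            if p >= threshold:
                kept[combo] = p
    return kept

# Hypothetical single-field probabilities from step d1.
single_probs = {"name": 0.5, "phone": 0.6, "city": 0.1}
sample_field_combinations = screen_field_combinations(single_probs, threshold=0.75)
```

With these numbers, name and phone together reach 0.8 and are retained, while city paired with either single field stays below 0.75. Enumerating all combinations is exponential in the number of fields, so this sketch suits only a modest field count.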

Step e, analyzing the values of the data to be detected, and outputting the record numbers corresponding to each distinct value of each field;

This step is similar to step b; the only difference is that step b processes the training sample whereas this step processes the data to be detected.

The data to be detected contains multiple records. Each record has a corresponding number, the record number; record numbers are arranged in order and increase successively. Every record is divided into multiple fields: field 1, field 2, field 3, field 4, and so on. Each field therefore has one value in every record, so a field has as many values as there are records (some of these values are identical and some differ), and the numbering of a field's values corresponds to the numbering of the records. For example, the first value of field 1 is field 1 of the first record, so the two are naturally the same value.

Fig. 7 is a flowchart of step e of the data quality detection method for duplicated data of the present invention. Step e comprises:

Step e1, calculating the similarity between values in the same field of the data to be detected, and treating similar values whose similarity meets or exceeds a threshold as identical values;

Here an algorithm is used to calculate the similarity of values in each field that are very close to one another, and the data quality analyst defines a threshold that determines the level of similarity at which such values are treated as identical.

The similarity may be calculated with the Levenshtein algorithm, the longest common subsequence algorithm or a similar algorithm; the specific algorithm can be chosen according to actual needs.
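As one possible reading of step e1, the sketch below computes the Levenshtein edit distance and turns it into a similarity score. The patent names the algorithm but not how a distance becomes a similarity, so the normalization by the longer string's length, the function names and the default threshold of 0.8 are all assumptions made for illustration.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Assumed normalization: 1 minus distance divided by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def same_value(a, b, threshold=0.8):
    """Treat two field values as identical when their similarity reaches the threshold."""
    return similarity(a, b) >= threshold
```

For example, `same_value("Zhang San", "Zhang Sam")` holds under this normalization because only one of nine characters differs, whereas `same_value("Zhang San", "Li Si")` does not.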

Step e2, analyzing the data values of the data to be detected, and counting the record numbers corresponding to each distinct value of each field;

Step f, carrying out duplicate detection on the analyzed data to be detected according to the sample field combinations, and screening out the record combinations whose duplicated fields all satisfy the sample field combinations;

This step carries out the duplicate detection. First, using the analysis result of step d, it checks whether the duplicated fields of two records satisfy the sample field combinations. Then three-record combinations are generated from the qualifying two-record combinations, and it checks whether the duplicated fields of the three records satisfy the sample field combinations. This process is repeated until no record combination satisfying the sample field combinations can be found.

Thus, whereas general duplicate detection methods must test every pair of records, the present method calculates the duplication likelihood of field combinations and converts detection between records into the detection of record combinations that are identical on the corresponding field combinations, so the duplication likelihood of every two records need not be compared, which shortens the time and improves detection efficiency. The method is also not limited to detecting two fully identical records: it can detect two records that are partly identical, calculating their duplication probability and deciding against a threshold whether they are duplicates. The data quality analyst can define the condition under which two records are judged identical.

In addition, through the choice of training sample the method automatically weights different fields, which provides a degree of flexibility.

As shown in Figure 8, it is the process flow diagram of step f in the data quality checking method of repeating data of the present invention; Wherein, described step f comprises:

In general, the number of record combinations that share duplicated fields decreases as the number of duplicated fields increases. It is therefore necessary to determine the minimum value N of the number of fields among the combinations in the sample field combinations; record combinations with fewer than N duplicated fields then need not be searched, which reduces the number of record combinations to be searched and improves search efficiency.

For example, if the smallest combination in the sample field combinations has 4 duplicated fields, only record combinations with at least 4 duplicated fields need to be searched, which improves search efficiency.

Step f2: search the data to be detected for two-record combinations with at least N identical fields, then check them and retain those record combinations whose identical fields are in the sample field combinations;

N is the minimum number of fields among the combinations in the sample field combinations. If the number of identical fields of a record combination in the data to be detected is less than N, the identical fields of that record combination cannot be in the sample field combinations. Searching only for record combinations with at least N identical fields therefore reduces search time and improves search efficiency.
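The two-record search of step f2 can be sketched as follows. Pairs are generated only from records that share a value in some field, so records with nothing in common are never compared. The data layout, the function names and the representation of sample field combinations as frozensets of field names are assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def pairs_with_min_shared_fields(records, min_fields):
    """Return {(r1, r2): set of identical fields} for pairs sharing >= min_fields."""
    by_value = defaultdict(list)              # (field, value) -> record numbers
    for rec_no in sorted(records):
        for field, value in records[rec_no].items():
            by_value[(field, value)].append(rec_no)

    shared = defaultdict(set)                 # (r1, r2) -> identical fields
    for (field, _value), rec_nos in by_value.items():
        for pair in combinations(rec_nos, 2):
            shared[pair].add(field)
    return {p: f for p, f in shared.items() if len(f) >= min_fields}


def retain_sample_matches(pairs, sample_field_combinations):
    """Keep pairs whose set of identical fields is a sample field combination."""
    return {p: f for p, f in pairs.items()
            if frozenset(f) in sample_field_combinations}


records = {
    1: {"Col1": "a", "Col2": "b", "Col3": "c"},
    2: {"Col1": "a", "Col2": "b", "Col3": "x"},
    3: {"Col1": "z", "Col2": "b", "Col3": "x"},
}
pairs = pairs_with_min_shared_fields(records, min_fields=2)
kept = retain_sample_matches(pairs, {frozenset({"Col1", "Col2"})})
```

In this hypothetical data set the pair (1,3) shares only one field and is never reported, while (2,3) is found but discarded because its identical fields are not a sample field combination.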

In this step, the n-record combinations with at least N identical fields are searched for on the basis of the known (n-1)-record combinations with at least N identical fields. The conditions that must be met are:

1) an n-record combination is formed by combining two (n-1)-record combinations that have n-2 records in common;

2) every (n-1)-record subset of the newly formed n-record combination is among the (n-1)-record combinations with at least N identical fields.

Condition 1) states that an n-record combination must be formed from two combinations that each contain n-1 records and share n-2 records. For example, a 4-record combination must be formed from two 3-record combinations that have 2 records in common, so that the resulting combination contains exactly 4 records.

Condition 2) states that the resulting n-record combination has n subsets of n-1 records, and each of these subsets must be found among the earlier (n-1)-record combinations. That is, an n-record combination can be formed only if all n of its (n-1)-record subsets exist among the (n-1)-record combinations. For example, the 4-record combination (1,2,3,4) has four 3-record subsets, (1,2,3), (1,2,4), (1,3,4) and (2,3,4), and all four must be found among the earlier 3-record combinations.

Only when every 3-record combination contained in a 4-record combination has at least N identical fields can the 4-record combination have at least N identical fields. If one subset (for example (1,2,3)) of a 4-record combination (for example (1,2,3,4)) has only N-1 identical fields (that is, it is not among the 3-record combinations with at least N identical fields), then the 4-record combination (1,2,3,4) cannot have N identical fields; it has at most N-1 identical fields, and those fields are among the N-1 identical fields of the subset (1,2,3). Condition 2) is therefore necessary.

Only when condition 1) and condition 2) both hold is it necessary to check whether the n-record combination has at least N identical fields.
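Conditions 1) and 2) can be sketched as a candidate-generation function. Record combinations are represented as frozensets of record numbers; this representation and the function name are assumptions made for this example.

```python
from itertools import combinations


def candidate_combinations(prev_level):
    """Build n-record candidates from known (n-1)-record combinations.

    `prev_level` is a collection of frozensets that all have n-1 records.
    """
    prev = set(prev_level)
    candidates = set()
    for a, b in combinations(prev, 2):
        union = a | b
        # Condition 1: the two combinations share n-2 records, so their
        # union has exactly one record more than either of them.
        if len(union) != len(a) + 1:
            continue
        # Condition 2: every (n-1)-record subset of the union is known.
        if all(union - {r} in prev for r in union):
            candidates.add(union)
    return candidates


level2 = {frozenset(p) for p in [(1, 2), (1, 3), (2, 3), (3, 4)]}
level3 = candidate_combinations(level2)
```

Here only {1, 2, 3} survives: {1, 3, 4} and {2, 3, 4} satisfy condition 1) but fail condition 2) because (1,4) and (2,4) are not known combinations. Each surviving candidate still has to be checked for at least N identical fields.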

Step f4: check the n-record combinations and retain those whose identical fields are in the sample field combinations; at the same time, delete from the retained (n-1)-record combinations every (n-1)-record subset of each retained n-record combination; then return to step f3.

Each n-record combination found in step f3 is checked. If the combination of its identical fields is in the sample field combinations, the n records are regarded as identical and the record combination is retained; otherwise the n records are regarded as not identical and the record combination is deleted.

In addition, stating that an n-record combination is identical and stating that each of its n subsets of n-1 records is identical express the same meaning, namely that these n records are duplicates. Only one expression of this meaning needs to be kept: the n record combinations of n-1 records have together been merged into one record combination of n records. Therefore, when an n-record combination is retained, its n corresponding record combinations of n-1 records must be deleted.

For example, if the identical fields of the 4-record combination (1,2,3,4) are in the sample field combinations, the combination (1,2,3,4) is retained and its 4 corresponding 3-record combinations (1,2,3), (1,2,4), (1,3,4) and (2,3,4) are deleted.
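The retain-and-delete rule of step f4 can be sketched as follows. The callable `identical_fields_of`, which returns the fields that are identical across all records of a combination, is a hypothetical helper introduced for this example, as are the other names.

```python
def retain_and_prune(candidates, identical_fields_of,
                     sample_field_combinations, prev_level):
    """Keep qualifying n-record combinations and drop their (n-1)-record subsets.

    Returns (kept n-record combinations, remaining (n-1)-record combinations).
    """
    kept = set()
    remaining_prev = set(prev_level)
    for combo in candidates:
        if identical_fields_of(combo) in sample_field_combinations:
            kept.add(combo)
            # The n-record combination now stands for all of its subsets.
            for r in combo:
                remaining_prev.discard(combo - {r})
    return kept, remaining_prev


fields_123 = frozenset({"Col1", "Col2", "Col3"})
prev3 = {frozenset(c) for c in [(1, 2, 3), (1, 2, 4), (1, 3, 4),
                                (2, 3, 4), (5, 6, 7)]}
kept4, left3 = retain_and_prune(
    candidates={frozenset({1, 2, 3, 4})},
    identical_fields_of=lambda combo: fields_123,   # stub for the example
    sample_field_combinations={fields_123},
    prev_level=prev3,
)
```

After the call, (1,2,3,4) is retained, its four 3-record subsets are gone, and the unrelated combination (5,6,7) remains as a result in its own right.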

Through steps f1 to f4, all possible field combinations are covered by step-by-step calculation, which avoids possible omission of records.

Embodiment one

This embodiment is the data quality detection method for duplicate data described above, with the following difference. Figure 9 is a flow chart of embodiment one of the data quality detection method for duplicate data of the present invention. The data quality detection method further comprises:

The output in this step can take different forms: it can be presented as a visual graphic, or the detection result can be output in a form that makes it easy to merge records. All of the retained record combinations and their duplication probabilities can be output, or only part of the retained record combinations and their duplication probabilities.

Embodiment two

This embodiment is the data quality detection method for duplicate data described above, with the following difference. Figure 10 is a flow chart of embodiment two of the data quality detection method for duplicate data of the present invention. Step b further comprises:

Step b1: calculate the similarity between values in the same field of the training sample, and treat similar values whose similarity meets or exceeds a threshold as identical values; step b1 precedes step b2.

Data in the training sample may vary slightly because of errors, so that the values of the same field in two records are very similar but not identical. Adding this step eliminates such errors and improves the accuracy of duplicate-data judgment.

Embodiment three

This embodiment is the data quality detection method for duplicate data described above, with the following difference. Figure 11 is a flow chart of embodiment three of the data quality detection method for duplicate data of the present invention. The data quality detection method further comprises:

Step a: extract a training sample from the source of the data to be detected; step a precedes step b;

The source of the data to be detected contains many records. Each record has a corresponding number, the record number; record numbers are arranged in order and increase successively. Every record is divided into multiple fields: field 1, field 2, field 3, field 4, and so on. Each field therefore has one value in every record, so a field has as many values as there are records (some of these values are identical and some differ), and the index of each value of a field corresponds to the record number. For example, the first value of field 1 is the field 1 of the first record, so the two are necessarily the same value.

The number of records to extract can be decided by the data quality analyst or determined according to actual needs.

Because the training sample is extracted from the source of the data to be detected, the training sample and the data to be detected share the same origin, which improves the accuracy of duplicate-data judgment.

Embodiment four

This embodiment applies the data quality detection method for duplicate data described above to concrete data, as follows:

S1: Part of the data to be detected in this example is shown in Figure 12. A training sample is extracted from the source of the data to be detected; the number of records in the sample is predefined by the data quality analyst and is assumed here to be 1000.

S2: Analyze the data values of the training sample and output the record numbers corresponding to each distinct value of each field; partial results are shown in Figure 13.

S2.1: Some values within a field may be very close, differing in only a few characters, such as 1aaaa and 1aaab in Col1. An algorithm can be used to calculate the similarity of such values, and the data quality analyst sets a threshold for judging whether they are identical. Here 1aaaa and 1aaab are assumed to be judged identical.

S2.2: Analyze the above results and output combination pairs from every value that has 2 or more record numbers; partial results are shown in Figure 14. The process is as follows:

S2.2.1: In Col1, the three values 1aaaa/1aaab, 1bbbb and 1eeee each have 2 records, which form 3 combination pairs: (1,2), (3,5) and (6,7). Each combination pair is stored as one record, with its Col1 duplicate mark set to 1 and its remaining marks set to 0.

S2.2.2: In Col2, 2aaaa and 2eeee form two combination pairs, (1,2) and (6,7). Both pairs are already among the previously formed combination pairs, so the Col2 duplicate mark of each existing pair is set to 1 to show that these pairs are duplicated in Col2. 2bbbb forms three combination pairs, (3,4), (3,5) and (4,5). Of these, (3,5) is among the previously formed pairs and is handled as above; (3,4) and (4,5) are newly produced pairs, each stored as a new record with its Col2 duplicate mark set to 1 and its remaining marks set to 0.

S2.2.3: Col3 to Col5 are processed in the same way. All of the combination pairs formed make up the duplicate-model training set.
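The pair-building procedure of S2.2 can be sketched as follows, using the Col1 and Col2 values named in S2.2.1 and S2.2.2. The input layout (the per-field statistics of S2, with 1aaaa and 1aaab already merged under one key) and the function name are assumptions made for this example.

```python
from itertools import combinations


def build_training_set(stats, fields):
    """Return {(r1, r2): {field: 0 or 1}} from per-field value statistics.

    `stats` is assumed to look like {field: {value: [record numbers]}}.
    """
    training = {}
    for field in fields:
        for rec_nos in stats.get(field, {}).values():
            # Values with fewer than two records produce no pairs.
            for pair in combinations(sorted(rec_nos), 2):
                # A new pair starts with all marks 0; an existing pair
                # only has the mark of the current field set to 1.
                marks = training.setdefault(pair, {f: 0 for f in fields})
                marks[field] = 1
    return training


stats = {
    "Col1": {"1aaaa": [1, 2], "1bbbb": [3, 5], "1eeee": [6, 7]},
    "Col2": {"2aaaa": [1, 2], "2bbbb": [3, 4, 5], "2eeee": [6, 7]},
}
training_set = build_training_set(stats, ["Col1", "Col2"])
```

The result contains the five pairs (1,2), (3,5), (6,7), (3,4) and (4,5); the last two carry a mark only in Col2.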

S3: Output items of the above training set at random, while the data quality analyst begins the model training process. The training method is as follows: a number of combination pairs are output together with their corresponding data, and the data quality analyst labels each pair as duplicate or not duplicate according to the content of its records.

S4: After labeling the output combination pairs, the data quality analyst can choose whether to continue training the model. If so, the process returns to S1 and is repeated; if not, the following steps are carried out.

S5: Process the combination pairs labeled over the several rounds of model training. Suppose part of the labeled combination pairs are as shown in Figure 15, where "whether duplicate" indicates whether the pair was finally labeled as duplicate records and the remaining fields have the same meaning as in Figure 14.

S5.1: First, for each field, calculate the number x of labeled combination pairs in which that field is duplicated, and the number y of those pairs that were labeled as duplicates. In Figure 15, for example, 3 combination pairs are duplicated in Col1 and all 3 are labeled as duplicates; 7 combination pairs are duplicated in Col4 and 3 of them are labeled as duplicates; and so on. Then calculate y/x for each field, which can be read as the probability that two records with identical values in field k are duplicate records. Suppose the final y/x values for Col1 to Col5 are calculated to be 0.4, 0.4, 0.4, 0.3 and 0.3 respectively. (Because the data in the figure are only part of the training sample, accurate values cannot be calculated from the figure; these values are assumed only so that the subsequent steps can proceed, and the final result may therefore differ considerably from the correct result.)

S5.2: The data quality analyst sets a threshold that defines how high the duplication probability must be for records to be judged duplicates; this threshold is assumed to be 0.75. Then, for each combination of k identical fields, calculate the probability that records sharing those fields are the same record, compare this value with the threshold, and keep the field combinations whose value is higher than the threshold, as shown in Figure 16.
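S5.1 and S5.2 can be sketched together as follows. The combined probability uses the closed form 1 - (1 - p_1)(1 - p_2)...(1 - p_k), which is algebraically equal to the inclusion-exclusion formula given for the multi-field calculation in this description. The function names, the data layout and the use of a greater-than-or-equal comparison are assumptions made for this example; the per-field values are the assumed ones from S5.1, not values read from the figures.

```python
from itertools import combinations
from math import prod


def field_probabilities(labelled_pairs, fields):
    """Return y/x per field from (marks, is_duplicate) training pairs."""
    probs = {}
    for field in fields:
        x = sum(1 for marks, _ in labelled_pairs if marks[field])
        y = sum(1 for marks, dup in labelled_pairs if marks[field] and dup)
        if x:  # skip fields that are never duplicated in the labelled pairs
            probs[field] = y / x
    return probs


def screen_field_combinations(probs, threshold):
    """Keep every field combination whose combined probability reaches the threshold."""
    kept = {}
    for size in range(1, len(probs) + 1):
        for combo in combinations(sorted(probs), size):
            p = 1.0 - prod(1.0 - probs[f] for f in combo)
            if p >= threshold:
                kept[combo] = p
    return kept


assumed = {"Col1": 0.4, "Col2": 0.4, "Col3": 0.4, "Col4": 0.3, "Col5": 0.3}
sample_field_combinations = screen_field_combinations(assumed, 0.75)
min_fields = min(len(c) for c in sample_field_combinations)
```

With these assumed values, no two-field combination reaches 0.75 (the best is 0.64), the only three-field combination kept is (Col1, Col2, Col3) at 0.784, and every combination of four or five fields is kept, so the smallest retained combination has three fields.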

The above is the duplicate-model training process. The trained model is next used to perform duplicate detection.

S6: Receive the data to be detected and the field combinations finally retained. Then analyze the data values of the data to be detected and output the record numbers corresponding to each distinct value of each field; partial results are shown in Figure 13. Some values within a field may be very close, differing in only a few characters, such as 1aaaa and 1aaab in Col1. An algorithm can be used to calculate the similarity of such values, and the data quality analyst sets a threshold for judging whether they are identical. Here 1aaaa and 1aaab are assumed to be judged identical.

S7: Perform duplicate detection. The detailed process is as follows:

S7.1: Because the field combinations finally retained contain at least three fields, records finally detected as duplicates must also have at least three identical fields. First, search for two-record combinations with at least three identical fields. The result is {(1,2), 3}, {(3,5), 4}, {(6,7), 5}, {(3,4), 4}, {(4,5), 3}, where the number inside the braces and outside the parentheses indicates how many fields are identical.

S7.2: Check whether the combination of identical fields of each record combination above is among the field combinations finally retained by decision condition generation unit 14; if not, delete that record combination. Here {(4,5), 3} is deleted.

S7.3: Among the remaining record combinations, search for n-record combinations with at least three identical fields on the basis of the (n-1)-record combinations with at least three identical fields known from the previous step. Then check whether these new combinations have at least three identical fields.

S7.4: Check whether the combination of identical fields of each n-record combination above is among the field combinations finally retained; if not, delete that record combination. If it is, not only retain the combination but also delete each of its (n-1)-record subsets from the (n-1)-record combinations with at least three identical fields obtained in the previous step.

S7.5: When no n-record combination with at least three identical fields can be found, the detection process ends; otherwise return to S7.3.

In this embodiment, the detection process ends at S7.2.

S8: Output the detection result, either as a visual graphic or in a form that makes it easy to merge records.

S8.1: All retained combinations of 3 or more records with at least three identical fields can be output, together with the probability that each combination consists of duplicate records and the probability that each pair of records within the combination is duplicated.

S8.2: The 2-record combinations with at least three identical fields that were retained in S7 and not output in S8.1 can be output, together with the probability that each combination consists of duplicate records.

In this embodiment, for example, the record contents of the combination pairs (1,2), (3,5), (6,7) and (3,4) are output together with the probability that these records are duplicates. (This result is obtained from assumed intermediate data and may therefore differ considerably from the actual correct result.)

Embodiment five

This embodiment is the data quality detection device for duplicate data corresponding to the data quality detection method for duplicate data described above.

Figure 17 is a structural diagram of the data quality detection device for duplicate data of the present invention. The data quality detection device for duplicate data comprises:

Training set generation unit 2: analyzes the data values of the training sample and generates the model training set;

Figure 18 is a structural diagram of the training set generation unit of the data quality detection device for duplicate data of the present invention. Training set generation unit 2 comprises:

Record number statistical module 22: analyzes the data values of the training sample and collects the record numbers corresponding to each distinct value of each field;

Record number processing module 23: processes the record numbers corresponding to each distinct value of each field and generates the model training set;

Generating the model training set converts the analysis of records into an analysis of the identical fields of records, which speeds up subsequent processing.

Figure 19 is a structural diagram of the record number processing module of the data quality detection device for duplicate data of the present invention. Record number processing module 23 comprises:

Field-one two-value duplicate-marking submodule 231: collects the values of field one that correspond to two records; the two records of each such value form one combination pair, which is recorded with a field duplicate mark added in field one;

Field-one multi-value duplicate-marking submodule 232: collects the values of field one that correspond to three or more records; the records of each such value are combined two by two into combination pairs, each of which is recorded with a field duplicate mark added in field one;

Field-two duplicate-marking submodule 233: collects the values of field two that correspond to two or more records; the records of each such value are combined two by two into combination pairs. If a pair is identical to an already recorded combination pair, a field duplicate mark is added in field two of the recorded pair; if it differs from all recorded pairs, it is recorded with a field duplicate mark added in field two;

Multi-field duplicate-marking submodule 234: processes the other fields in the same way as field-two duplicate-marking submodule 233; all of the combination pairs formed make up the model training set.

Submodules 231 to 234 are only one possible arrangement for generating the model training set. This arrangement generates the model training set quickly while avoiding the omission or repetition of any combination pair.

Sample record duplicate-marking unit 3: analyzes each combination pair of the model training set, and the two records of each pair are marked, manually or by an algorithm, as record duplication or record non-duplication. It is then decided whether to continue training: if so, the training sample is redetermined and processing returns to training set generation unit 2; otherwise processing proceeds to sample combination screening unit 4.

Whether training needs to be continued or repeated can be decided from the comparison of the output combination pairs. If it is needed, the training sample is redetermined and processing returns to training set generation unit 2, after which it is determined whether the two records of every new combination pair are duplicates; the results of the several rounds of training are combined in the subsequent analysis to improve the accuracy of judgment. If it is not needed, processing proceeds to sample combination screening unit 4.

Sample combination screening unit 4: calculates the probability that records are duplicates given that one or more fields are duplicated, and screens out the field combinations with relatively high probability as the sample field combinations;

Figure 20 is a structural diagram of the sample combination screening unit of the data quality detection device for duplicate data of the present invention. Sample combination screening unit 4 comprises:

Single-field duplication calculation module 41: takes the number of combination pairs in which a given field carries the field duplicate mark as the divisor, and the number of those pairs that are also marked as record duplication as the dividend; the quotient is the probability that records are duplicates given that this field is duplicated;

Multi-field duplication calculation module 42: from the single-field probabilities, calculates the probability that records are duplicates given that multiple fields are duplicated;

$$p(1,2,\ldots,k)=\sum_{i} p_{i}+(-1)^{1}\sum_{i_1<i_2} p_{i_1}p_{i_2}+\cdots+(-1)^{k-1}\sum_{i_1<i_2<\cdots<i_k} p_{i_1}p_{i_2}\cdots p_{i_k}$$

In the formula, p(1,2,...,k) is the probability that records are duplicates given that fields 1, 2, ..., k are duplicated; that is, if fields 1, 2, ..., k are identical in two records, the probability that the two records are duplicates is p(1,2,...,k). p_i, p_{i_1}, p_{i_2} and p_{i_k} are the probabilities that records are duplicates given that field i, i_1, i_2 or i_k respectively is duplicated. i_1 and i_2 denote the sequence numbers of any two of the fields 1, 2, ..., k; i_k denotes the sequence number of the k-th field among fields 1, 2, ..., k, that is, the sequence number of field k.

Threshold screening module 43: sets a threshold and screens out, as the sample field combinations, the field combinations whose record duplication probability is greater than or equal to the threshold.

A threshold is needed to screen the multi-field record duplication probabilities calculated by multi-field duplication calculation module 42. The threshold can be determined manually according to the actual situation, determined by a computing device through rigorous calculation, or derived from statistical comparison of a large amount of data.
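The formula of multi-field duplication calculation module 42 can be evaluated term by term, as in the following minimal sketch. The function name is an assumption made for this example; the check against 1 - (1 - p_1)...(1 - p_k) relies on the standard inclusion-exclusion identity rather than on anything stated in the description.

```python
from itertools import combinations
from math import prod


def combined_probability(p):
    """Inclusion-exclusion sum over all non-empty subsets of the probabilities p."""
    total = 0.0
    for m in range(1, len(p) + 1):
        sign = (-1) ** (m - 1)          # +, -, +, ... for subset sizes 1, 2, 3, ...
        total += sign * sum(prod(c) for c in combinations(p, m))
    return total


single_field = [0.4, 0.4, 0.4]          # assumed single-field probabilities
p_three = combined_probability(single_field)
closed_form = 1.0 - prod(1.0 - x for x in single_field)
```

For three fields with probability 0.4 each, the terms are 1.2 - 0.48 + 0.064 = 0.784, which matches the closed form. The term-by-term evaluation visits every subset, so the closed form is the cheaper choice when k is large.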

Detection data analysis unit 5: analyzes the values of the data to be detected and outputs the record numbers corresponding to each distinct value of each field;

This unit is similar to training set generation unit 2; the only difference is that training set generation unit 2 processes the training sample, whereas this unit processes the data to be detected.

Figure 21 is a structural diagram of the detection data analysis unit of the data quality detection device for duplicate data of the present invention. Detection data analysis unit 5 comprises:

Data similarity calculation module 51: calculates the similarity between values in the same field of the data to be detected, and treats similar values whose similarity meets or exceeds a threshold as identical values.

Data record statistical module 52: analyzes the data values of the data to be detected and collects the record numbers corresponding to each distinct value of each field;

Detection data screening unit 6: performs duplicate detection on the analyzed data to be detected according to the sample field combinations, and screens out all record combinations whose duplicated fields satisfy the sample field combinations;

This unit performs the duplicate detection. First, using the analysis result of sample combination screening unit 4, it checks whether the duplicated fields of each pair of records satisfy the sample field combinations. The qualifying two-record combinations are then used to generate three-record combinations, and the duplicated fields of those three records are checked against the sample field combinations in the same way. This process is repeated until no further record combination satisfying the sample field combinations can be found.

In this way, compared with general duplicate detection in which every pair of records must be compared, this device calculates the duplication probability of field combinations and converts detection between records into detection of record combinations that are identical on the corresponding field combinations. There is no need to compare the duplication probability of every two records, which shortens the time and improves detection efficiency. Moreover, the device is not limited to detecting two completely identical pieces of data; it can also detect two pieces of data that are partly identical, by calculating their duplication probability and deciding against a threshold whether they are duplicates. In this device, the data quality analyst can define the conditions for judging whether two records are identical.

In addition, through the selection of training samples, this device automatically assigns weights to different fields, which provides a degree of flexibility.

Figure 22 is a structural diagram of the detection data screening unit of the data quality detection device for duplicate data of the present invention. Detection data screening unit 6 comprises:

Field number confirmation module 61: determines the minimum value N of the number of fields among the combinations in the sample field combinations;

Two-record combination detection module 62: searches the data to be detected for two-record combinations with at least N identical fields, then checks them and retains those record combinations whose identical fields are in the sample field combinations;

Multi-record combination search module 63: among the retained record combinations, searches for n-record combinations with at least N identical fields on the basis of the known (n-1)-record combinations with at least N identical fields; if none is found, the process ends;

In this module, the n-record combinations with at least N identical fields are searched for on the basis of the known (n-1)-record combinations with at least N identical fields. The conditions that must be met are:

Multi-record combination detection module 64: checks the n-record combinations and retains those whose identical fields are in the sample field combinations; at the same time, deletes from the retained (n-1)-record combinations every (n-1)-record subset of each retained n-record combination; then returns to multi-record combination search module 63.

Through modules 61 to 64, all possible field combinations are covered by step-by-step calculation, which avoids possible omission of records.

Embodiment six

This embodiment is the data quality detection device for duplicate data described above, with the following difference. Figure 23 is a structural diagram of embodiment six of the data quality detection device for duplicate data of the present invention. The data quality detection device further comprises:

Detection result output unit 7: outputs the retained record combinations and the probability that each record combination is duplicated; detection result output unit 7 follows detection data screening unit 6.

The output of this unit can take different forms: it can be presented as a visual graphic, or the detection result can be output in a form that makes it easy to merge records. All of the retained record combinations and their duplication probabilities can be output, or only part of the retained record combinations and their duplication probabilities.

Embodiment seven

This embodiment is the data quality detection device for duplicate data described above, with the following difference. Figure 24 is a structural diagram of embodiment seven of the data quality detection device for duplicate data of the present invention. Training set generation unit 2 further comprises:

Sample similarity calculation module 21: calculates the similarity between values in the same field of the training sample, and treats similar values whose similarity meets or exceeds a threshold as identical values; sample similarity calculation module 21 precedes record number statistical module 22.

Data in the training sample may vary slightly because of errors, so that the values of the same field in two records are very similar but not identical. Adding this module eliminates such errors and improves the accuracy of duplicate-data judgment.

Embodiment eight

This embodiment is the data quality detection device for duplicate data described above, with the following difference. Figure 25 is a structural diagram of embodiment eight of the data quality detection device for duplicate data of the present invention. The data quality detection device further comprises:

Training sample extraction unit 1: extracts a training sample from the source of the data to be detected;

The foregoing describes only preferred embodiments of the present invention, which are illustrative rather than restrictive. Those skilled in the art will understand that many changes, modifications and even equivalents can be made within the spirit and scope defined by the claims of the present invention, all of which fall within the scope of protection of the present invention.

Claims

1. A data quality detection method for duplicate data, characterized by comprising:

2. The data quality detection method according to claim 1, characterized in that the data quality detection method further comprises:

3. The data quality detection method according to claim 2, characterized in that the data quality detection method further comprises:

4. The data quality detection method according to claim 1, 2 or 3, characterized in that step b comprises:

5. The data quality detection method according to claim 4, characterized in that step b3 comprises:

6. The data quality detection method according to claim 1, 2 or 3, characterized in that step d comprises:

7. The data quality detection method according to claim 6, characterized in that the probability that records are duplicates given that multiple fields are duplicated is calculated by the formula:

$$p(1,2,\ldots,k)=\sum_{i} p_{i}+(-1)^{1}\sum_{i_1<i_2} p_{i_1}p_{i_2}+\cdots+(-1)^{k-1}\sum_{i_1<i_2<\cdots<i_k} p_{i_1}p_{i_2}\cdots p_{i_k}$$

8. The data quality detection method according to claim 1, 2 or 3, characterized in that

step f comprises: step f1, determining the minimum value N of the number of fields among the combinations in the sample field combinations;

9. The data quality detection method according to claim 8, characterized in that, in step f3, the conditions that the search must meet are:

10. A data quality detection device for duplicate data corresponding to the data quality detection method according to any one of claims 1 to 9, characterized by comprising:

the sample combination screening unit, which calculates the probability that records are duplicates given that one or more fields are duplicated, and screens out the field combinations with relatively high probability as the sample field combinations;