CN105488212B

CN105488212B - Data quality checking method and device for duplicate data

Info

Publication number: CN105488212B
Application number: CN201510925893.5A
Authority: CN
Inventors: 许飞月; 李青海; 简宋全; 侯大勇; 邹立斌
Original assignee: Guangdong Fine Point Data Polytron Technologies Inc
Current assignee: Guangdong Fine Point Data Polytron Technologies Inc
Priority date: 2015-12-11
Filing date: 2015-12-11
Publication date: 2019-06-14
Anticipated expiration: 2035-12-11
Also published as: CN105488212A

Abstract

The invention provides a data quality checking method and device for duplicate data. The method comprises: step b, generating a model training set; step c, analyzing each combination pair in the model training set and marking it as record-duplicate or record-not-duplicate; step d, calculating the probability that records are duplicates and selecting the field combinations with higher probability as sample field combinations; step e, analyzing the values of the data to be tested; and step f, performing duplicate detection and selecting the record combinations whose duplicated fields all satisfy the sample field combinations. The device comprises units corresponding to the respective steps: a training set generation unit, a sample record marking unit, a sample combination screening unit, a detection data analysis unit and a detection data screening unit. By calculating the duplication likelihood of field combinations, the method avoids comparing the duplication likelihood of every pair of records, which shortens the time and improves detection efficiency; it can also detect cases where two records are only partially identical.

Description

Data quality checking method and device for duplicate data

Technical field

The present invention relates to the technical field of data quality monitoring, and in particular to a data quality checking method and device for duplicate data.

Background art

The rapid development of information technology has made data one of the most important resources for realizing business value. However, as data volumes keep growing, data quality problems follow. Missing, erroneous and inconsistent data hinder enterprises in using their data, and in serious cases even lead enterprises to make wrong decisions, lose important value and in turn face a crisis of trust.

Many data quality checking and cleaning schemes have emerged to deal with such dirty data. Among data quality problems, duplicate data is comparatively hard to detect, because the duplication that enterprises face today includes not only complete duplication but also partial duplication. For example, a social networking site may have tens of millions of users, some of whom have registered more than once, and the repeated registrations may differ only slightly in certain information. Identifying such duplicate user information is essential for maintaining the quality of the site.

Among the more representative duplicate checking schemes at present, some compute a unique hash code and check code from the content of each record and judge whether data are duplicates according to whether the hash codes and check codes are identical; these are accurate and efficient but apply only when records are completely duplicated. Other schemes train a duplicate detection model based on machine learning; these are flexible, since one method is not limited to a particular duplicate detection scenario, but the duplication likelihood must be computed for every pair of records, so efficiency is low and accuracy still needs improvement.

In view of the above defects, the inventors, after prolonged research and testing, finally propose the data quality checking method and device for duplicate data of the present invention.

Summary of the invention

The object of the present invention is to provide a data quality checking method and device for duplicate data that overcome the above technical defects and solve the problem of how to detect partially duplicated and completely duplicated data accurately and quickly.

To achieve the above object, the technical solution adopted by the present invention is as follows. First, a data quality checking method for duplicate data is provided, comprising:

Step b: analyze the data values of a training sample comprising a plurality of records, and generate a model training set;

Step c: analyze each combination pair in the model training set, and mark the two records of each combination pair, manually or by an algorithm, as record-duplicate or record-not-duplicate; then choose whether to continue training; if so, re-determine the training sample and return to step b; otherwise proceed to step d;

Step d: calculate the probability that records are duplicates given that one or more fields are duplicated, and select the field combinations with higher probability as sample field combinations;

Step e: analyze the values of the data to be tested, and output the record numbers corresponding to each distinct value of each field;

Step f: perform duplicate detection on the analyzed data to be tested according to the sample field combinations, and select the record combinations whose duplicated fields all satisfy the sample field combinations.

Preferably, the data quality checking method further comprises:

Step a: extract the training sample from the source of the data to be tested; step a precedes step b.

Preferably, the data quality checking method further comprises:

Step g: output the retained record combinations and the probability that each record combination is a duplicate; step g follows step f.

Preferably, step b comprises:

Step b2: analyze the data values of the training sample, and count the record numbers corresponding to each distinct value of each field;

Step b3: process the record numbers corresponding to each distinct value of each field, and generate the model training set.

Preferably, step b3 comprises:

Step b31: for field one, count the values that correspond to two records; the two records of each such value form one combination pair; record the combination pair and add a field-duplicate mark at field one;

Step b32: for field one, count the values that correspond to three or more records; the records of each such value are combined two by two into combination pairs; record each combination pair and add the field-duplicate mark at field one;

Step b33: for field two, count the values that correspond to two or more records; the records of each such value are combined two by two into combination pairs; if a combination pair is identical to an already recorded combination pair, add the field-duplicate mark at field two of the recorded combination pair; if it differs from all recorded combination pairs, record it and add the field-duplicate mark at field two;

Step b34: process the remaining fields as in step b33; all combination pairs so formed constitute the model training set.

Preferably, step d comprises:

Step d1: for a given field, take the number of combination pairs marked field-duplicate for that field as the divisor and the number of those combination pairs that are also marked record-duplicate as the dividend; the quotient is the probability that records are duplicates given that the field is duplicated;

Step d2: from the per-field probabilities, calculate the probability that records are duplicates given that multiple fields are duplicated;

Step d3: set a threshold, and select the field combinations whose record-duplicate probability is greater than or equal to the threshold as sample field combinations.

Preferably, the probability that records are duplicates given that multiple fields are duplicated is calculated by the formula:

$$p(1,2,\ldots,k) = \sum_{i=1}^{k} p_i - \sum_{1 \le i < j \le k} p_i p_j + \sum_{1 \le i < j < m \le k} p_i p_j p_m - \cdots + (-1)^{k+1} p_1 p_2 \cdots p_k$$

In the formula, p(1,2,…,k) is the probability that records are duplicates given that fields 1, 2, …, k are duplicated; p₁, p₂, p_i, p_j, p_m, p_k are respectively the probabilities that records are duplicates given that field 1, 2, i, j, m, k is duplicated.

Preferably, step f comprises:

Step f1: determine the minimum value N of the number of fields among the combinations in the sample field combinations;

Step f2: among the two-record combinations of the data to be tested, search for the record combinations having at least N identical fields; check them and retain the record combinations that fall within the sample field combinations;

Step f3: among the retained record combinations, based on the known (n-1)-record combinations having at least N identical fields, search for n-record combinations having at least N identical fields; if none is found, end;

Step f4: check and retain the n-record combinations that fall within the sample field combinations, and at the same time delete from the (n-1)-record combinations all (n-1)-record subsets of the retained n-record combinations; return to step f3.

Preferably, in step f3, the search must satisfy the following conditions:

The n-record combination is formed by combining two (n-1)-record combinations, and the two (n-1)-record combinations have n-2 records in common;

Every (n-1)-record subset of the newly formed n-record combination is among the (n-1)-record combinations having at least N identical fields.

Secondly, a data quality checking device for duplicate data corresponding to the data quality checking method is provided, comprising:

A training set generation unit, which analyzes the data values of a training sample comprising a plurality of records and generates a model training set;

A sample record marking unit, which analyzes each combination pair in the model training set and marks the two records of each combination pair, manually or by an algorithm, as record-duplicate or record-not-duplicate; it then chooses whether to continue training; if so, the training sample is re-determined and control returns to the training set generation unit; otherwise control passes to the sample combination screening unit;

A sample combination screening unit, which calculates the probability that records are duplicates given that one or more fields are duplicated, and selects the field combinations with higher probability as sample field combinations;

A detection data analysis unit, which analyzes the values of the data to be tested and outputs the record numbers corresponding to each distinct value of each field;

A detection data screening unit, which performs duplicate detection on the analyzed data to be tested according to the sample field combinations and selects the record combinations whose duplicated fields all satisfy the sample field combinations.

The beneficial effects of the present invention compared with the prior art are as follows. A data quality checking method and device for duplicate data are provided. Compared with typical duplicate detection methods, in which every pair of records must be checked, the present method calculates the duplication likelihood of field combinations and turns detection between records into detection of groups of identical records within the corresponding field combinations; the duplication likelihood of every pair of records need not be compared, which shortens the time and improves detection efficiency. The method is not limited to detecting two completely identical records; it can also detect two partially identical records, calculating their duplicate probability and deciding according to a threshold whether they are duplicates. The data quality analyst can define the criterion for judging whether two records are identical. Through the choice of training sample, the method automatically assigns weights to different fields, which provides a degree of flexibility. The probability that records are duplicates given that multiple fields are duplicated can be calculated quickly by the formula, which speeds up judgment, saves time and improves data quality checking efficiency; the formula is simple and saves system resources. After the model training set is generated, the analysis of records can be converted into an analysis of the shared fields of records, which speeds up subsequent processing. Errors can be eliminated, improving the accuracy of duplicate judgment. The training sample is extracted from the source of the data to be tested; because the training sample and the data to be tested come from the same source, the accuracy of duplicate judgment can be improved.

Brief description of the drawings

Fig. 1 is a flow chart of the data quality checking method for duplicate data of the present invention;

Fig. 2 is a flow chart of step b of the data quality checking method of the present invention;

Fig. 3 is a flow chart of step b3 of the data quality checking method of the present invention;

Fig. 4 is a flow chart of step d of the data quality checking method of the present invention;

Fig. 5 is probability calculation schematic diagram one of the data quality checking method of the present invention;

Fig. 6 is probability calculation schematic diagram two of the data quality checking method of the present invention;

Fig. 7 is a flow chart of step e of the data quality checking method of the present invention;

Fig. 8 is a flow chart of step f of the data quality checking method of the present invention;

Fig. 9 is a flow chart of embodiment one of the data quality checking method of the present invention;

Fig. 10 is a flow chart of embodiment two of the data quality checking method of the present invention;

Fig. 11 is a flow chart of embodiment three of the data quality checking method of the present invention;

Fig. 12 is a table of part of the data to be tested in an example of the data quality checking method of the present invention;

Fig. 13 is a table of the record numbers corresponding to part of the distinct values in the example;

Fig. 14 is a table of the field-duplicate marks of part of the combination pairs in the example;

Fig. 15 is a table of the record-duplicate marks of part of the combination pairs in the example;

Fig. 16 shows the field combinations retained in the example;

Fig. 17 is a structure diagram of the data quality checking device for duplicate data of the present invention;

Fig. 18 is a structure diagram of the training set generation unit of the data quality checking device;

Fig. 19 is a structure diagram of the record number processing module of the data quality checking device;

Fig. 20 is a structure diagram of the sample combination screening unit of the data quality checking device;

Fig. 21 is a structure diagram of the detection data analysis unit of the data quality checking device;

Fig. 22 is a structure diagram of the detection data screening unit of the data quality checking device;

Fig. 23 is a structure diagram of embodiment six of the data quality checking device;

Fig. 24 is a structure diagram of embodiment seven of the data quality checking device;

Fig. 25 is a structure diagram of embodiment eight of the data quality checking device.

Detailed description of the embodiments

The above and other technical features and advantages of the present invention are described in more detail below with reference to the drawings.

Fig. 1 is a flow chart of the data quality checking method for duplicate data of the present invention. The data quality checking method for duplicate data comprises:

Step b: analyze the data values of the training sample, and generate a model training set;

The training sample contains a plurality of records. Each record has a corresponding number, the record number; record numbers are arranged in order and increase successively. Each record is divided into multiple fields: field 1, field 2, field 3, field 4, and so on. A given field thus has one value in each record, so each field has as many values as there are records (some of these values are identical and some differ), and the position of a value within the field corresponds to the number of its record. For example, the first value of field 1 is field 1 of the first record, so the two values are naturally the same.

The training sample may be written by a data analyst according to the specific situation, or it may be extracted from the data to be tested.

Fig. 2 is a flow chart of step b of the data quality checking method of the present invention. Step b comprises:

Step b2: analyze the data values of the training sample, and count the record numbers corresponding to each distinct value of each field;

The same field has multiple values, some identical and some different. Identical values are merged and their record numbers are merged as well, so that each field has a number of distinct values, each followed by at least one record number;

All fields are counted in this way, giving the record numbers corresponding to each distinct value of each field.
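The counting in step b2 amounts to building an inverted index from field values to record numbers. The following is a minimal sketch under assumptions: the function name `index_values`, the dictionary layout and the three sample records are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def index_values(records):
    """Map each field to {distinct value: [record numbers holding that value]}.

    `records` maps a record number to a dict of field name -> value.
    """
    index = defaultdict(lambda: defaultdict(list))
    for number in sorted(records):  # record numbers are processed in increasing order
        for field, value in records[number].items():
            index[field][value].append(number)
    # Convert nested defaultdicts to plain dicts for easier inspection
    return {field: dict(values) for field, values in index.items()}


# Hypothetical three-record training sample
sample = {
    1: {"name": "Li Lei", "phone": "555-0101"},
    2: {"name": "Li Lei", "phone": "555-0102"},
    3: {"name": "Han Mei", "phone": "555-0101"},
}
value_index = index_values(sample)
```

Each distinct value ends up followed by at least one record number, as the text describes; values shared by two or more records are the ones that later produce combination pairs.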

Step b3: process the record numbers corresponding to each distinct value of each field, and generate the model training set;

The model training set consists of every pair of records that share at least one duplicated field, together with the marks of their duplicated fields. Using the record numbers counted above for each distinct value of each field: if a value corresponds to two record numbers, the two records form one combination pair, and a field-duplicate mark is added at that field of the combination pair; if a value corresponds to three or more record numbers, the record numbers of that value are combined two by two into combination pairs, and a field-duplicate mark is added at the corresponding field of each combination pair. Identical combination pairs are merged, the field-duplicate marks of the merged combination pair being the union of the field-duplicate marks before merging. The result is the model training set.

If a value corresponds to two record numbers, one combination pair is obtained; if it corresponds to three record numbers, combining the three records two by two gives three combination pairs; if it corresponds to four record numbers, combining the four records two by two gives six combination pairs; if it corresponds to N record numbers, combining the N records two by two gives C(N,2) = N(N-1)/2 combination pairs.

After the model training set is generated, the analysis of records can be converted into an analysis of the shared fields of records, which speeds up subsequent processing.

Fig. 3 is a flow chart of step b3 of the data quality checking method of the present invention. The specific steps of step b3 may be:

Step b31: for field one, count the values that correspond to two records; the two records of each such value form one combination pair; record the combination pair and add a field-duplicate mark at field one;

Step b32: for field one, count the values that correspond to three or more records; the records of each such value are combined two by two into combination pairs; record each combination pair and add a field-duplicate mark at field one;

Step b33: for field two, count the values that correspond to two or more records; the records of each such value are combined two by two into combination pairs; if a combination pair is identical to an already recorded combination pair, add a field-duplicate mark at field two of the recorded combination pair; if it differs from all recorded combination pairs, record it and add a field-duplicate mark at field two;

Step b34: process the remaining fields as in step b33; all combination pairs so formed constitute the model training set.

Steps b31 to b34 are only one way of generating the model training set; this way generates the model training set quickly while avoiding omitted or duplicated combination pairs.
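The pair generation of steps b31 to b34 can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the function name `build_training_set`, the use of a set of field names as the field-duplicate marks, and the sample index are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def build_training_set(value_index):
    """Return {(record_a, record_b): set of fields in which the pair is duplicated}.

    `value_index` maps field -> {distinct value: [record numbers]}.
    """
    pairs = defaultdict(set)
    for field, values in value_index.items():
        for numbers in values.values():
            # A value held by N records yields N*(N-1)/2 combination pairs;
            # a value held by a single record yields none.
            for a, b in combinations(sorted(numbers), 2):
                # Re-encountering a pair under another field merges the marks
                pairs[(a, b)].add(field)
    return dict(pairs)


# Hypothetical value index for four records
value_index = {
    "name": {"Li Lei": [1, 2, 4], "Han Mei": [3]},
    "phone": {"555-0101": [1, 3], "555-0102": [2, 4]},
}
training_set = build_training_set(value_index)
```

Keying the dictionary by the sorted record pair is what prevents a combination pair from being stored twice, and adding to the set realizes the merging of field-duplicate marks.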

Step c: analyze each combination pair in the model training set, and mark the two records of each combination pair, manually or by an algorithm, as record-duplicate or record-not-duplicate; then choose whether to continue training; if so, re-determine the training sample and return to step b; otherwise proceed to step d;

Each combination pair in the model training set corresponds to two records. The combination pair is output and the actual data of the two records are compared to confirm whether they are the same; if they are, the pair is marked record-duplicate, and if not, it is marked record-not-duplicate. Whether the two records of a combination pair are duplicates may be judged by the quality analyst by inspecting the specific data of the two records, or determined by computing the similarity of the two with an algorithm.

Whether to continue or repeat training can then be decided from the comparison of the output combination pairs. If so, the training sample is re-determined and the method returns to step b, after which it is determined whether the two records of each new combination pair are duplicates; the subsequent analysis combines the results of the several rounds of training, which improves the accuracy of judgment. If not, step d is performed.

Fig. 4 is a flow chart of step d of the data quality checking method of the present invention. Step d comprises:

Step d1: for a given field, take the number of combination pairs marked field-duplicate for that field as the divisor and the number of those combination pairs that are also marked record-duplicate as the dividend; the quotient is the probability that records are duplicates given that the field is duplicated;

First, over all combination pairs, the number x of combination pairs in which each field is marked field-duplicate and the number y of those pairs that are also marked record-duplicate are counted; then y/x is calculated for each field, interpreted as the probability that records are duplicates given that this field is identical.

Each combination pair in the model training set has multiple fields (field one, field two, field three, field four, and so on) and is duplicated in at least one of them. Conversely, each field has multiple combination pairs and is duplicated in at least one of them.

Each combination pair corresponds to two records and carries either a record-duplicate mark or a record-not-duplicate mark. Thus each combination pair is duplicated in at least one field and carries exactly one record-duplicate or record-not-duplicate mark.

Each field therefore has a number of combination pairs marked field-duplicate for that field, some of which are marked record-duplicate. The latter number divided by the former is the likelihood (probability) that records are duplicates given that the field is duplicated.

For example, if x combination pairs are duplicated in field one and y of these x pairs are marked record-duplicate, the probability that records are duplicates given that field one is duplicated is y/x.
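The y/x calculation of step d1 can be illustrated with a short sketch. The function name `field_duplicate_probability`, the boolean labels and the sample data are hypothetical; the labels stand in for the manual or algorithmic marking of step c.

```python
def field_duplicate_probability(training_set, record_labels):
    """For each field, return y / x.

    x: number of combination pairs marked field-duplicate for the field.
    y: number of those pairs that are also marked record-duplicate.
    """
    marked = {}     # x per field
    confirmed = {}  # y per field
    for pair, fields in training_set.items():
        for field in fields:
            marked[field] = marked.get(field, 0) + 1
            if record_labels[pair]:
                confirmed[field] = confirmed.get(field, 0) + 1
    return {field: confirmed.get(field, 0) / x for field, x in marked.items()}


# Hypothetical training set: pair -> fields in which the pair is duplicated
training_set = {
    (1, 2): {"name"},
    (1, 4): {"name"},
    (2, 4): {"name", "phone"},
    (1, 3): {"phone"},
}
# Hypothetical step-c labels: True means record-duplicate
record_labels = {(1, 2): False, (1, 4): False, (2, 4): True, (1, 3): False}
probabilities = field_duplicate_probability(training_set, record_labels)
```

Here "name" is duplicated in three pairs of which one is a true duplicate, and "phone" in two pairs of which one is a true duplicate, so a shared phone number is the stronger indicator in this toy sample.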

Step d2: from the per-field probabilities, calculate the probability that records are duplicates given that multiple fields are duplicated;

The probability that records are duplicates given that multiple fields are duplicated is calculated by the formula:

$$p(1,2,\ldots,k) = \sum_{i=1}^{k} p_i - \sum_{1 \le i < j \le k} p_i p_j + \sum_{1 \le i < j < m \le k} p_i p_j p_m - \cdots + (-1)^{k+1} p_1 p_2 \cdots p_k$$

In the formula, p(1,2,…,k) is the probability that records are duplicates given that fields 1, 2, …, k are duplicated; that is, if fields 1, 2, …, k are duplicated in two records, the likelihood that the two records are duplicates is p(1,2,…,k). p₁, p₂, p_i, p_j, p_m, p_k are respectively the probabilities that records are duplicates given that field 1, 2, i, j, m, k is duplicated.

The idea of the formula is as follows. From the k fields whose probability is to be calculated, take one field: there are k ways, and each corresponds to a single probability p_i. Take two fields: there are C(k,2) = k(k-1)/2 ways, and each corresponds to the product of two probabilities p_i p_j. And so on, until all k fields are taken: there is C(k,k) = 1 way, corresponding to the product of k probabilities p₁p₂…p_k. The sum of the values for each number of fields taken carries a coefficient determined by that number: +1 if an odd number of fields is taken and -1 if an even number is taken. Adding these signed sums gives the probability that records are duplicates given that the k fields are duplicated.

When k is 2,

p(1,2) = p₁ + p₂ - p₁p₂

Fig. 5 is probability calculation schematic diagram one of the data quality checking method of the present invention. Here p₁p₂ is the overlapping region of circles p₁ and p₂, which must be subtracted to obtain the total area p(1,2).

When k is 3,

p(1,2,3) = p₁ + p₂ + p₃ - p₁p₂ - p₁p₃ - p₂p₃ + p₁p₂p₃

Fig. 6 is probability calculation schematic diagram two of the data quality checking method of the present invention. Here p₁p₂ is the overlapping region of circles p₁ and p₂, p₁p₃ is the overlapping region of circles p₁ and p₃, and p₂p₃ is the overlapping region of circles p₂ and p₃; these must be subtracted. p₁p₂p₃ is the region common to circles p₁, p₂ and p₃; it has been subtracted repeatedly and must be added back to obtain the total area p(1,2,3).

Beneficial effect: the probability that records are duplicates given that multiple fields are duplicated can be calculated quickly by the formula, which speeds up judgment, saves time and improves data quality checking efficiency; the formula is simple and saves system resources.
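The alternating sum described above can be computed directly from the per-field probabilities. The sketch below is illustrative only; the function names are hypothetical. It also includes the product form 1 - (1 - p₁)(1 - p₂)…(1 - p_k), which is algebraically equal to the alternating sum; that identity is a standard observation and is not stated in the patent text.

```python
from itertools import combinations
from math import prod


def combined_probability(probs):
    """Alternating sum over all non-empty subsets of the per-field probabilities."""
    total = 0.0
    for size in range(1, len(probs) + 1):
        sign = 1 if size % 2 == 1 else -1  # +1 for odd subset sizes, -1 for even
        total += sign * sum(prod(subset) for subset in combinations(probs, size))
    return total


def combined_probability_product_form(probs):
    """Equivalent product form: 1 minus the product of (1 - p_i)."""
    return 1.0 - prod(1.0 - p for p in probs)


two_fields = combined_probability([0.5, 0.4])         # p1 + p2 - p1*p2
three_fields = combined_probability([0.5, 0.4, 0.2])  # k = 3 expansion
```

The alternating sum enumerates 2^k - 1 subsets, whereas the product form needs only k multiplications, so the latter may be preferable when many fields are combined.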

In step d3, a threshold must be set to screen the multi-field probabilities calculated in step d2. The threshold may be set manually according to the actual situation, determined by a computing device after rigorous calculation, or obtained from statistical comparison of a large amount of data.

The threshold determines the accuracy of the duplicate data quality check of the present invention: the larger the threshold, the higher the accuracy of the check.

Suppose the data to be tested have n fields; then 1 < k ≤ n. After the threshold is set, the field combinations whose duplication likelihood exceeds the threshold are retained. These retained field combinations serve as sample field combinations for the subsequent duplicate detection.

Through the formula, judging whether different records are duplicates is converted into calculating duplicate probabilities, which avoids judging every pair of records separately; only the valid combination pairs need to enter the probability calculation, which greatly improves judgment efficiency.
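The threshold screening can be sketched as an enumeration of field combinations followed by a filter. This is a sketch under assumptions: the function name, the `min_fields` parameter and the sample probabilities are hypothetical, and the product form 1 - (1 - p₁)…(1 - p_k) is used as an algebraically equivalent way of evaluating the alternating-sum formula. Setting `min_fields=2` mirrors the condition 1 < k ≤ n, and the comparison uses "greater than or equal to" as in step d3.

```python
from itertools import combinations
from math import prod


def select_sample_field_combinations(field_probs, threshold, min_fields=1):
    """Return {field combination: probability} for combinations meeting the threshold."""
    selected = {}
    fields = sorted(field_probs)
    for size in range(min_fields, len(fields) + 1):
        for combo in combinations(fields, size):
            # Probability that records are duplicates when all fields in combo are duplicated
            p = 1.0 - prod(1.0 - field_probs[f] for f in combo)
            if p >= threshold:
                selected[combo] = p
    return selected


# Hypothetical per-field probabilities from step d1
field_probs = {"email": 0.9, "name": 0.3, "phone": 0.5}
sample_combinations = select_sample_field_combinations(field_probs, 0.8, min_fields=2)
```

Exhaustive enumeration grows as 2^n in the number of fields, so for wide tables some pruning would be needed; the patent text does not prescribe how the combinations are enumerated.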

Step e is similar to step b; the only difference is that step b processes the training sample, whereas step e processes the data to be tested.

The data to be tested contain a plurality of records. Each record has a corresponding number, the record number; record numbers are arranged in order and increase successively. Each record is divided into multiple fields: field 1, field 2, field 3, field 4, and so on. A given field thus has one value in each record, so each field has as many values as there are records (some of these values are identical and some differ), and the position of a value within the field corresponds to the number of its record. For example, the first value of field 1 is field 1 of the first record, so the two values are naturally the same.

Fig. 7 is a flow chart of step e of the data quality checking method of the present invention. Step e comprises:

Step e1: compute the similarity between values within the same field of the data to be tested, and treat similar values whose similarity reaches or exceeds a threshold as the same value;

Here, a similarity is computed by some algorithm for values within each field that are very similar, and a threshold defined by the data quality analyst determines the level of similarity at which such values are treated as the same value.

The algorithm for computing similarity may be, for example, the Levenshtein algorithm or the longest common subsequence algorithm; the specific algorithm can be chosen according to actual needs.
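As one possible instance of the similarity computation in step e1, the sketch below implements the Levenshtein edit distance with the usual dynamic programming recurrence. The normalization of the distance into a similarity between 0 and 1 (dividing by the longer string length) is an assumption made for illustration; the patent names the algorithm but does not specify how the distance is turned into a similarity.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 for identical strings, 0.0 for entirely different ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def same_value(a, b, threshold):
    """Treat two values as the same when their similarity reaches the threshold."""
    return similarity(a, b) >= threshold
```

With a threshold chosen by the data quality analyst, `same_value` decides whether two slightly different values in a field are merged before the record numbers are counted.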

Step e2: analyze the data values of the data to be tested, and count the record numbers corresponding to each distinct value of each field;

Step f: perform duplicate detection on the analyzed data to be tested according to the sample field combinations, and select the record combinations whose duplicated fields all satisfy the sample field combinations;

This step performs the duplicate detection. First, based on the result of step d, it is checked whether the duplicated fields of each two-record combination satisfy the sample field combinations; then three-record combinations are generated from the two-record combinations that satisfy the condition, and it is checked whether the duplicated fields of the three records satisfy the sample field combinations. The process is repeated until no record combination satisfying the sample field combinations can be found.

Compared with typical duplicate detection methods, in which every pair of records must be checked, the present method calculates the duplication likelihood of field combinations and turns detection between records into detection of groups of identical records within the corresponding field combinations; the duplication likelihood of every pair of records need not be compared, which shortens the time and improves detection efficiency. The method is not limited to detecting two completely identical records; it can also detect two partially identical records, calculating their duplicate probability and deciding according to a threshold whether they are duplicates. The data quality analyst can define the criterion for judging whether two records are identical.

In addition, through the choice of training sample, the method automatically assigns weights to different fields, which provides a degree of flexibility.

As shown in figure 8, its flow chart for step f in the data quality checking method of repeated data of the present invention；Wherein, institute Stating step f includes:

Under normal circumstances, the number for having the record of Repeating Field combined can be reduced with the increase of Repeating Field number, Therefore it needs to be determined that respectively combining the minimum value N of field number in the sample field combination, so there is no need to search again for repeating Record of the field less than N combines, and reduces the combined number of the record for needing to search for, improves search efficiency.

For example, then only needing to search at least 4 fields if at least thering are 4 fields to repeat in sample field combination Duplicate record combination, which improves search efficiencies.

Step f2 searches for the identical record combination of at least N number of field in two of data to be tested records, detects And it is retained in the combination of the record in the sample field combination；

The minimum value N that field number is respectively combined in the sample field combination is combined in the record of the data to be tested In, if recording combined same field number is less than N, the unification of this record group is fixed not to search in sample field combination, therefore only The identical record combination of at least N number of field of rope, it is possible to reduce search time improves search efficiency.
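The determination of N and the two-record search of step f2 can be sketched as below. The data, the field names and the contents of `sample_field_combinations` are hypothetical; pairs are derived from records that share a value in some field, so records that share nothing are never compared.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical output of step d: field combinations judged to indicate repeats.
sample_field_combinations = {
    frozenset({"Col1", "Col2", "Col3"}),
    frozenset({"Col1", "Col2", "Col3", "Col4"}),
}
# Hypothetical data to be tested: record number -> field values.
records = {
    1: {"Col1": "a", "Col2": "b", "Col3": "c", "Col4": "d"},
    2: {"Col1": "a", "Col2": "b", "Col3": "c", "Col4": "x"},
    3: {"Col1": "a", "Col2": "b", "Col3": "y", "Col4": "d"},
}


def pairs_with_shared_fields(records):
    """Map each record pair to the set of fields on which the two records agree."""
    by_value = defaultdict(list)              # (field, value) -> record numbers
    for number in sorted(records):
        for field, value in records[number].items():
            by_value[(field, value)].append(number)
    shared = defaultdict(set)
    for (field, _value), numbers in by_value.items():
        for pair in combinations(numbers, 2):  # only records sharing a value
            shared[pair].add(field)
    return {pair: frozenset(fields) for pair, fields in shared.items()}


# Smallest field count among the sample field combinations.
N = min(len(combo) for combo in sample_field_combinations)

# Keep pairs with at least N identical fields, then those whose identical
# fields form one of the sample field combinations.
shared = pairs_with_shared_fields(records)
candidates = {pair: fields for pair, fields in shared.items() if len(fields) >= N}
retained = {pair: fields for pair, fields in candidates.items()
            if fields in sample_field_combinations}
print(N, sorted(candidates), sorted(retained))
```

In this toy data the pair (1, 3) shares three fields but not a combination from the sample set, so it passes the size filter and is then discarded.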

In this step, record combinations of n records with at least N identical fields are searched for on the basis of the known record combinations of n-1 records with at least N identical fields. The following conditions must be satisfied:

1) A combination of n records is formed by merging two combinations of n-1 records, and these two combinations of n-1 records have n-2 records in common;

2) Every subset of n-1 records of the newly formed combination of n records is among the record combinations of n-1 records with at least N identical fields.

Condition 1) means that a combination of n records must be formed by merging two combinations of n-1 records that have n-2 records in common. For example, a combination of 4 records must be formed by merging two combinations of 3 records that have 2 records in common; only then does the merged combination contain exactly 4 records.

Condition 2) means that the merged combination of n records has C(n, n-1) = n subsets of n-1 records, and each of these subsets can be found among the previously obtained record combinations of n-1 records. In other words, the combination of n records can be formed only if all n of its subsets of n-1 records are present among the record combinations of n-1 records. For example, the 4-record combination (1,2,3,4) has four 3-record subsets, (1,2,3), (1,2,4), (1,3,4) and (2,3,4), and all four can be found among the previously obtained combinations of 3 records.

All the record combinations of 3 records have at least N identical fields, so a 4-record combination formed from them may also have at least N identical fields. If one subset (e.g. (1,2,3)) of a 4-record combination (e.g. (1,2,3,4)) has only N-1 identical fields (that is, it is not among the 3-record combinations with at least N identical fields), then the 4-record combination (1,2,3,4) cannot have N identical fields; it has at most N-1 identical fields, and these are necessarily among the N-1 identical fields of the subset (1,2,3). Therefore condition 2) must hold.

Only when condition 1) and condition 2) hold simultaneously is it necessary to check whether the n records form a record combination with at least N identical fields.
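Conditions 1) and 2) resemble the join and prune steps of level-wise candidate generation. A minimal sketch under that reading is given below; `join_candidates` is a hypothetical helper name. It only produces candidates, and each candidate would still have to be checked for at least N identical fields.

```python
from itertools import combinations


def join_candidates(previous):
    """Build n-record candidates from known (n-1)-record combinations.

    `previous` is a collection of ascending tuples of record numbers,
    all of the same length n-1.
    """
    previous = set(previous)
    candidates = set()
    for a, b in combinations(sorted(previous), 2):
        merged = tuple(sorted(set(a) | set(b)))
        # Condition 1: the parents share n-2 records, so the union has n records.
        if len(merged) != len(a) + 1:
            continue
        # Condition 2: every (n-1)-record subset must already be known.
        if all(sub in previous for sub in combinations(merged, len(a))):
            candidates.add(merged)
    return candidates


triples = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4), (2, 3, 5)}
print(join_candidates(triples))
```

Here (2, 3, 5) cannot extend to a 4-record candidate because, for instance, (1, 2, 5) is not among the known 3-record combinations.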

Step f4: detect and retain the record combinations of n records whose identical fields are in the sample field combination, and at the same time delete, from the record combinations of n-1 records, all subsets of n-1 records of each retained record combination of n records; return to step f3.

The record combinations of n records found in step f3 are checked. If the field combination formed by the identical fields is in the sample field combination, the n records are regarded as identical and the record combination is retained; if it is not in the sample field combination, the n records are regarded as not identical and the record combination is deleted.

In addition, a record combination of n records being identical expresses the same meaning as each of its n subsets of n-1 records being identical, namely that these n records are repeated. Only one expression of the same meaning needs to be kept; that is, the n record combinations of n-1 records are merged into the single record combination of n records. Therefore, when a record combination of n records is retained, its corresponding n record combinations of n-1 records need to be deleted.

For example, if the identical fields of the 4-record combination (1,2,3,4) are in the sample field combination, the combination (1,2,3,4) is retained and its 4 corresponding 3-record combinations (1,2,3), (1,2,4), (1,3,4) and (2,3,4) are deleted.
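The retain-and-delete rule of step f4 can be sketched as follows. The function name, the `shared_fields` callable and the sample data are assumptions introduced for illustration.

```python
from itertools import combinations


def retain_and_prune(candidates, previous, shared_fields, sample_field_combinations):
    """Keep matching n-record combinations and drop their (n-1)-record subsets.

    candidates: iterable of n-record tuples (ascending record numbers).
    previous: retained (n-1)-record tuples.
    shared_fields: callable mapping a tuple of record numbers to a frozenset of
    the fields that are identical across those records.
    """
    kept = {c for c in candidates if shared_fields(c) in sample_field_combinations}
    covered = {sub for c in kept for sub in combinations(c, len(c) - 1)}
    return kept, set(previous) - covered


# Hypothetical inputs mirroring the (1,2,3,4) example above.
sample = {frozenset({"Col1", "Col2", "Col3"})}
fields_of = {(1, 2, 3, 4): frozenset({"Col1", "Col2", "Col3"})}
previous = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4), (5, 6, 7)}

kept, remaining = retain_and_prune([(1, 2, 3, 4)], previous, fields_of.get, sample)
print(kept, remaining)
```

The unrelated combination (5, 6, 7) survives because it is not a subset of any retained 4-record combination.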

Through steps f1 to f4, all possible field combinations can be covered by stepwise calculation, which avoids missing any repeated records.

Embodiment one

This embodiment is the data quality checking method for repeated data as described above, and differs in that, as shown in Figure 9, the flow chart of embodiment one of the data quality checking method for repeated data of the present invention, the data quality checking method further includes:

The output in this step can take different forms: it can be presented graphically, or the detection result can be exported to facilitate merging of records. All retained record combinations and their repeat probabilities can be output, or only part of the retained record combinations and their repeat probabilities.

Embodiment two

This embodiment is the data quality checking method for repeated data as described above, and differs in that, as shown in Figure 10, the flow chart of embodiment two of the data quality checking method for repeated data of the present invention, step b further includes:

Step b1: calculate the similarity between values in the same field of the training sample, and treat similar values whose similarity reaches or exceeds a threshold as identical values; step b1 precedes step b2.

Data in the training sample may vary slightly because of errors, so that the values of the same field in two records are very similar but not identical. Adding this step eliminates such errors and improves the accuracy of judging repeated data.
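One possible way to treat similar values as identical is to map every value to a representative before counting. The sketch below uses the ratio from Python's standard `difflib.SequenceMatcher` as the similarity measure, which is a substitute chosen for brevity and not one of the algorithms named in the source; the greedy first-match grouping and the threshold are likewise assumptions.

```python
from difflib import SequenceMatcher


def canonicalise(values, threshold):
    """Map each value to the first earlier value it is sufficiently similar to."""
    representatives = []
    mapping = {}
    for value in values:
        for rep in representatives:
            if SequenceMatcher(None, value, rep).ratio() >= threshold:
                mapping[value] = rep      # treat as identical to the representative
                break
        else:
            representatives.append(value)  # no close match: becomes a new group
            mapping[value] = value
    return mapping


col1 = ["1aaaa", "1aaab", "1bbbb", "1eeee"]
print(canonicalise(col1, threshold=0.75))
```

Greedy grouping depends on the order of the values, so a production implementation might prefer a clustering that is independent of order.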

Embodiment three

This embodiment is the data quality checking method for repeated data as described above, and differs in that, as shown in Figure 11, the flow chart of embodiment three of the data quality checking method for repeated data of the present invention, the data quality checking method further includes:

Step a: extract the training sample from the data source to be tested; step a precedes step b;

The data source to be tested contains multiple records. Each record has a corresponding number, the record number; record numbers are arranged in order and increase successively. Each record is divided into multiple fields: field 1, field 2, field 3, field 4, and so on. The same field thus has one value in each record, so each field has as many values as there are records (some of these values are identical and some are different), and the position of a value within a field corresponds to the record number. Here, the first value of field 1 and field 1 of the first record are the same thing, so their values are naturally identical.

The number of records extracted can be decided by the data quality analyst or determined according to actual needs.

Because the training sample is extracted from the data source to be tested, the training sample and the data to be tested come from the same source, which improves the accuracy of judging repeated data.
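The source does not say how the records are chosen. Assuming simple random sampling without replacement, step a might be sketched as follows; the function name and the seed parameter are hypothetical.

```python
import random


def extract_training_sample(record_numbers, sample_size, seed=None):
    """Pick record numbers at random (assumed strategy) and return them in order."""
    numbers = list(record_numbers)
    rng = random.Random(seed)  # a fixed seed makes the extraction repeatable
    chosen = rng.sample(numbers, min(sample_size, len(numbers)))
    return sorted(chosen)


sample = extract_training_sample(range(1, 5001), 1000, seed=42)
print(len(sample), sample[:5])
```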

Embodiment four

This embodiment applies the data quality checking method for repeated data as described above to a specific data set, as follows:

S1: Part of the data to be tested in this example is shown in Figure 12. A training sample is extracted from the data source to be tested; the number of records in the sample is predefined by the data quality analyst and is assumed here to be 1000.

S2: Analyze the data values of the training sample and output the record numbers corresponding to each distinct value of each field; partial results are shown in Figure 13.

S2.1: Some values in a field may be very close, differing only in individual characters, such as 1aaaa and 1aaab in Col1. Some method can be used to calculate the similarity of such values, and the data quality analyst sets a threshold to judge whether they are identical. Here it is assumed that 1aaaa and 1aaab are judged identical.

S2.2: Analyze the above results and output combination pairs from every value that has 2 or more record numbers; partial results are shown in Figure 14. The process is as follows:

S2.2.1: In Col1, the three values 1aaaa/1aaab, 1bbbb and 1eeee each have 2 record numbers, forming 3 combination pairs: (1,2), (3,5) and (6,7). Each combination pair is stored as one record, with its Col1 repeat mark set to 1 and the remaining marks set to 0.

S2.2.2: In Col2, 2aaaa and 2eeee form two combination pairs, (1,2) and (6,7). Both are already among the combination pairs formed earlier, so the Col2 repeat mark of those existing pairs is set to 1, indicating that they also repeat in Col2. 2bbbb forms three combination pairs, (3,4), (3,5) and (4,5), of which (3,5) is among the pairs formed earlier and is handled as above. (3,4) and (4,5) are newly generated combination pairs; new records are created for them, with the Col2 repeat mark set to 1 and the remaining marks set to 0.

S2.2.3: Col3 to Col5 are handled in the same manner. All combination pairs thus formed constitute the repeat model training set.
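The walk-through in S2.2.1 and S2.2.2 can be reproduced with a short sketch. The record numbers per value are taken from the text above (with 1aaaa and 1aaab already merged); the dictionary layout and function name are assumptions.

```python
from itertools import combinations

# Record numbers per value, following the Col1 and Col2 walk-through above.
value_index = {
    "Col1": {"1aaaa": [1, 2], "1bbbb": [3, 5], "1eeee": [6, 7]},
    "Col2": {"2aaaa": [1, 2], "2eeee": [6, 7], "2bbbb": [3, 4, 5]},
}


def build_training_set(value_index):
    """Map each combination pair to a row of per-field repeat marks (0 or 1)."""
    fields = list(value_index)
    training_set = {}
    for field in fields:
        for numbers in value_index[field].values():
            for pair in combinations(sorted(numbers), 2):
                # A new pair starts with all marks at 0; a known pair is reused.
                row = training_set.setdefault(pair, dict.fromkeys(fields, 0))
                row[field] = 1
    return training_set


training_set = build_training_set(value_index)
for pair, marks in sorted(training_set.items()):
    print(pair, marks)
```

Keying the rows by pair is what prevents a pair such as (3, 5) from being recorded twice when it repeats in more than one field.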

S3: The training set is output in random order, and the data quality analyst starts the model training process. The training method is as follows: each time, a certain number of combination pairs and their corresponding data are output, and the data quality analyst marks each pair as repeated or not repeated according to the content of its records.

S4: After marking the output combination pairs, the data quality analyst can choose whether to continue training the model. If so, the process returns to S1 and repeats the above steps; otherwise it continues with the following steps.

S5: Process the combination pairs marked over the several rounds of model training. Assume that part of the marked combination pairs are as shown in Figure 15, where the "whether repeated" column indicates whether the combination pair was finally marked as repeated records, and the remaining fields have the same meaning as in Figure 14.

S5.1: First, over all marked combination pairs, calculate for each field the number x of combination pairs in which the field repeats, and the number y of those pairs that were marked as repeated. In Figure 15, for example, the number of combination pairs in which Col1 repeats is 3, of which 3 are marked as repeated; the number of pairs in which Col4 repeats is 7, of which 3 are marked as repeated; and so on. The value y/x is calculated for each field and is interpreted as how likely records with identical field k are to be repeated. Assume that the final y/x values for Col1 to Col5 are 0.4, 0.4, 0.4, 0.3 and 0.3 respectively. (Because the data in the figure are only part of the training sample, exact values cannot be calculated from them; values are assumed so that the subsequent steps can proceed, although this may cause the final result to differ considerably from the correct one.)
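The y/x calculation can be sketched as below. The labelled pairs are invented for illustration and are not the data of Figure 15; returning 0.0 for a field that never repeats is also an assumption.

```python
def field_repeat_probability(labelled_pairs, fields):
    """Return y/x per field: x pairs repeat on the field, y of them are repeated records."""
    probabilities = {}
    for field in fields:
        x = sum(1 for marks, _ in labelled_pairs if marks[field] == 1)
        y = sum(1 for marks, repeated in labelled_pairs
                if marks[field] == 1 and repeated)
        probabilities[field] = y / x if x else 0.0  # assumed value when x is 0
    return probabilities


# Hypothetical labelled pairs: (per-field repeat marks, analyst's label).
labelled_pairs = [
    ({"Col1": 1, "Col4": 1}, True),
    ({"Col1": 1, "Col4": 1}, True),
    ({"Col1": 0, "Col4": 1}, False),
    ({"Col1": 0, "Col4": 1}, False),
]
print(field_repeat_probability(labelled_pairs, ["Col1", "Col4"]))
```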

S5.2: The data quality analyst sets a threshold defining how likely a repetition must be for records to be judged repeated records; this threshold is assumed to be 0.75. Then, for each combination of k identical fields, the likelihood that the records are the same record is calculated and compared with the threshold, and the field combinations that reach the threshold are kept, as shown in Figure 16.
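A sketch of this screening is given below, using the assumed y/x values of S5.1 and the assumed threshold 0.75. The multi-field probability is computed with the alternating sum of products of single-field probabilities (single terms added, pairwise products subtracted, and so on); the function names are hypothetical, and the contents of Figure 16 are not reproduced here.

```python
from itertools import combinations
from math import prod


def combined_probability(probabilities):
    """Alternating sum over all non-empty subsets of the single-field probabilities."""
    total = 0.0
    for size in range(1, len(probabilities) + 1):
        sign = 1 if size % 2 else -1  # odd-sized terms are added, even subtracted
        total += sign * sum(prod(c) for c in combinations(probabilities, size))
    return total


def screen(field_probabilities, threshold):
    """Keep every field combination whose combined probability reaches the threshold."""
    kept = {}
    fields = sorted(field_probabilities)
    for size in range(1, len(fields) + 1):
        for combo in combinations(fields, size):
            p = combined_probability([field_probabilities[f] for f in combo])
            if p >= threshold:
                kept[combo] = p
    return kept


p = {"Col1": 0.4, "Col2": 0.4, "Col3": 0.4, "Col4": 0.3, "Col5": 0.3}
sample_field_combinations = screen(p, threshold=0.75)
print(min(len(combo) for combo in sample_field_combinations))
```

With these assumed values no single field or pair of fields reaches 0.75, so every retained combination has at least three fields, which agrees with the starting point of S7.1.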

The above is the repeat model training process. Next, repeat detection is performed using the trained model.

S6: Receive the data to be tested and the field combinations finally kept. Then analyze the values of the data to be tested and output the record numbers corresponding to each distinct value of each field; partial results are shown in Figure 13. Some values in a field may be very close, differing only in individual characters, such as 1aaaa and 1aaab in Col1. Some method can be used to calculate the similarity of such values, and the data quality analyst sets a threshold to judge whether they are identical. Here it is assumed that 1aaaa and 1aaab are judged identical.

S7: Perform repeat detection. The process is as follows:

S7.1: Since every field combination finally kept has at least three fields, the repeated records finally detected must have at least three identical fields. First, search for combinations of two records with at least three identical fields. The result is {(1,2),3}, {(3,5),4}, {(6,7),5}, {(3,4),4}, {(4,5),3}, where the number outside the parentheses and inside the braces indicates how many fields repeat.

S7.2: Check whether the identical field combination of each record combination above is among the field combinations finally kept by the repetition decision condition generation unit 14; if not, delete the record combination. Here {(4,5),3} is deleted.

S7.3: Among the remaining record combinations, search for combinations of n records with at least three identical fields, based on the combinations of n-1 records with at least three identical fields known from the previous step. Then check whether these new combinations have at least three identical fields.

S7.4: Check whether the identical field combination of each n-record combination above is among the field combinations finally kept; if not, delete the record combination. If it is, not only retain the combination but also delete, from the previous step's combinations of n-1 records with at least three identical fields, every subset of n-1 records of that combination.

S7.5: If no combination of n records with at least three identical fields can be found, the detection process ends; otherwise return to S7.3.

In this embodiment, the detection ends at S7.2.

S8: Output the detection result. It can be presented graphically, or exported to facilitate merging of records.

S8.1: All combinations of 3 or more records with at least three identical fields retained in S7 can be output, together with the probability that each combination consists of repeated records and the probability that each pair of records within the combination is repeated.

S8.2: The combinations of 2 records with at least three identical fields that were retained in S7 and not output in S8.1 can be output, together with the probability that each combination consists of repeated records.

In this embodiment, the record contents of the combination pairs (1,2), (3,5), (6,7) and (3,4) are output together with the likelihood that these records are repeated. (This result was obtained using assumed intermediate data and may therefore differ considerably from the actual correct result.)

Embodiment five

This embodiment is the data quality checking device for repeated data corresponding to the data quality checking method for repeated data as described above.

Figure 17 is a structure chart of the data quality checking device for repeated data of the present invention; wherein the data quality checking device for repeated data includes:

A training set generation unit 2, which analyzes the data values of the training sample and generates a model training set;

Figure 18 is a structure chart of the training set generation unit of the data quality checking device for repeated data of the present invention; wherein the training set generation unit 2 includes:

A record number statistics module 22, which analyzes the data values of the training sample and collects the record numbers corresponding to each distinct value of each field;

A record number processing module 23, which processes the record numbers corresponding to each distinct value of each field and generates the model training set.

Generating the model training set converts the analysis of records into an analysis of the same field across records, which improves the speed of subsequent processing.

Figure 19 is a structure chart of the record number processing module of the data quality checking device for repeated data of the present invention; wherein the record number processing module 23 includes:

A field-one two-value repeat-marking submodule 231, which tallies the values of field one that correspond to two records; the two records corresponding to each such value form a combination pair, the combination pair is recorded, and a field repeat mark is added for field one;

A field-one multi-value repeat-marking submodule 232, which tallies the values of field one that correspond to three or more records; the records corresponding to each such value are combined two by two into combination pairs, each combination pair is recorded, and a field repeat mark is added for field one;

A field-two repeat-marking submodule 233, which tallies the values of field two that correspond to two or more records; the records corresponding to each such value are combined two by two into combination pairs; if a combination pair is identical to a combination pair already recorded, a field repeat mark is added for field two of the recorded combination pair; if it differs from every combination pair already recorded, the combination pair is recorded and a field repeat mark is added for field two;

A multi-field repeat-marking submodule 234, which processes the other fields in the same way as the field-two repeat-marking submodule 233; all combination pairs thus formed constitute the model training set.

Submodules 231 to 234 are only one arrangement for generating the model training set; this arrangement can generate the model training set quickly while avoiding omitted or duplicated combination pairs.

A sample record repeat-marking unit 3, which analyzes each combination pair of the model training set and, manually or by an algorithm, marks the two records corresponding to the combination pair as repeated records or non-repeated records; it is then chosen whether to continue training: if so, the training sample is redetermined and control returns to the training set generation unit 2; otherwise control passes to the sample combination screening unit 4.

Whether to continue training or to repeat training can be decided according to the comparison of the output combination pairs. If further training is needed, the training sample is redetermined and control returns to the training set generation unit 2, and it is then judged whether the two records of each new combination pair are repeated; the results of the several rounds of training are combined to improve the accuracy of judgment in subsequent analysis. If it is not needed, control passes to the sample combination screening unit 4.

A sample combination screening unit 4, which calculates the probability that records are repeated given that one or more fields repeat, and screens out the field combinations with larger probability as the sample field combination;

Figure 20 is a structure chart of the sample combination screening unit of the data quality checking device for repeated data of the present invention; wherein the sample combination screening unit 4 includes:

A single-field repeat calculation module 41, which takes as divisor the number of combination pairs in which a given field is marked as repeated, and as dividend the number of those combination pairs that are also marked as repeated records; the quotient is the probability that records are repeated given that the field repeats;

A multi-field repeat calculation module 42, which calculates, from the single-field probabilities, the probability that records are repeated given that multiple fields repeat;

The idea behind the formula is as follows. For the k fields whose probability is to be calculated: taking one field at a time, there are k ways, and the value for each way is a single probability p_i; taking two at a time, there are C(k,2) ways, and the value for each way is the product of two probabilities p_ip_j; …; taking k at a time, there are C(k,k) ways, and the value is the product of k probabilities p₁p₂…p_k. The coefficient of the sum of the values for each number of fields taken is determined by that number: if an odd number is taken, the coefficient is +1; if an even number is taken, the coefficient is -1. Adding these sums with their coefficients gives the final probability that records are repeated given that the k fields repeat.
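Written out, the alternating sum described above takes the following form. The second line is a standard algebraic identity added here for convenience; it is not stated in the source. Reading it as "at least one of k independent indications holds" is an interpretation, not something the source asserts.

```latex
\begin{aligned}
p(1,2,\ldots,k)
  &= \sum_{i} p_i \;-\; \sum_{i<j} p_i p_j \;+\; \sum_{i<j<m} p_i p_j p_m \;-\; \cdots \;+\; (-1)^{k-1}\, p_1 p_2 \cdots p_k \\
  &= 1 - \prod_{i=1}^{k} \left(1 - p_i\right)
\end{aligned}
```

For example, with k = 3 and p_1 = p_2 = p_3 = 0.4, the first line gives 1.2 - 0.48 + 0.064 = 0.784, and the second gives 1 - 0.6^3 = 0.784.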

A threshold combination screening module 43, which sets a threshold and screens out the field combinations whose record repeat probability is greater than or equal to the threshold as the sample field combination.

The probabilities calculated by the multi-field repeat calculation module 42 need to be screened against a set threshold. The threshold is determined according to actual conditions; it may be set manually, determined by a computing device after rigorous calculation, or obtained from statistical comparison of a large amount of data.

A detection data analysis unit 5, which analyzes the values of the data to be tested and outputs the record numbers corresponding to each distinct value of each field;

This unit is similar to the training set generation unit 2; the only difference is that the training set generation unit 2 processes the training sample, whereas this unit processes the data to be tested.

Figure 21 is a structure chart of the detection data analysis unit of the data quality checking device for repeated data of the present invention; wherein the detection data analysis unit 5 includes:

A data similarity calculation module 51, which calculates the similarity between values in the same field of the data to be tested and treats similar values whose similarity reaches or exceeds a threshold as identical values;

A data record statistics module 52, which analyzes the data values of the data to be tested and collects the record numbers corresponding to each distinct value of each field;

A detection data screening unit 6, which performs repeat detection on the analyzed data to be tested according to the sample field combination, and screens out all record combinations whose repeated fields conform to the sample field combination;

This unit performs repeat detection. First, based on the analysis result of the sample combination screening unit 4, it checks whether the repeated fields of each pair of records conform to the sample field combination. It then generates three-record combinations from the two-record combinations that satisfy the condition, and checks whether the repeated fields of the three records conform to the sample field combination. This process is repeated until no further record combination conforming to the sample field combination can be found.

In this way, compared with typical repeat detection in which every pair of records must be compared, the device calculates the repeat likelihood of field combinations and converts detection between records into detection of record combinations that share the values of the corresponding field combination. The repeat likelihood of arbitrary pairs of records no longer needs to be compared, which shortens the detection time and improves detection efficiency. Moreover, the device is not limited to detecting two records that are completely identical; it can also detect two records that are only partly identical, by calculating their repeat probability and deciding against a threshold whether they are repeated. In the device, the data quality analyst can define the condition under which two records are judged identical.

In addition, the device can automatically assign weights to different fields by means of the training sample, which provides a degree of flexibility.

Figure 22 is a structure chart of the detection data screening unit of the data quality checking device for repeated data of the present invention; wherein the detection data screening unit 6 includes:

A field number confirmation module 61, which determines the minimum value N of the number of fields among the combinations in the sample field combination;

A two-record combination detection module 62, which searches the data to be tested for two-record combinations in which at least N fields are identical, and detects and retains the record combinations whose identical fields are in the sample field combination;

A multi-record combination search module 63, which, among the retained record combinations, searches for record combinations of n records with at least N identical fields on the basis of the known record combinations of n-1 records with at least N identical fields; if none is found, the process ends;

In this module, record combinations of n records with at least N identical fields are searched for on the basis of the known record combinations of n-1 records with at least N identical fields; the conditions that must be satisfied are conditions 1) and 2) described above for the method.

A multi-record combination detection module 64, which detects and retains the record combinations of n records whose identical fields are in the sample field combination, and at the same time deletes, from the record combinations of n-1 records, all subsets of n-1 records of each retained record combination of n records; control then returns to the multi-record combination search module 63.

Through modules 61 to 64, all possible field combinations can be covered by stepwise calculation, which avoids missing any repeated records.

Embodiment six

This embodiment is the data quality checking device for repeated data as described above, and differs in that, as shown in Figure 23, the structure chart of embodiment six of the data quality checking device for repeated data of the present invention, the data quality checking device further includes:

A detection result output unit 7, which outputs the retained record combinations and the probability that each record combination is repeated; the detection result output unit 7 follows the detection data screening unit 6.

The output of this unit can take different forms: it can be presented graphically, or the detection result can be exported to facilitate merging of records. All retained record combinations and their repeat probabilities can be output, or only part of the retained record combinations and their repeat probabilities.

Embodiment seven

This embodiment is the data quality checking device for repeated data as described above, and differs in that, as shown in Figure 24, the structure chart of embodiment seven of the data quality checking device for repeated data of the present invention, the training set generation unit 2 further includes:

A sample similarity calculation module 21, which calculates the similarity between values in the same field of the training sample and treats similar values whose similarity reaches or exceeds a threshold as identical values; the sample similarity calculation module 21 precedes the record number statistics module 22.

Data in the training sample may vary slightly because of errors, so that the values of the same field in two records are very similar but not identical. Adding this module eliminates such errors and improves the accuracy of judging repeated data.

Embodiment eight

This embodiment is the data quality checking device for repeated data as described above, and differs in that, as shown in Figure 25, the structure chart of embodiment eight of the data quality checking device for repeated data of the present invention, the data quality checking device further includes:

A training sample extraction unit 1, which extracts the training sample from the data source to be tested;

The foregoing are merely preferred embodiments of the present invention, which are illustrative rather than restrictive. Those skilled in the art will understand that many changes, modifications, and even equivalents may be made within the spirit and scope defined by the claims of the present invention, all of which fall within the protection scope of the present invention.

Claims

1. A data quality checking method for repeated data, characterized by comprising:

Step b: analyzing the data values of a training sample comprising a plurality of records and generating a model training set, the model training set being the records of any two records that have repeated fields together with the marks of their repeated fields, wherein the two record numbers corresponding to a same value of each field form a combination pair;

Step d: calculating the probability that records are repeated given that one or more fields repeat, and screening out the field combinations with larger probability as the sample field set;

Step e: analyzing the values of the data to be tested, and outputting the record numbers corresponding to each distinct value of each field;

Step f: performing repeat detection on the analyzed data to be tested according to the sample field set, and screening out all record combinations whose repeated fields conform to the sample field set.

2. The data quality checking method according to claim 1, characterized in that the data quality checking method further comprises:

Step a: extracting the training sample from the data source to be tested; step a precedes step b.

3. The data quality checking method according to claim 2, characterized in that the data quality checking method further comprises:

Step g: outputting the screened-out record combinations and the probability that each record combination is repeated; step g follows step f.

4. data quality checking method according to claim 1 or 2 or 3, which is characterized in that the step b includes:

Step b2 analyzes the data value of the training sample, and each different value for counting each field is corresponding described Record number；

Step b3 handles the corresponding record number of each different value of each field, generates the model training Collection.

5. The data quality checking method according to claim 4, characterized in that step b3 comprises:

Step b31: for field 1, counting the values that correspond to two records; the two records corresponding to each such value form one combination pair; recording the combination pair and adding a field-repeat label at field 1;

Step b32: for field 1, counting the values that correspond to three or more records; the records corresponding to each such value are combined two by two, each pair forming one combination pair; recording the combination pair and adding the field-repeat label at field 1;

Step b33: for field 2, counting the values that correspond to two or more records; the records corresponding to each such value are combined two by two, each pair forming one combination pair; if the combination pair is identical to a combination pair already recorded, adding the field-repeat label at field 2 of the recorded combination pair; if the combination pair differs from every combination pair already recorded, recording the combination pair and adding the field-repeat label at field 2;

Step b34: processing the remaining fields according to step b33; all combination pairs thus formed constitute the model training set.
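Steps b31 to b34 can be sketched as a single pass over a value index. This is an illustration only: it treats the two-record and multi-record cases uniformly, which is a simplification of the case split in the claim, and the name `build_training_set` and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_training_set(value_index):
    """Return {(rec_a, rec_b): set of fields on which the pair shares a value}."""
    pairs = defaultdict(set)
    for field, values in value_index.items():
        for rec_nos in values.values():
            # A value held by a single record yields no pair.
            for a, b in combinations(sorted(rec_nos), 2):
                # A pair seen before only gains a new field label (step b33).
                pairs[(a, b)].add(field)
    return dict(pairs)


# Hypothetical value index: field -> value -> record numbers.
value_index = {
    "name": {"Li Wei": [1, 2, 4], "Zhang Min": [3]},
    "phone": {"555-0101": [1, 3], "555-0199": [2, 4]},
}
training_set = build_training_set(value_index)
```

Here records 2 and 4 share both a name and a phone number, so their combination pair carries two field-repeat labels, while the other pairs carry one.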

6. The data quality checking method according to claim 1, 2 or 3, characterized in that step d comprises:

Step d1: taking the number of combination pairs labeled as repeated for a given field as the divisor, and the number of those combination pairs that are also labeled as record duplicates as the dividend, the quotient being the probability that records are duplicates given that the field is repeated;

Step d3: setting a threshold, and selecting the field combinations whose record-duplicate probability is greater than or equal to the threshold as the sample field set.
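The quotient in step d1 and the threshold filter in step d3 can be sketched as follows. The sketch covers single fields only and does not attempt the multi-field case; the function names, the threshold value, and the labeled data are hypothetical.

```python
def field_duplicate_probability(training_set, duplicate_pairs, field):
    """Share of pairs labeled with `field` that are also labeled record duplicates."""
    labeled = [pair for pair, fields in training_set.items() if field in fields]
    if not labeled:
        return 0.0
    both = sum(1 for pair in labeled if pair in duplicate_pairs)
    return both / len(labeled)


def select_fields(training_set, duplicate_pairs, threshold):
    """Keep the fields whose probability is at least `threshold` (step d3)."""
    fields = set().union(*training_set.values()) if training_set else set()
    probs = {
        f: field_duplicate_probability(training_set, duplicate_pairs, f)
        for f in fields
    }
    return {f: p for f, p in probs.items() if p >= threshold}


# Hypothetical training set and labels (pairs marked as record duplicates).
training_set = {
    (1, 2): {"name"},
    (1, 4): {"name"},
    (2, 4): {"name", "phone"},
    (1, 3): {"phone"},
}
duplicate_pairs = {(1, 2), (2, 4)}
selected = select_fields(training_set, duplicate_pairs, threshold=0.6)
```

With these labels, two of the three pairs repeated on "name" are record duplicates, against one of the two pairs repeated on "phone", so only "name" passes a threshold of 0.6.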

7. The data quality checking method according to claim 6, characterized in that the probability that records are duplicates given that multiple fields are repeated is calculated by the following formula:

where p(1, 2, ..., k) is the probability that records are duplicates given that fields 1, 2, ..., k are repeated, and p_1, p_2, p_i, p_j, p_m, p_k are respectively the probabilities that records are duplicates given that field 1, 2, i, j, m, k is repeated.

8. The data quality checking method according to claim 1, 2 or 3, characterized in that step f comprises:

Step f1: determining the minimum number N of fields among the field combinations in the sample field set;

Step f2: searching the data to be tested for record combinations of two records in which at least N fields are identical, and detecting and saving those record combinations that fall within the sample field set;

Step f3: from the saved record combinations of n-1 records in which at least N fields are identical, searching for record combinations of n records in which at least N fields are identical; if none is found, terminating;

Step f4: detecting and saving the record combinations of n records that fall within the sample field set, and at the same time deleting, from the record combinations of n-1 records, every (n-1)-record subset of the saved record combinations of n records; returning to step f3.
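Steps f1 and f2 can be sketched for the two-record case. This is a sketch under assumptions: "falls within the sample field set" is read here as "the fields shared by the pair include at least one sample field combination", which is only one possible reading of the claim, and `detect_pairs` and the data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def detect_pairs(records, sample_field_set):
    """Find two-record combinations whose shared fields cover a sample combination."""
    n_min = min(len(combo) for combo in sample_field_set)  # step f1
    # Step e: record numbers for each (field, value).
    index = defaultdict(list)
    for rec_no, record in records.items():
        for field, value in record.items():
            index[(field, value)].append(rec_no)
    # Only records that actually share a value are ever paired up.
    shared = defaultdict(set)
    for (field, _value), rec_nos in index.items():
        for pair in combinations(sorted(rec_nos), 2):
            shared[pair].add(field)
    # Step f2: at least n_min identical fields, and a sample combination is covered.
    return {
        pair: fields
        for pair, fields in shared.items()
        if len(fields) >= n_min
        and any(combo <= fields for combo in sample_field_set)
    }


# Hypothetical data to be tested and sample field set.
records = {
    1: {"name": "Li Wei", "phone": "555-0101", "city": "Guangzhou"},
    2: {"name": "Li Wei", "phone": "555-0101", "city": "Shenzhen"},
    3: {"name": "Li Wei", "phone": "555-0199", "city": "Guangzhou"},
}
sample_field_set = {frozenset({"name", "phone"})}
detected = detect_pairs(records, sample_field_set)
```

Records 1 and 3 also share two fields (name and city), but that combination is not in the sample field set, so only the pair (1, 2) is kept.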

9. The data quality checking method according to claim 8, characterized in that, in step f3, the search must satisfy the following conditions:

The record combination of n records is formed by combining two record combinations of n-1 records, the two record combinations of n-1 records having n-2 records in common;

Every (n-1)-record subset of the newly formed record combination of n records is among the record combinations of n-1 records in which at least N fields are identical.
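The two conditions resemble level-wise candidate generation with subset pruning. The following minimal sketch generates n-record candidates only; checking that a candidate's records really share at least N identical fields is left out, and the name `grow_combinations` and the sample combinations are hypothetical.

```python
from itertools import combinations


def grow_combinations(prev_level):
    """Build n-record candidates from (n-1)-record combinations (frozensets)."""
    prev = set(prev_level)
    candidates = set()
    for a, b in combinations(prev, 2):
        union = a | b
        # Condition 1: the two (n-1)-record combinations share n-2 records,
        # which holds exactly when their union has one extra record.
        if len(union) != len(a) + 1:
            continue
        # Condition 2: every (n-1)-record subset is itself a known combination.
        if all(union - {rec} in prev for rec in union):
            candidates.add(union)
    return candidates


# Hypothetical two-record combinations with at least N identical fields.
level_2 = {
    frozenset({1, 2}),
    frozenset({1, 3}),
    frozenset({2, 3}),
    frozenset({3, 4}),
}
level_3 = grow_combinations(level_2)
```

Records 1, 2 and 3 form a candidate because all three of their pairs are present; any candidate containing record 4 is rejected because the pairs (1, 4) and (2, 4) are missing.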

10. A data quality checking device for repeated data corresponding to the data quality checking method according to any one of claims 1-9, characterized by comprising:

A training set generation unit, which analyzes the data values of a training sample comprising a plurality of records to generate a model training set, the model training set consisting of every pair of records that share a repeated field, together with labels identifying the repeated fields, wherein the record numbers of two records having the same value in a field form one combination pair;

A sample record labeling unit, which analyzes each combination pair in the model training set and, manually or by an algorithm, labels the two records of the combination pair as record duplicates or as not record duplicates; it then chooses whether to continue training; if training continues, the training sample is determined anew and control returns to the training set generation unit; otherwise control passes to the sample combination screening unit;

The sample combination screening unit, which calculates the probability that records are duplicates given that one or more fields are repeated, and selects the field combinations with the higher probabilities as the sample field set;

A detection data analysis unit, which analyzes the values of the data to be tested and outputs the record numbers corresponding to each distinct value of each field;

A detection data screening unit, which performs duplicate detection on the analyzed data to be tested according to the sample field set, and selects the record combinations whose repeated fields all satisfy the sample field set.