A data quality detection method and device for duplicate data
Technical field
The present invention relates to the technical field of data quality monitoring, and in particular to a data quality detection method and device for duplicate data.
Background art
The rapid development of information technology has made data one of the most important resources for realizing business value. However, as data volumes keep growing, data quality problems follow. Missing, erroneous, and inconsistent data hinder an enterprise's use of its data; in serious cases they lead the enterprise to make wrong decisions and lose significant value, which in turn causes a crisis of trust.
Many data quality detection and cleaning schemes have emerged to deal with such dirty data. Among data quality problems, duplicate data is comparatively hard to detect, because the duplication that enterprises face today is not only complete duplication but also partial duplication. For example, a social networking site may have tens of millions of users, some of whom have registered more than once, and the repeated registrations may differ only slightly in certain items of information. Identifying such duplicate user information is essential for maintaining the quality of the site.
Among the more representative duplicate detection schemes at present, some compute a unique hash code and a check code from the content of each record and then judge whether records are duplicates according to whether their hash codes and check codes are identical. These schemes are accurate and efficient, but they apply only to completely duplicated records. Other schemes train a duplicate detection model using machine learning. These schemes are flexible and are not restricted to duplicate detection in a single scenario, but a duplicate likelihood must be computed for every pair of records, so efficiency is low, and accuracy still needs improvement.
In view of the above drawbacks, the inventors, after prolonged research and testing, arrived at the data quality detection method and device for duplicate data of the present invention.
Summary of the invention
The purpose of the present invention is to provide a data quality detection method and device for duplicate data that overcome the above technical drawbacks and solve the problem of how to detect both partially and completely duplicated data accurately and quickly.
To achieve the above purpose, the technical solution adopted by the present invention is as follows. First, a data quality detection method for duplicate data is provided, comprising:
Step b: analyzing the data values of a training sample containing multiple records, and generating a model training set;
Step c: analyzing each pair in the model training set, and marking, manually or by an algorithm, the two records corresponding to the pair as record-duplicate or record-not-duplicate; then choosing whether to continue training; if so, redefining the training sample and returning to step b; otherwise proceeding to step d;
Step d: calculating the probability that records are duplicates given that one or more fields are duplicated, and selecting the field combinations with higher probability as sample field combinations;
Step e: analyzing the values of the data to be detected, and outputting the record numbers corresponding to each distinct value of each field;
Step f: performing duplicate detection on the analyzed data to be detected according to the sample field combinations, and selecting all record combinations whose duplicated fields satisfy a sample field combination.
Preferably, the data quality detection method further comprises:
Step a: extracting the training sample from the source of the data to be detected; step a precedes step b.
Preferably, the data quality detection method further comprises:
Step g: outputting the retained record combinations and the probability that each record combination is a duplicate; step g follows step f.
Preferably, step b comprises:
Step b2: analyzing the data values of the training sample, and collecting the record numbers corresponding to each distinct value of each field;
Step b3: processing the record numbers corresponding to each distinct value of each field to generate the model training set.
Preferably, step b3 comprises:
Step b31: for field 1, finding the values that correspond to exactly two records; the two records of each such value form one pair; recording the pair and adding a field-duplicate mark at field 1;
Step b32: for field 1, finding the values that correspond to three or more records; combining the records of each such value two by two into pairs; recording each pair and adding a field-duplicate mark at field 1;
Step b33: for field 2, finding the values that correspond to two or more records; combining the records of each such value two by two into pairs; if a pair is identical to a pair already recorded, adding the field-duplicate mark at field 2 of the recorded pair; if a pair differs from every recorded pair, recording the pair and adding the field-duplicate mark at field 2;
Step b34: processing the remaining fields as in step b33; all the pairs thus formed constitute the model training set.
Preferably, step d comprises:
Step d1: for a given field, taking the number of pairs marked field-duplicate on that field as the divisor and the number of those pairs that are also marked record-duplicate as the dividend; the quotient is the probability that records are duplicates given that the field is duplicated;
Step d2: from the single-field probabilities, calculating the probability that records are duplicates given that multiple fields are duplicated;
Step d3: setting a threshold, and selecting the field combinations whose record-duplicate probability is greater than or equal to the threshold as the sample field combinations.
Preferably, the probability that records are duplicates given that multiple fields are duplicated is calculated as:
$$p(1,2,\dots,k)=\sum_{i=1}^{k}p_i-\sum_{1\le i<j\le k}p_ip_j+\sum_{1\le i<j<m\le k}p_ip_jp_m-\dots+(-1)^{k-1}p_1p_2\cdots p_k$$
In the formula, p(1,2,…,k) is the probability that records are duplicates given that fields 1, 2, …, k are duplicated; p_1, p_2, p_i, p_j, p_m, p_k are the probabilities that records are duplicates given that field 1, 2, i, j, m, k respectively is duplicated.
Preferably, step f comprises:
Step f1: determining the minimum number N of fields among the combinations in the sample field combinations;
Step f2: searching the two-record combinations of the data to be detected for record combinations with at least N identical fields, and detecting and retaining those record combinations that fall within the sample field combinations;
Step f3: starting from the retained record combinations of n-1 records known to have at least N identical fields, searching for record combinations of n records with at least N identical fields; terminating if none is found;
Step f4: detecting and retaining the record combinations of n records that fall within the sample field combinations, and deleting from the retained record combinations of n-1 records every subset of n-1 records of a retained n-record combination; returning to step f3.
Preferably, in step f3, the search must satisfy the following conditions:
a record combination of n records is formed by joining two record combinations of n-1 records, where the two (n-1)-record combinations have n-2 records in common;
every subset of n-1 records of the newly formed n-record combination is among the (n-1)-record combinations with at least N identical fields.
Second, a data quality detection device for duplicate data corresponding to the above data quality detection method is provided, comprising:
a training set generation unit, which analyzes the data values of a training sample containing multiple records and generates a model training set;
a sample record marking unit, which analyzes each pair in the model training set and marks, manually or by an algorithm, the two records corresponding to the pair as record-duplicate or record-not-duplicate; it then chooses whether to continue training; if so, the training sample is redefined and control returns to the training set generation unit; otherwise control passes to the sample combination screening unit;
a sample combination screening unit, which calculates the probability that records are duplicates given that one or more fields are duplicated, and selects the field combinations with higher probability as sample field combinations;
a detection data analysis unit, which analyzes the values of the data to be detected and outputs the record numbers corresponding to each distinct value of each field;
a detection data screening unit, which performs duplicate detection on the analyzed data to be detected according to the sample field combinations, and selects all record combinations whose duplicated fields satisfy a sample field combination.
Compared with the prior art, the beneficial effects of the present invention are as follows. Conventional duplicate detection methods must compare every pair of records. The present method instead calculates a duplicate likelihood for field combinations and converts detection between records into detection of identical record combinations within the corresponding field combinations, so the duplicate likelihood of arbitrary pairs of records need not be compared; this shortens the running time and improves detection efficiency. The method is not limited to detecting two records that are completely identical; it can also detect two records that are partially identical, by calculating their duplicate probability and deciding according to a threshold whether they are duplicates. The data quality analyst can define the criteria for judging whether two records are identical. Through the choice of training sample, the method automatically assigns weights to different fields, which provides a degree of flexibility. The probability that records are duplicates given that multiple fields are duplicated can be calculated quickly from the formula, which speeds up judgment, saves time, and improves detection efficiency; the formula is simple and saves system resources. Once the model training set has been generated, the analysis of records is converted into an analysis of the same field across records, which speeds up subsequent processing, reduces errors, and improves the accuracy of duplicate judgment. The training sample is extracted from the source of the data to be detected; because the training sample and the data to be detected share the same source, the accuracy of duplicate judgment can be improved.
Brief description of the drawings
Fig. 1 is a flowchart of the data quality detection method for duplicate data of the present invention;
Fig. 2 is a flowchart of step b of the data quality detection method of the present invention;
Fig. 3 is a flowchart of step b3 of the data quality detection method of the present invention;
Fig. 4 is a flowchart of step d of the data quality detection method of the present invention;
Fig. 5 is probability calculation schematic diagram 1 of the data quality detection method of the present invention;
Fig. 6 is probability calculation schematic diagram 2 of the data quality detection method of the present invention;
Fig. 7 is a flowchart of step e of the data quality detection method of the present invention;
Fig. 8 is a flowchart of step f of the data quality detection method of the present invention;
Fig. 9 is a flowchart of embodiment 1 of the data quality detection method of the present invention;
Fig. 10 is a flowchart of embodiment 2 of the data quality detection method of the present invention;
Fig. 11 is a flowchart of embodiment 3 of the data quality detection method of the present invention;
Fig. 12 is a partial table of the data to be detected in an example of the data quality detection method of the present invention;
Fig. 13 is a partial table of the record numbers corresponding to distinct values in the example;
Fig. 14 is a partial table of the field-duplicate marks of the pairs in the example;
Fig. 15 is a partial table of the record-duplicate marks of the pairs in the example;
Fig. 16 shows the field combinations retained in the example;
Fig. 17 is a structure diagram of the data quality detection device for duplicate data of the present invention;
Fig. 18 is a structure diagram of the training set generation unit of the data quality detection device of the present invention;
Fig. 19 is a structure diagram of the record number processing module of the data quality detection device of the present invention;
Fig. 20 is a structure diagram of the sample combination screening unit of the data quality detection device of the present invention;
Fig. 21 is a structure diagram of the detection data analysis unit of the data quality detection device of the present invention;
Fig. 22 is a structure diagram of the detection data screening unit of the data quality detection device of the present invention;
Fig. 23 is a structure diagram of embodiment 6 of the data quality detection device of the present invention;
Fig. 24 is a structure diagram of embodiment 7 of the data quality detection device of the present invention;
Fig. 25 is a structure diagram of embodiment 8 of the data quality detection device of the present invention.
Detailed description of the embodiments
The above and other technical features and advantages are described in more detail below with reference to the drawings.
Fig. 1 is a flowchart of the data quality detection method for duplicate data of the present invention. The data quality detection method for duplicate data comprises:
Step b: analyzing the data values of the training sample, and generating a model training set;
The training sample contains multiple records. Each record has a corresponding number, the record number; record numbers are assigned in order and increase successively. Every record is divided into multiple fields: field 1, field 2, field 3, field 4, and so on. Each field therefore has one value in every record, so a field has as many values as there are records (some of these values are identical and some differ), and the position of a value within a field corresponds to the record number. For example, the first value of field 1 is the value of field 1 in the first record.
The training sample may be written by a data analyst according to the specific situation, or it may be extracted from the data to be detected.
Fig. 2 is a flowchart of step b of the data quality detection method of the present invention. Step b comprises:
Step b2: analyzing the data values of the training sample, and collecting the record numbers corresponding to each distinct value of each field;
A field has multiple values, some identical and some different. Identical values are merged, and the record numbers of the merged values are attached to the merged value. Each field then has a set of distinct values, each labeled with at least one record number. All fields are processed in this way, which yields the record numbers corresponding to each distinct value of each field.
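Step b2 can be sketched in Python as an inverted index from field values to record numbers. This is a minimal illustration, not the patented implementation: the function name `index_values`, the dictionary-based record layout, and the sample field names are assumptions made for the example.

```python
from collections import defaultdict

def index_values(records, fields):
    """Map each field to {distinct value: [record numbers]}.

    Record numbers start at 1 and follow the order of `records`
    (an assumption mirroring the sequential numbering in the text).
    """
    index = {f: defaultdict(list) for f in fields}
    for number, record in enumerate(records, start=1):
        for f in fields:
            index[f][record[f]].append(number)
    return {f: dict(values) for f, values in index.items()}

# Hypothetical three-record sample.
sample = [
    {"name": "Ann", "email": "ann@x.org"},
    {"name": "Ann", "email": "ann@y.org"},
    {"name": "Bob", "email": "ann@x.org"},
]
value_index = index_values(sample, ["name", "email"])
```

Each distinct value ends up labeled with at least one record number, which is the structure the later steps consume.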
Step b3: processing the record numbers corresponding to each distinct value of each field to generate the model training set;
The model training set consists of every pair of records that share at least one duplicated field, together with the marks of the fields they duplicate. Starting from the record numbers collected for each distinct value of each field: if a value corresponds to two record numbers, the two records form one pair, and a field-duplicate mark is added at that field of the pair; if a value corresponds to three or more record numbers, the record numbers are combined two by two into pairs, and a field-duplicate mark is added at the corresponding field of each pair. Identical pairs are merged, and the merged pair carries all the field-duplicate marks of the pairs before merging. The result is the model training set.
If a value corresponds to two record numbers, one pair is obtained; if it corresponds to three record numbers, combining the three records two by two gives three pairs; if it corresponds to four record numbers, combining the four records two by two gives six pairs; if it corresponds to N record numbers, combining the N records two by two gives $\binom{N}{2}=\frac{N(N-1)}{2}$ pairs.
Once the model training set has been generated, the analysis of records is converted into an analysis of the same field across records, which speeds up subsequent processing.
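The pair generation described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the input is a hypothetical value index of the form {field: {value: [record numbers]}}, and the name `build_training_pairs` and the set-of-fields representation of the field-duplicate marks are choices made for the example.

```python
from itertools import combinations

def build_training_pairs(value_index):
    """Return {(rec_a, rec_b): set of fields on which the two records share a value}."""
    pairs = {}
    for field, values in value_index.items():
        for record_numbers in values.values():
            # A value held by m records yields m*(m-1)/2 pairs; m == 1 yields none.
            for pair in combinations(sorted(record_numbers), 2):
                # setdefault merges identical pairs and accumulates their marks.
                pairs.setdefault(pair, set()).add(field)
    return pairs

# Hypothetical value index for four records.
value_index = {
    "name": {"Ann": [1, 2, 4], "Bob": [3]},
    "email": {"ann@x.org": [1, 2], "bob@x.org": [3, 4]},
}
training_pairs = build_training_pairs(value_index)
```

Only records that share at least one value ever become a pair, so pairs with no duplicated field are never enumerated.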
Fig. 3 is a flowchart of step b3 of the data quality detection method of the present invention. The specific steps of step b3 may be:
Step b31: for field 1, finding the values that correspond to exactly two records; the two records of each such value form one pair; recording the pair and adding a field-duplicate mark at field 1;
Step b32: for field 1, finding the values that correspond to three or more records; combining the records of each such value two by two into pairs; recording each pair and adding a field-duplicate mark at field 1;
Step b33: for field 2, finding the values that correspond to two or more records; combining the records of each such value two by two into pairs; if a pair is identical to a pair already recorded, adding the field-duplicate mark at field 2 of the recorded pair; if a pair differs from every recorded pair, recording the pair and adding the field-duplicate mark at field 2;
Step b34: processing the remaining fields as in step b33; all the pairs thus formed constitute the model training set.
Steps b31 to b34 are only one way of generating the model training set. This way generates the training set quickly while avoiding omitted or duplicated pairs.
Step c: analyzing each pair in the model training set, and marking, manually or by an algorithm, the two records corresponding to the pair as record-duplicate or record-not-duplicate; then choosing whether to continue training; if so, redefining the training sample and returning to step b; otherwise proceeding to step d;
Each pair in the model training set corresponds to two records. The pair is output and the actual data of the two records are compared to confirm whether they are the same; if they are, the pair is marked record-duplicate, and if not, it is marked record-not-duplicate. Whether the two records of a pair are duplicates may be judged by the quality analyst by inspecting the data of the two records, or determined by an algorithm that computes their similarity.
Whether training should continue or be repeated can then be decided from the comparison results of the output pairs. If so, the training sample is redefined, the method returns to step b, and the two records of every new pair are again judged for duplication; subsequent analysis uses the combined results of the several training rounds to improve the accuracy of judgment. If not, the method proceeds to step d.
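Where the marking in step c is done by an algorithm rather than by hand, one possible sketch is to average a string similarity over the fields and compare it with a cutoff. The text does not prescribe a particular algorithm; the use of `difflib.SequenceMatcher`, the averaging rule, the name `label_pair`, and the cutoff value 0.9 are all assumptions for illustration.

```python
from difflib import SequenceMatcher

def label_pair(rec_a, rec_b, fields, cutoff=0.9):
    """Return True (record-duplicate) when the mean field similarity reaches `cutoff`.

    Both the similarity measure and the cutoff are illustrative assumptions.
    """
    scores = [
        SequenceMatcher(None, str(rec_a[f]), str(rec_b[f])).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores) >= cutoff

same = label_pair({"name": "Ann", "city": "Oslo"},
                  {"name": "Ann", "city": "Oslo"}, ["name", "city"])
different = label_pair({"name": "abc", "city": "def"},
                       {"name": "xyz", "city": "uvw"}, ["name", "city"])
```

A quality analyst could review pairs whose score is close to the cutoff by hand, which keeps the manual option described above available.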
Step d: calculating the probability that records are duplicates given that one or more fields are duplicated, and selecting the field combinations with higher probability as sample field combinations;
Fig. 4 is a flowchart of step d of the data quality detection method of the present invention. Step d comprises:
Step d1: for a given field, taking the number of pairs marked field-duplicate on that field as the divisor and the number of those pairs that are also marked record-duplicate as the dividend; the quotient is the probability that records are duplicates given that the field is duplicated;
First, for each field, count the number x of pairs in which the field is marked duplicate, and the number y of those pairs that are also marked record-duplicate. Then compute y/x for each field; this value is interpreted as the probability that two records are duplicates given that they have the same value in this field.
Each pair in the model training set has multiple fields (field 1, field 2, field 3, field 4, and so on), and each pair is duplicated on at least one field. Likewise, each field appears in multiple pairs and is duplicated in at least one of them.
Each pair corresponds to two records, and each pair carries either a record-duplicate mark or a record-not-duplicate mark. Thus each pair is duplicated on at least one field and carries exactly one of these two marks.
Each field therefore has a set of pairs in which it is marked field-duplicate, and some of those pairs are marked record-duplicate. Dividing the size of the latter set by the size of the former gives the likelihood (probability) that records are duplicates given that this field is duplicated.
For example, if field 1 is duplicated in x of the pairs, and y of those x pairs are marked record-duplicate, then the probability that records are duplicates given that field 1 is duplicated is y/x.
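The y/x calculation of step d1 can be sketched as below. The sketch assumes two hypothetical inputs that are not named in the text: `training_pairs`, mapping each pair to the set of fields it duplicates, and `record_labels`, mapping each pair to True (record-duplicate) or False (record-not-duplicate).

```python
def field_duplicate_probability(training_pairs, record_labels):
    """For each field, return y/x, where x counts pairs duplicated on the field
    and y counts those of them that are also marked record-duplicate."""
    x, y = {}, {}
    for pair, fields in training_pairs.items():
        for f in fields:
            x[f] = x.get(f, 0) + 1
            if record_labels[pair]:
                y[f] = y.get(f, 0) + 1
    return {f: y.get(f, 0) / x[f] for f in x}

# Hypothetical training set with four pairs.
training_pairs = {
    (1, 2): {"name", "email"},
    (1, 4): {"name"},
    (2, 4): {"name"},
    (3, 4): {"email"},
}
record_labels = {(1, 2): True, (1, 4): False, (2, 4): False, (3, 4): False}
probs = field_duplicate_probability(training_pairs, record_labels)
```

Here "name" is duplicated in three pairs of which one is a record duplicate, and "email" in two pairs of which one is a record duplicate.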
Step d2: from the single-field probabilities, calculating the probability that records are duplicates given that multiple fields are duplicated;
The probability that records are duplicates given that multiple fields are duplicated is calculated as:
$$p(1,2,\dots,k)=\sum_{i=1}^{k}p_i-\sum_{1\le i<j\le k}p_ip_j+\sum_{1\le i<j<m\le k}p_ip_jp_m-\dots+(-1)^{k-1}p_1p_2\cdots p_k$$
In the formula, p(1,2,…,k) is the probability that records are duplicates given that fields 1, 2, …, k are duplicated; that is, if fields 1, 2, …, k are duplicated in two records, the likelihood that the two records are duplicates is p(1,2,…,k). p_1, p_2, p_i, p_j, p_m, p_k are the probabilities that records are duplicates given that field 1, 2, i, j, m, k respectively is duplicated.
The idea behind the formula is as follows. From the k fields whose probability is to be calculated, take one field: there are k ways to do so, and the value for each choice is the single probability p_i. Take two fields: there are $\binom{k}{2}$ ways, and the value for each choice is the product of two probabilities p_i p_j. Continue in this way up to taking all k fields: there is $\binom{k}{k}=1$ way, and its value is the product of the k probabilities p_1 p_2 … p_k. The sum of the values for each choice size is given a coefficient determined by the number of fields taken: +1 if an odd number is taken, -1 if an even number is taken. Adding these signed sums gives the probability that records are duplicates given that the k fields are duplicated.
When k is 2,
$$p(1,2)=p_1+p_2-p_1p_2$$
Fig. 5 is probability calculation schematic diagram 1 of the data quality detection method of the present invention. In it, p_1 p_2 is the region where circles p_1 and p_2 overlap; it must be subtracted to obtain the total area p(1,2).
When k is 3,
$$p(1,2,3)=p_1+p_2+p_3-p_1p_2-p_1p_3-p_2p_3+p_1p_2p_3$$
Fig. 6 is probability calculation schematic diagram 2 of the data quality detection method of the present invention. In it, p_1 p_2 is the region where circles p_1 and p_2 overlap, p_1 p_3 is the region where circles p_1 and p_3 overlap, and p_2 p_3 is the region where circles p_2 and p_3 overlap; these must be subtracted. p_1 p_2 p_3 is the region where circles p_1, p_2, and p_3 all overlap; it has been subtracted more than once and must be added back to obtain the total area p(1,2,3).
Beneficial effect: the probability that records are duplicates given that multiple fields are duplicated can be calculated quickly from the formula, which speeds up judgment, saves time, and improves data quality detection efficiency; the formula is simple and saves system resources.
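The alternating sum in step d2 can be written directly in Python, as a sketch rather than a reference implementation. The function name `combined_probability` is an assumption. As a sanity check, the alternating sum is algebraically equal to 1 - (1 - p_1)(1 - p_2)…(1 - p_k), a standard identity that is not stated in the text above but follows from expanding the product.

```python
from itertools import combinations
from math import prod

def combined_probability(probs):
    """Alternating sum over all non-empty subsets of the single-field probabilities."""
    total = 0.0
    for size in range(1, len(probs) + 1):
        sign = 1 if size % 2 == 1 else -1  # +1 for odd subset sizes, -1 for even
        total += sign * sum(prod(c) for c in combinations(probs, size))
    return total

p_two = combined_probability([0.5, 0.5])
p_three = combined_probability([0.2, 0.5, 0.4])
```

For large k the product form is cheaper, since the alternating sum visits 2^k - 1 subsets while the product needs only k multiplications.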
Step d3: setting a threshold, and selecting the field combinations whose record-duplicate probability is greater than or equal to the threshold as the sample field combinations.
The multi-field probabilities calculated in step d2 are screened against a threshold. The threshold may be set manually according to the actual situation, determined by a computing device through rigorous calculation, or obtained from a statistical comparison over a large volume of data.
The threshold determines the accuracy with which the present invention detects duplicate data: the higher the threshold, the higher the accuracy of the data quality detection.
Suppose the data to be detected have n fields; then 1 < k≤n. Once the threshold is set, the field combinations whose duplicate likelihood reaches the threshold are retained. The retained field combinations serve as the sample field combinations for the subsequent duplicate detection.
With the formula, judging whether different records are duplicates is converted into calculating duplicate probabilities. This avoids judging every pair of records separately; only the probabilities of the qualifying combinations need to be calculated, which greatly improves the efficiency of judgment.
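Step d3 can be sketched as an enumeration of field combinations followed by a threshold test. The sketch makes several assumptions: the name `select_field_combinations`, the sample field names and probabilities, and the use of the product form 1 - (1 - p_1)…(1 - p_k), which is algebraically equal to the alternating-sum formula of step d2.

```python
from itertools import combinations
from math import prod

def select_field_combinations(field_probs, threshold):
    """Return {field combination: probability} for combinations at or above `threshold`."""
    selected = {}
    names = sorted(field_probs)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Product form of the multi-field probability (equivalent to the alternating sum).
            p = 1 - prod(1 - field_probs[f] for f in combo)
            if p >= threshold:
                selected[combo] = p
    return selected

# Hypothetical single-field probabilities and threshold.
field_probs = {"name": 0.3, "email": 0.8, "phone": 0.6}
sample_combinations = select_field_combinations(field_probs, threshold=0.85)
```

With these numbers no single field reaches 0.85, while the combinations (email, name), (email, phone), and (email, name, phone) do.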
Step e: analyzing the values of the data to be detected, and outputting the record numbers corresponding to each distinct value of each field;
This step is similar to step b; the only difference is that step b processes the training sample, whereas this step processes the data to be detected.
The data to be detected contain multiple records. Each record has a corresponding number, the record number; record numbers are assigned in order and increase successively. Every record is divided into multiple fields: field 1, field 2, field 3, field 4, and so on. Each field therefore has one value in every record, so a field has as many values as there are records (some of these values are identical and some differ), and the position of a value within a field corresponds to the record number. For example, the first value of field 1 is the value of field 1 in the first record.
Fig. 7 is a flowchart of step e of the data quality detection method of the present invention. Step e comprises:
Step e1: calculating the similarity between values within the same field of the data to be detected, and treating values whose similarity reaches or exceeds a threshold as identical values;
Here an algorithm is used to calculate the similarity of values within each field that are very close to one another, and a threshold defined by the data quality analyst determines the level of similarity at which such values are treated as identical.
The similarity may be calculated with the Levenshtein algorithm, the longest common subsequence algorithm, or similar algorithms; the specific algorithm can be chosen according to actual needs.
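As one concrete instance of the similarity calculation in step e1, the following sketch computes the Levenshtein edit distance with the standard dynamic-programming recurrence and turns it into a similarity in [0, 1]. Normalizing by the length of the longer string is an assumption of this example; the text does not specify how distance is mapped to similarity.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]

def similarity(a, b):
    """1.0 for identical strings, decreasing toward 0.0 as edit distance grows."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

distance = levenshtein("kitten", "sitting")
score = similarity("kitten", "sitting")
```

Values whose `similarity` reaches the analyst's threshold would then be merged into one value before the record numbers are collected in step e2.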
Step e2: analyzing the data values of the data to be detected, and collecting the record numbers corresponding to each distinct value of each field;
A field has multiple values, some identical and some different. Identical values are merged, and the record numbers of the merged values are attached to the merged value. Each field then has a set of distinct values, each labeled with at least one record number. All fields are processed in this way, which yields the record numbers corresponding to each distinct value of each field.
Step f: performing duplicate detection on the analyzed data to be detected according to the sample field combinations, and selecting all record combinations whose duplicated fields satisfy a sample field combination;
This step performs the duplicate detection. First, using the result of step d, it checks whether the duplicated fields of two records satisfy a sample field combination. Then three-record combinations are generated from the two-record combinations that satisfy the condition, and the check of whether the duplicated fields of the three records satisfy a sample field combination is repeated. This process continues until no further record combination satisfying the sample field combinations can be found.
Conventional duplicate detection methods must compare every pair of records. The present method instead calculates a duplicate likelihood for field combinations and converts detection between records into detection of identical record combinations within the corresponding field combinations, so the duplicate likelihood of arbitrary pairs of records need not be compared; this shortens the running time and improves detection efficiency. The method is also not limited to detecting two records that are completely identical; it can detect two records that are partially identical, by calculating their duplicate probability and deciding according to the threshold whether they are duplicates. The data quality analyst can define the criteria for judging whether two records are identical.
In addition, through the choice of training sample, the method automatically assigns weights to different fields, which provides a degree of flexibility.
Fig. 8 is a flowchart of step f of the data quality detection method of the present invention. Step f comprises:
Step f1: determining the minimum number N of fields among the combinations in the sample field combinations;
In general, the number of record combinations that share duplicated fields decreases as the number of duplicated fields increases. The minimum number N of fields among the sample field combinations is therefore determined, so that record combinations with fewer than N duplicated fields need not be searched; this reduces the number of record combinations to be searched and improves search efficiency.
For example, if every sample field combination has at least 4 duplicated fields, only record combinations with at least 4 duplicated fields need to be searched, which improves search efficiency.
Step f2: searching the two-record combinations of the data to be detected for record combinations with at least N identical fields, and detecting and retaining those record combinations that fall within the sample field combinations;
N is the minimum number of fields among the sample field combinations. If a record combination of the data to be detected has fewer than N identical fields, it cannot fall within the sample field combinations. Searching only for record combinations with at least N identical fields therefore reduces search time and improves search efficiency.
Step f3: in the retained record combinations, search for n-record combinations with at least N identical fields on the basis of the known (n-1)-record combinations with at least N identical fields; if none is found, the process ends.
In this step, the following conditions must be satisfied when n-record combinations with at least N identical fields are searched for on the basis of the known (n-1)-record combinations with at least N identical fields:
1) an n-record combination is formed by merging two (n-1)-record combinations, and those two (n-1)-record combinations have n-2 records in common;
2) every (n-1)-record subset of the newly formed n-record combination is among the (n-1)-record combinations with at least N identical fields.
Condition 1) means that an n-record combination must be formed by merging two combinations that each contain n-1 records and share n-2 records. For example, a 4-record combination must be formed by merging two 3-record combinations that have 2 records in common; only then does the merged combination contain exactly 4 records.
Condition 2) means that the merged n-record combination has $C_n^{n-1} = n$ subsets of n-1 records, and each of these subsets must be found among the previously obtained (n-1)-record combinations. In other words, the n-record combination is considered formable only if all n of its (n-1)-record subsets exist among the (n-1)-record combinations. For example, the 4-record combination (1,2,3,4) has 4 subsets of 3 records, namely (1,2,3), (1,2,4), (1,3,4) and (2,3,4), and all 4 of these 3-record combinations must be found among the previously obtained 3-record combinations.
Every retained 3-record combination has at least N identical fields, so a 4-record combination merged from them may also have at least N identical fields. However, if one subset, for example (1,2,3), of a 4-record combination such as (1,2,3,4) has only N-1 identical fields (that is, it is not among the 3-record combinations with at least N identical fields), then the 4-record combination (1,2,3,4) cannot have N identical fields. It has at most N-1 identical fields, and these are necessarily identical fields of the subset (1,2,3). Condition 2) must therefore hold.
Only when condition 1) and condition 2) hold at the same time is it necessary to check whether the n-record combination has at least N identical fields.
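Conditions 1) and 2) of step f3 can be illustrated with a short Python sketch. It is not taken from the embodiments: the function name `join_candidates` and the representation of a record combination as a sorted tuple of record numbers are assumptions made for illustration. The sketch only generates candidate n-record combinations; whether each candidate really has at least N identical fields still has to be checked afterwards.

```python
from itertools import combinations

def join_candidates(prev_combos):
    """Build candidate n-record combinations from retained (n-1)-record ones.

    prev_combos: iterable of record-number tuples, all of the same length n-1.
    (Hypothetical helper; names are illustrative only.)
    """
    prev = {tuple(sorted(c)) for c in prev_combos}
    candidates = set()
    for a, b in combinations(sorted(prev), 2):
        merged = tuple(sorted(set(a) | set(b)))
        # Condition 1: the two (n-1)-record combinations share n-2 records,
        # so their union contains exactly n records.
        if len(merged) != len(a) + 1:
            continue
        # Condition 2: every (n-1)-record subset of the candidate must
        # already be among the known (n-1)-record combinations.
        if all(sub in prev for sub in combinations(merged, len(a))):
            candidates.add(merged)
    return sorted(candidates)
```

Applied to the 3-record combinations (1,2,3), (1,2,4), (1,3,4) and (2,3,4), the sketch yields the single candidate (1,2,3,4), as in the example above.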
Step f4: detect and retain the n-record combinations that fall within the sample field combinations, and at the same time delete, from the (n-1)-record combinations, all (n-1)-record subsets of the retained n-record combinations; then return to step f3.
The n-record combinations found in step f3 are checked. If the field combination formed by the identical fields of an n-record combination is in the sample field combinations, the n records are duplicates and the record combination is retained; if it is not in the sample field combinations, the n records are not duplicates and the record combination is deleted.
In addition, the statement that the n records of an n-record combination are identical and the statement that the n-1 records of each of its $C_n^{n-1} = n$ subsets of n-1 records are identical express the same meaning, namely that these n records are duplicates. Only one expression of the same meaning needs to be kept; that is, the n record combinations of n-1 records are merged into a single record combination of n records. Therefore, when an n-record combination is retained, its n corresponding (n-1)-record combinations need to be deleted.
For example, if the identical fields of the 4-record combination (1,2,3,4) are in the sample field combinations, the combination (1,2,3,4) is retained and its 4 corresponding 3-record combinations (1,2,3), (1,2,4), (1,3,4) and (2,3,4) are deleted.
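The retention and deletion performed in step f4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiments: `retain_and_prune` is a hypothetical name, record combinations are assumed to be sorted tuples of record numbers, and `shared_fields` is an assumed callback that returns the set of field indices whose values are identical across a combination.

```python
from itertools import combinations

def retain_and_prune(candidates, prev_kept, shared_fields, sample_combos):
    """Step f4 sketch (hypothetical helper).

    candidates:    n-record combinations found in step f3 (sorted tuples).
    prev_kept:     set of retained (n-1)-record combinations (sorted tuples).
    shared_fields: callable, combination -> frozenset of identical field indices.
    sample_combos: set of frozensets, the sample field combinations.
    """
    # Retain only candidates whose identical fields form a sample field combination.
    kept = {c for c in candidates if shared_fields(c) in sample_combos}
    # Delete every (n-1)-record subset of a retained n-record combination,
    # because the larger combination already expresses the same duplication.
    pruned_prev = set(prev_kept)
    for combo in kept:
        for sub in combinations(combo, len(combo) - 1):
            pruned_prev.discard(sub)
    return kept, pruned_prev
```

The returned set of n-record combinations would then serve as the input of the next pass through step f3.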
Through steps f1 to f4, all possible combinations can be covered by step-by-step calculation, which avoids omitting records that may be duplicates.
Embodiment one
This embodiment is based on the data quality checking method for repeated data described above and differs from it as follows. As shown in Figure 9, which is a flow chart of embodiment one of the data quality checking method for repeated data of the present invention, the data quality checking method further includes:
Step g: output the retained record combinations and the probability that each record combination is a duplicate; step g follows step f.
The output in this step may take different forms: it may be presented graphically, or the detection results may be output in a form that facilitates merging records. All retained record combinations and their duplication probabilities may be output, or only part of the retained record combinations and their duplication probabilities.
Embodiment two
This embodiment is based on the data quality checking method for repeated data described above and differs from it as follows. As shown in Figure 10, which is a flow chart of embodiment two of the data quality checking method for repeated data of the present invention, step b further includes:
Step b1: calculate the similarity between values in the same field of the training sample, and treat similar values whose similarity reaches or exceeds a threshold as identical values; step b1 precedes step b2.
Here an algorithm is used to calculate the similarity of values in each field that are very close to one another, and the data quality analyst defines the threshold that determines the level of similarity at which such values are treated as identical.
The algorithm for calculating similarity may be, for example, the Levenshtein algorithm or the longest common subsequence algorithm; the specific algorithm can be selected according to actual needs.
The data in the training sample may vary slightly because of errors, so that the values of the same field in two records are very similar but not identical. Adding this step eliminates such errors and improves the accuracy of judging duplicate data.
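As one possible illustration of step b1, the following Python sketch computes a Levenshtein-based similarity and maps similar values to a common representative. The normalisation of the edit distance by the longer string length and the greedy grouping in `merge_similar` are assumptions made here for illustration; the embodiments leave the concrete algorithm and threshold to the data quality analyst.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (classic dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Similarity in [0, 1]; the normalisation is an assumption of this sketch."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

def merge_similar(values, threshold):
    """Map each value to the first earlier value that is similar enough (greedy)."""
    representatives, mapping = [], {}
    for v in values:
        for r in representatives:
            if similarity(v, r) >= threshold:
                mapping[v] = r
                break
        else:
            representatives.append(v)
            mapping[v] = v
    return mapping
```

With a threshold of 0.75, for instance, the values 1aaaa and 1aaab (edit distance 1 over 5 characters) would be treated as the same value, while 1bbbb would remain distinct.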
Embodiment three
This embodiment is based on the data quality checking method for repeated data described above and differs from it as follows. As shown in Figure 11, which is a flow chart of embodiment three of the data quality checking method for repeated data of the present invention, the data quality checking method further includes:
Step a: extract a training sample from the source of the data to be detected; step a precedes step b.
The source of the data to be detected contains a plurality of records, and each record has a corresponding number, the record number. Record numbers are arranged in order and increase successively. Every record is divided into multiple fields: field 1, field 2, field 3, field 4, and so on. The same field thus has one value in every record, so each field has as many values as there are records (some of these values are identical and some are different), and the numbering of the values of a field corresponds to the numbering of the records. Here, the first value of field 1 and field 1 of the first record are the same item, so their values are naturally identical.
The number of records extracted may be decided by the data quality analyst or determined according to actual needs.
Because the training sample and the data to be detected come from the same source, extracting the training sample from the source of the data to be detected improves the accuracy of judging duplicate data.
Embodiment four
This embodiment is a specific example of data quality checking using the data quality checking method for repeated data described above, as follows:
S1: Part of the example data to be detected is shown in Figure 12. A training sample is extracted from the source of the data to be detected; the number of records in the sample is predefined by the data quality analyst and is assumed to be 1000.
S2: Analyze the data values of the training sample and output the record numbers corresponding to each distinct value of each field; partial results are shown in Figure 13.
S2.1: Some values in a field may be very close, with only individual characters differing, such as 1aaaa and 1aaab in Col1. A suitable method can be used to calculate the similarity of these values, and the data quality analyst sets a threshold to judge whether they are identical. It is assumed here that 1aaaa and 1aaab are judged to be identical.
S2.2: Analyze the above results and output combination pairs from each value that has 2 or more record numbers; partial results are shown in Figure 14. The process is as follows:
S2.2.1: In Col1, the three values 1aaaa/1aaab, 1bbbb and 1eeee each have 2 records, which form 3 combination pairs: (1,2), (3,5) and (6,7). Each combination pair becomes one entry, with its Col1 repeat flag set to 1 and its remaining flags set to 0.
S2.2.2: In Col2, 2aaaa and 2eeee form two combination pairs, (1,2) and (6,7). These two pairs are already among the previously formed combination pairs, so the Col2 repeat flag of the existing pairs is set to 1, indicating that these pairs also repeat in Col2. 2bbbb forms three combination pairs: (3,4), (3,5) and (4,5). Of these, (3,5) is among the previously formed pairs and is handled as above. (3,4) and (4,5) are newly generated combination pairs; new entries are created for them with the Col2 repeat flag set to 1 and the remaining flags set to 0.
S2.2.3: Col3 to Col5 are handled in the same manner. All combination pairs formed constitute the duplicate-model training set.
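The construction of the training set in S2 and S2.2 can be sketched in Python as follows. The function name `build_training_set`, the representation of records as a dictionary from record number to a tuple of field values, and the list of 0/1 flags per pair are assumptions for illustration; the sketch also assumes that similar values have already been unified as in S2.1.

```python
from collections import defaultdict
from itertools import combinations

def build_training_set(records):
    """Return {(a, b): [flag per field]} for all pairs sharing at least one value.

    records: dict mapping record number -> tuple of field values
             (hypothetical layout; similar values are assumed already merged).
    """
    n_fields = len(next(iter(records.values())))
    pairs = {}
    for k in range(n_fields):
        # S2: record numbers corresponding to each distinct value of field k.
        by_value = defaultdict(list)
        for rec_no in sorted(records):
            by_value[records[rec_no][k]].append(rec_no)
        # S2.2: every value with 2 or more records yields combination pairs.
        for rec_nos in by_value.values():
            for pair in combinations(rec_nos, 2):
                # New pairs start with all flags 0; existing pairs are reused.
                flags = pairs.setdefault(pair, [0] * n_fields)
                flags[k] = 1
    return pairs
```

Because pairs are generated only from records that share a value in some field, records with nothing in common are never compared, which is the point of working from the per-value record numbers.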
S3: Output the above training set in random order, and the data quality analyst starts the model training process. The specific training method is as follows: each time, a certain number of combination pairs and their corresponding data are output, and the data quality analyst marks each combination pair as duplicate or not duplicate according to the content of the records in the pair.
S4: After marking the output combination pairs, the data quality analyst can choose whether to continue training the model. If so, the process returns to S1 and repeats the above steps; otherwise it proceeds as follows.
S5: Process the combination pairs marked over the several rounds of model training. Assume that part of the marked combination pairs are as shown in Figure 15, where the "whether repeated" column indicates whether the records of the pair were finally marked as duplicates, and the remaining columns have the same meaning as in Figure 14.
S5.1: First, for each field, count the number x of marked combination pairs in which that field repeats, and the number y of those pairs that were marked as duplicates. In Figure 15, for example, the number of combination pairs in which Col1 repeats is 3, of which 3 are marked as duplicates; the number of combination pairs in which Col4 repeats is 7, of which 3 are marked as duplicates; and so on. Calculate the y/x value of each field, which is interpreted as the likelihood that records identical in field k are duplicates. Assume that the final y/x values calculated for Col1 to Col5 are 0.4, 0.4, 0.4, 0.3 and 0.3 in turn. (Because the data in the figure are only part of the training sample data, accurate values cannot be calculated from the figure; values are assumed so that the subsequent steps can proceed, although this may make the final result differ considerably from the correct result.)
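The y/x calculation of S5.1 can be sketched in Python as follows. The name `field_repeat_probability` and the input layout (a list of pairs of per-field 0/1 flags and a duplicate mark) are assumptions for illustration, and returning 0.0 for a field that never repeats is a choice made here, not something the embodiment specifies.

```python
def field_repeat_probability(labeled_pairs):
    """Per-field probability y/x that a pair repeating in that field is a duplicate.

    labeled_pairs: list of (flags, is_duplicate), where flags is a list of 0/1
                   values, one per field, and is_duplicate is the analyst's mark.
    """
    n_fields = len(labeled_pairs[0][0])
    probabilities = []
    for k in range(n_fields):
        # x: marked pairs in which field k repeats.
        x = sum(1 for flags, _ in labeled_pairs if flags[k])
        # y: those of the x pairs that were marked as duplicates.
        y = sum(1 for flags, dup in labeled_pairs if flags[k] and dup)
        probabilities.append(y / x if x else 0.0)  # 0.0 when the field never repeats
    return probabilities
```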
S5.2: The data quality analyst sets a threshold that defines how high the duplication likelihood must be for records to be judged duplicates; assume this threshold is 0.75. Then calculate the likelihood that records with k identical fields are the same record, compare this value with the threshold, and keep the field combinations whose value is higher than the threshold, as shown in Figure 16.
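The screening in S5.2 can be sketched in Python as follows. The sketch enumerates every field combination, which is only practical for a small number of fields, and it combines the per-field values with the product form 1 - (1 - p_1)(1 - p_2)...(1 - p_k); this is algebraically equal to the alternating sum of products described for multi-field duplication probabilities in this document. The function name `screen_field_combinations` is hypothetical.

```python
from itertools import combinations
from math import prod

def screen_field_combinations(probs, threshold):
    """Return [(field indices, probability)] for combinations reaching the threshold.

    probs: per-field y/x values, indexed from 0 (Col1 is index 0, and so on).
    """
    kept = []
    for size in range(1, len(probs) + 1):
        for combo in combinations(range(len(probs)), size):
            # Probability that records repeating in all fields of combo are duplicates.
            p = 1 - prod(1 - probs[k] for k in combo)
            if p >= threshold:
                kept.append((combo, p))
    return kept

# Assumed y/x values and threshold taken from S5.1 and S5.2.
kept = screen_field_combinations([0.4, 0.4, 0.4, 0.3, 0.3], 0.75)
```

With these assumed values, no two-field combination reaches 0.75 (the best is 1 - 0.6 × 0.6 = 0.64), and the smallest combination kept is Col1, Col2, Col3 with 1 - 0.6³ = 0.784, so every kept combination has at least three fields.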
The above is the duplicate-model training process; duplicate detection is then carried out using the trained model.
S6: Receive the data to be detected and the field combinations finally kept. Then analyze the values of the data to be detected and output the record numbers corresponding to each distinct value of each field; partial results are shown in Figure 13. Some values in a field may be very close, with only individual characters differing, such as 1aaaa and 1aaab in Col1. A suitable method can be used to calculate the similarity of these values, and the data quality analyst sets a threshold to judge whether they are identical. It is assumed here that 1aaaa and 1aaab are judged to be identical.
S7: Perform duplicate detection. The detailed process is as follows:
S7.1: Since the field combinations finally kept all contain at least three fields, the duplicate records finally detected have at least three identical fields. First search for two-record combinations with at least three identical fields. The result is {(1,2), 3}, {(3,5), 4}, {(6,7), 5}, {(3,4), 4}, {(4,5), 3}, where the number outside the parentheses and inside the braces indicates how many fields repeat.
S7.2: Check whether the identical-field combination of each of the above record combinations is among the field combinations finally kept by the repetition decision condition generation unit 14; if not, delete the record combination. Here {(4,5), 3} is deleted.
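S7.1 and S7.2 can be sketched in Python as follows, assuming that the identical fields of every two-record combination have already been collected into a table. The function name `filter_pairs` and the table layout are hypothetical, and testing for exact membership of the identical-field set in the kept field combinations is an interpretation of S7.2 rather than something stated explicitly.

```python
def filter_pairs(pair_fields, sample_combos):
    """Keep the two-record combinations whose identical fields form a kept combination.

    pair_fields:   dict mapping (a, b) -> frozenset of identical field indices.
    sample_combos: set of frozensets, the field combinations finally kept.
    """
    # Minimum number of fields over the kept field combinations (three in S7.1).
    n_min = min(len(combo) for combo in sample_combos)
    kept = {}
    for pair, fields in pair_fields.items():
        if len(fields) < n_min:
            continue  # too few identical fields to match any kept combination
        if fields in sample_combos:  # S7.2: otherwise the pair is deleted
            kept[pair] = fields
    return n_min, kept
```

In the example above, a pair such as (4,5) is dropped at the membership test: it has three identical fields, but those three fields do not form one of the kept field combinations.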
S7.3: Among the remaining record combinations, search for n-record combinations with at least three identical fields, given the previously obtained (n-1)-record combinations with at least three identical fields. Then check whether these new combinations have at least three identical fields.
S7.4: Check whether the identical-field combination of each of the above n-record combinations is among the field combinations finally kept. If not, delete the record combination. If so, retain the combination and also delete, from the previously obtained (n-1)-record combinations with at least three identical fields, every (n-1)-record subset of the combination.
S7.5: When no n-record combination with at least three identical fields can be found, the detection process ends; otherwise return to S7.3.
In this embodiment, the detection ends at S7.2.
S8: Output the detection results, which may be presented graphically or output in a form that facilitates merging records.
S8.1: All combinations of 3 or more records with at least three identical fields retained in step S7 may be output, together with the probability that each combination consists of duplicate records and the probability that the records in the combination are duplicates of one another pairwise.
S8.2: The two-record combinations with at least three identical fields that were retained in step S7 and not output in S8.1 may be output, together with the probability that each combination consists of duplicate records.
In this embodiment, the record contents of the combination pairs (1,2), (3,5), (6,7) and (3,4) are output, together with the likelihood that these records are duplicates. (This result is obtained under the assumed intermediate values, so it may differ considerably from the actual correct result.)
Embodiment five
This embodiment is the data quality checking device for repeated data corresponding to the data quality checking method for repeated data described above.
Figure 17 is a structural diagram of the data quality checking device for repeated data of the present invention. The data quality checking device for repeated data includes:
a training set generation unit 2, which analyzes the data values of the training sample and generates a model training set.
Figure 18 is a structural diagram of the training set generation unit of the data quality checking device for repeated data of the present invention. The training set generation unit 2 includes:
a record number statistics module 22, which analyzes the data values of the training sample and counts the record numbers corresponding to each distinct value of each field;
a record number processing module 23, which processes the record numbers corresponding to each distinct value of each field and generates the model training set.
Generating the model training set converts the analysis of records into the analysis of identical fields of records, which improves subsequent processing speed.
Figure 19 is a structural diagram of the record number processing module of the data quality checking device for repeated data of the present invention. The record number processing module 23 includes:
a field-one two-record marking submodule 231, which counts the values of field one that correspond to two records; the two records corresponding to each such value form one combination pair, which is recorded with a field repeat flag added in field one;
a field-one multi-record marking submodule 232, which counts the values of field one that correspond to three or more records; the records corresponding to each such value are combined two by two into combination pairs, each of which is recorded with a field repeat flag added in field one;
a field-two marking submodule 233, which counts the values of field two that correspond to two or more records; the records corresponding to each such value are combined two by two into combination pairs; if a combination pair is identical to an already recorded combination pair, a field repeat flag is added in field two of the recorded pair; if it differs from every recorded pair, the pair is recorded with a field repeat flag added in field two;
a multi-field marking submodule 234, which handles the other fields in the same way as the field-two marking submodule 233; all combination pairs formed constitute the model training set.
The submodules 231 to 234 are only one possible arrangement for generating the model training set; this arrangement generates the model training set quickly while avoiding omitted or duplicated combination pairs.
a sample record marking unit 3, which analyzes each combination pair of the model training set and, manually or by an algorithm, marks the two records corresponding to the combination pair as duplicate or not duplicate; it is then decided whether to continue training: if so, the training sample is redefined and control returns to the training set generation unit 2; otherwise control passes to the sample combination screening unit 4.
Each combination pair in the model training set corresponds to two records. The combination pair is output and the actual data of the two records are compared to confirm whether they are identical; if they are identical, the pair is marked as duplicate, and if not, as not duplicate. Whether the two records of a pair are duplicates may be judged by the quality analyst by inspecting the specific data of the two records, or determined by calculating their similarity with an algorithm.
Whether to continue or repeat the training can then be decided according to the comparison results of the output combination pairs. If training is to continue, the training sample is redefined and control returns to the training set generation unit 2, after which it is determined whether the two records of each new combination pair are duplicates; the results of several rounds of training are combined to improve the accuracy of judgment in subsequent analysis. If not, control passes to the sample combination screening unit 4.
a sample combination screening unit 4, which calculates the probability that records are duplicates when one or more fields repeat, and screens out the field combinations with higher probability as the sample field combinations.
Figure 20 is a structural diagram of the sample combination screening unit of the data quality checking device for repeated data of the present invention. The sample combination screening unit 4 includes:
a single-field duplication calculation module 41, which takes the number of combination pairs in which a given field is flagged as repeated as the divisor, takes the number of those pairs whose records are also marked as duplicates as the dividend, and takes the quotient as the probability that records are duplicates when that field repeats.
First, for each field, count the number x of marked combination pairs in which the field repeats and the number y of those pairs that are marked as duplicates; then calculate the y/x value of each field, which is interpreted as the probability that records identical in that field are duplicates.
a multi-field duplication calculation module 42, which calculates, from the per-field probabilities, the probability that records are duplicates when multiple fields repeat.
The probability that records are duplicates when multiple fields repeat is calculated as:
$$p(1,2,\ldots,k) = \sum_{1 \le i \le k} p_i - \sum_{1 \le i < j \le k} p_i p_j + \sum_{1 \le i < j < m \le k} p_i p_j p_m - \cdots + (-1)^{k-1} p_1 p_2 \cdots p_k$$
In the formula, p(1,2,…,k) is the probability that records are duplicates when fields 1, 2, …, k repeat; that is, if fields 1, 2, …, k of two records repeat, the likelihood that the two records are duplicates is p(1,2,…,k). p1, p2, pi, pj, pm and pk are the probabilities that records are duplicates when fields 1, 2, i, j, m and k, respectively, repeat.
The idea of the formula is as follows. From the k fields whose probability is to be calculated, take one field: there are k ways to do so, and the value corresponding to each way is the single probability pi. Take two fields: there are $C_k^2$ ways, and the value corresponding to each way is the product pipj of two probabilities. Continue in this way up to taking all k fields: there are $C_k^k$ ways (a single way), and the corresponding value is the product p1p2…pk of the k probabilities. The coefficient applied to the sum of the values for each number of fields taken is determined by that number: +1 if an odd number of fields is taken, and -1 if an even number is taken. Adding these sums with their coefficients gives the final probability that records are duplicates when the k fields repeat.
Beneficial effect: in this way, the probability that records are duplicates when multiple fields repeat can be calculated quickly by the formula, which speeds up judgment, saves time and improves data quality checking efficiency; the formula is also simple and saves system resources.
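The alternating sum described above can be evaluated directly, as in the following Python sketch. The function names are hypothetical. The second function uses the closed form 1 - (1 - p_1)(1 - p_2)...(1 - p_k), which expands to the same alternating sum and avoids enumerating all subsets of fields; it is offered as an equivalent shortcut, not as the form used in the embodiments.

```python
from itertools import combinations
from math import prod

def combined_probability(probs):
    """Alternating sum over all non-empty selections of fields."""
    total = 0.0
    for size in range(1, len(probs) + 1):
        sign = 1 if size % 2 == 1 else -1  # +1 for odd sizes, -1 for even sizes
        total += sign * sum(prod(chosen) for chosen in combinations(probs, size))
    return total

def combined_probability_closed(probs):
    """Equivalent closed form: 1 minus the product of the complements."""
    return 1 - prod(1 - p for p in probs)
```

For three fields with per-field probabilities of 0.4 each, both functions give 1.2 - 0.48 + 0.064 = 0.784 up to floating-point rounding.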
a threshold screening module 43, which sets a threshold and selects the field combinations whose record duplication probability is greater than or equal to the threshold as the sample field combinations.
The multi-field duplication probabilities calculated by the multi-field duplication calculation module 42 need to be screened against a set threshold. The threshold is determined according to actual conditions: it may be set manually, determined by a computing device after rigorous calculation, or obtained from statistical comparison of a large amount of data.
The size of the threshold affects the accuracy of the data quality checking of repeated data in the present invention: the larger the threshold, the higher the accuracy of the data quality checking.
By means of the formula, the duplicate judgment of different records is converted into the calculation of duplication probabilities, which avoids judging every pair of records separately; only the probabilities of qualifying combination pairs need to be calculated, which greatly improves judgment efficiency.
a detection data analysis unit 5, which analyzes the values of the data to be detected and outputs the record numbers corresponding to each distinct value of each field.
This unit is similar to the training set generation unit 2; the only difference is that the training set generation unit 2 processes the training sample, whereas this unit processes the data to be detected.
Figure 21 is a structural diagram of the detection data analysis unit of the data quality checking device for repeated data of the present invention. The detection data analysis unit 5 includes:
a data similarity calculation module 51, which calculates the similarity between values in the same field of the data to be detected and treats similar values whose similarity reaches or exceeds a threshold as identical values;
a data record statistics module 52, which analyzes the data values of the data to be detected and counts the record numbers corresponding to each distinct value of each field.
a detection data screening unit 6, which performs duplicate detection on the analyzed data to be detected according to the sample field combinations and screens out all record combinations whose repeated fields match the sample field combinations.
This unit performs duplicate detection. According to the analysis result of the sample combination screening unit 4, it first checks whether the repeated fields of two records match the sample field combinations; it then generates three-record combinations from the two-record combinations that satisfy the condition and checks whether the repeated fields of the three records match the sample field combinations. This process is repeated until no record combination matching the sample field combinations can be found.
In this way, compared with typical duplicate detection, in which every pair of records must be compared, this device calculates the duplication probability of field combinations and thereby turns detection between records into detection of the record combinations that share the values of a given field combination. The duplication likelihood of every two records no longer has to be compared, which shortens the time and improves detection efficiency. Meanwhile, the device is not limited to detecting two records that are completely identical; it can also detect two records that are only partially identical, by calculating their duplication probability and judging against a threshold whether they are duplicates. In this device, the data quality analyst can define the condition for judging whether two records are identical.
In addition, this device can automatically add weights to different fields by means of the training sample, which provides a certain flexibility.
Figure 22 is a structural diagram of the detection data screening unit of the data quality checking device for repeated data of the present invention. The detection data screening unit 6 includes:
a field number confirmation module 61, which determines the minimum value N of the number of fields in each combination of the sample field combinations.
In general, the number of record combinations having repeated fields decreases as the number of repeated fields increases. It is therefore necessary to determine the minimum value N of the number of fields in the sample field combinations, so that record combinations with fewer than N repeated fields need not be searched. This reduces the number of record combinations to be searched and improves search efficiency.
For example, if every combination in the sample field combinations has at least 4 repeated fields, only record combinations with at least 4 repeated fields need to be searched.
a two-record combination detection module 62, which searches, among the two-record combinations of the data to be detected, for the record combinations in which at least N fields are identical, then detects and retains the record combinations whose identical fields form one of the sample field combinations.
N is the minimum number of fields in any of the sample field combinations. If the number of identical fields of a record combination of the data to be detected is less than N, that record combination certainly does not fall within the sample field combinations. Searching only for record combinations with at least N identical fields therefore reduces search time and improves search efficiency.
a multi-record combination search module 63, which, in the retained record combinations, searches for n-record combinations with at least N identical fields on the basis of the known (n-1)-record combinations with at least N identical fields; if none is found, the process ends.
In this module, the following conditions must be satisfied when n-record combinations with at least N identical fields are searched for on the basis of the known (n-1)-record combinations with at least N identical fields:
1) an n-record combination is formed by merging two (n-1)-record combinations, and those two (n-1)-record combinations have n-2 records in common;
2) every (n-1)-record subset of the newly formed n-record combination is among the (n-1)-record combinations with at least N identical fields.
a multi-record combination detection module 64, which detects and retains the n-record combinations that fall within the sample field combinations, and at the same time deletes, from the (n-1)-record combinations, all (n-1)-record subsets of the retained n-record combinations; control then returns to the multi-record combination search module 63.
Through the modules 61 to 64, all possible combinations can be covered by step-by-step calculation, which avoids omitting records that may be duplicates.
Embodiment six
This embodiment is based on the data quality checking device for repeated data described above and differs from it as follows. As shown in Figure 23, which is a structural diagram of embodiment six of the data quality checking device for repeated data of the present invention, the data quality checking device further includes:
a detection result output unit 7, which outputs the retained record combinations and the probability that each record combination is a duplicate; the detection result output unit 7 follows the detection data screening unit 6.
The output of this unit may take different forms: it may be presented graphically, or the detection results may be output in a form that facilitates merging records. All retained record combinations and their duplication probabilities may be output, or only part of the retained record combinations and their duplication probabilities.
Embodiment seven
This embodiment is based on the data quality checking device for repeated data described above and differs from it as follows. As shown in Figure 24, which is a structural diagram of embodiment seven of the data quality checking device for repeated data of the present invention, the training set generation unit 2 further includes:
a sample similarity calculation module 21, which calculates the similarity between values in the same field of the training sample and treats similar values whose similarity reaches or exceeds a threshold as identical values; the sample similarity calculation module 21 precedes the record number statistics module 22.
The data in the training sample may vary slightly because of errors, so that the values of the same field in two records are very similar but not identical. Adding this module eliminates such errors and improves the accuracy of judging duplicate data.
Embodiment eight
This embodiment is based on the data quality checking device for repeated data described above and differs from it as follows. As shown in Figure 25, which is a structural diagram of embodiment eight of the data quality checking device for repeated data of the present invention, the data quality checking device further includes:
a training sample extraction unit 1, which extracts a training sample from the source of the data to be detected.
Because the training sample and the data to be detected come from the same source, extracting the training sample from the source of the data to be detected improves the accuracy of judging duplicate data.
The foregoing describes only preferred embodiments of the present invention, which are illustrative rather than restrictive. Those skilled in the art will understand that many changes, modifications and even equivalents may be made within the spirit and scope defined by the claims of the present invention, all of which fall within the protection scope of the present invention.