CN105488212A - Data quality detection method and device of duplicated data - Google Patents

Publication number
CN105488212A
Authority
CN
China
Prior art keywords
record
combination
field
data
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510925893.5A
Other languages
Chinese (zh)
Other versions
CN105488212B (en)
Inventor
许飞月
李青海
简宋全
侯大勇
邹立斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Jing Dian Computing Machine Science And Technology Ltd
Original Assignee
Guangzhou Jing Dian Computing Machine Science And Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Jing Dian Computing Machine Science And Technology Ltd
Priority to CN201510925893.5A
Publication of CN105488212A
Application granted
Publication of CN105488212B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Quality & Reliability (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a data quality detection method and device for duplicate data. The method comprises: step b, generating a model training set; step c, examining each combination pair in the model training set and marking it as record-duplicated or record-not-duplicated; step d, calculating the probability of record duplication and selecting the field combinations with relatively high probability as sample field combinations; step e, analyzing the values of the data to be detected; and step f, performing duplicate detection on the data and selecting the record combinations whose duplicated fields all satisfy the sample field combinations. The device comprises a training set generating unit, a sample record duplication marking unit, a sample combination screening unit, a detection data analyzing unit and a detection data screening unit, corresponding to the respective steps. Because duplication probabilities are calculated for field combinations, the duplication probability need not be compared for every pair of records, which shortens the time and improves detection efficiency; in addition, cases where two records are only partly identical can also be detected.

Description

A data quality detection method and device for duplicate data
Technical field
The present invention relates to the technical field of data quality monitoring, and in particular to a data quality detection method and device for duplicate data.
Background
The rapid development of information technology has gradually made data one of the most important resources through which enterprises realize business value. As data volumes keep growing, however, data quality problems follow. Missing, erroneous and inconsistent data hinder an enterprise's use of its data; in serious cases they can lead the enterprise to make wrong decisions, lose significant value and even suffer a crisis of trust.
To deal with such dirty data, many data quality detection and cleaning schemes have emerged. Duplicate data is one of the data quality problems that is comparatively hard to detect, because the duplication problem enterprises face today covers not only records that are completely identical but also records that are partially identical. For example, a social networking site may have tens of millions of users, some of whom have registered more than once, and the repeated registrations may differ from one another only slightly in a few pieces of information. Identifying such duplicate user information is essential to maintaining the quality of the site.
Among the more representative duplicate detection schemes at present, some compute a unique hash code and check code from the content of each record and then judge whether records are duplicates by whether their hash codes and check codes are identical. Such schemes are accurate and efficient, but they apply only when records are completely identical. Other schemes train a duplicate detection model by machine learning. They are flexible, and a single method is not restricted to one duplicate detection scenario, but the duplication likelihood must be computed for every pair of records, so efficiency is low and accuracy still needs improvement.
In view of the above defects, the inventors arrived at the present data quality detection method and device for duplicate data after extended research and testing.
Summary of the invention
The object of the present invention is to provide a data quality detection method and device for duplicate data that overcome the above technical defects and solve the problem of how to detect partially duplicated and completely duplicated data accurately and quickly.
To achieve this object, the present invention first provides a data quality detection method for duplicate data, comprising:
Step b, analyzing the data values of a training sample comprising a plurality of records, and generating a model training set;
Step c, examining each combination pair in the model training set, and marking the two records corresponding to the combination pair, manually or by an algorithm, as record-duplicated or record-not-duplicated; then choosing whether to continue training: if so, redefining the training sample and returning to step b; otherwise, proceeding to step d;
Step d, calculating, for one or more fields, the probability that records are duplicates when the field or fields are duplicated, and selecting the field combinations with higher probability as sample field combinations;
Step e, analyzing the values of the data to be detected, and outputting the record numbers corresponding to each distinct value of each field;
Step f, performing duplicate detection on the analyzed data to be detected according to the sample field combinations, and selecting the record combinations whose duplicated fields all satisfy the sample field combinations.
Preferably, the data quality detection method further comprises:
Step a, extracting the training sample from the source of the data to be detected; step a precedes step b.
Preferably, the data quality detection method further comprises:
Step g, outputting the retained record combinations and the probability that each record combination is a duplicate; step g follows step f.
Preferably, step b comprises:
Step b2, analyzing the data values of the training sample, and collecting the record numbers corresponding to each distinct value of each field;
Step b3, processing the record numbers corresponding to each distinct value of each field, and generating the model training set.
Preferably, step b3 comprises:
Step b31, counting the values of field one that correspond to two records; the two records corresponding to each such value form one combination pair, which is recorded with a field duplication mark added at field one;
Step b32, counting the values of field one that correspond to three or more records; the records corresponding to each such value are combined two by two into combination pairs, which are recorded with a field duplication mark added at field one;
Step b33, counting the values of field two that correspond to two or more records; the records corresponding to each such value are combined two by two into combination pairs; if a combination pair is identical to one already recorded, a field duplication mark is added at field two of the recorded combination pair; if it differs from all recorded combination pairs, it is recorded with a field duplication mark added at field two;
Step b34, processing the remaining fields as in step b33; all the combination pairs thus formed constitute the model training set.
Preferably, step d comprises:
Step d1, for a given field, taking as divisor the number of combination pairs marked field-duplicated at that field, taking as dividend the number of those combination pairs that are also marked record-duplicated, and taking the quotient as the probability that records are duplicates when this field is duplicated; this probability is calculated for each field;
Step d2, from the single-field probabilities, calculating the probability that records are duplicates when a plurality of fields are duplicated;
Step d3, setting a threshold, and selecting as sample field combinations the field combinations whose record duplication probability is greater than or equal to the threshold.
Preferably, the probability that records are duplicates when a plurality of fields are duplicated is calculated as:
$$p(1,2,\ldots,k) = \sum_{i} p_i + (-1)^{1} \sum_{i_1<i_2} p_{i_1} p_{i_2} + \cdots + (-1)^{k-1} \sum_{i_1<i_2<\cdots<i_k} p_{i_1} p_{i_2} \cdots p_{i_k}$$
In the formula, $p(1,2,\ldots,k)$ is the probability that records are duplicates when fields 1, 2, ..., k are duplicated; $p_i$, $p_{i_1}$, $p_{i_2}$ and $p_{i_k}$ are the probabilities that records are duplicates when field $i$, $i_1$, $i_2$ or $i_k$ respectively is duplicated; $i_1$ and $i_2$ denote the sequence numbers of any two of the fields 1, 2, ..., k; $i_k$ denotes the sequence number of field k.
Preferably, step f comprises:
Step f1, determining the minimum number N of fields among the sample field combinations;
Step f2, searching the data to be detected for record combinations of two records that have at least N identical fields, and detecting and retaining those record combinations that fall within the sample field combinations;
Step f3, starting from the known record combinations of n-1 records with at least N identical fields among the retained record combinations, searching for record combinations of n records with at least N identical fields; if none is found, ending;
Step f4, detecting and retaining the record combinations of n records that fall within the sample field combinations, and at the same time deleting, from the record combinations of n-1 records, all (n-1)-record subsets of the retained n-record combinations; returning to step f3.
Preferably, in step f3, the search must satisfy the following conditions:
Each record combination of n records is formed by joining two record combinations of n-1 records that have n-2 records in common;
Every (n-1)-record subset of a newly formed record combination of n records is among the record combinations of n-1 records with at least N identical fields.
The invention next provides a data quality detection device for duplicate data corresponding to the data quality detection method, comprising:
A training set generating unit, which analyzes the data values of a training sample comprising a plurality of records and generates a model training set;
A sample record duplication marking unit, which examines each combination pair in the model training set and marks the two records corresponding to the combination pair, manually or by an algorithm, as record-duplicated or record-not-duplicated; it then chooses whether to continue training: if so, the training sample is redefined and control returns to the training set generating unit; otherwise control passes to the sample combination screening unit;
A sample combination screening unit, which calculates, for one or more fields, the probability that records are duplicates when the field or fields are duplicated, and selects the field combinations with higher probability as sample field combinations;
A detection data analyzing unit, which analyzes the values of the data to be detected and outputs the record numbers corresponding to each distinct value of each field;
A detection data screening unit, which performs duplicate detection on the analyzed data to be detected according to the sample field combinations and selects the record combinations whose duplicated fields all satisfy the sample field combinations.
Compared with the prior art, the beneficial effects of the present invention are as follows. A data quality detection method and device for duplicate data are provided. Whereas general duplicate detection methods must compare records pair by pair, this method calculates duplication likelihoods for field combinations and converts detection between records into detection of records that coincide on the corresponding field combinations, so the duplication likelihood of every pair of records need not be compared, which shortens the time and improves detection efficiency. The method is also not limited to detecting two completely identical records: it can detect two partially identical records, calculating their duplication probability and deciding by a threshold whether they are duplicates. The data quality analyst can define the conditions for judging whether two records are the same. Through the selection of the training sample, the method automatically weights the different fields, which provides a degree of flexibility. The formula allows the probability that records are duplicates when multiple fields are duplicated to be calculated quickly, which speeds up judgment, saves time and improves detection efficiency; the formula is also simple and saves system resources. Once the model training set has been generated, analysis of records is converted into analysis of the fields that records have in common, which speeds up subsequent processing. Errors can be eliminated and the accuracy of judging duplicate data improved. Finally, because the training sample is extracted from the source of the data to be detected and is therefore homologous with it, the accuracy of judging duplicate data can be improved.
Brief description of the drawings
Fig. 1 is a flowchart of the data quality detection method for duplicate data of the present invention;
Fig. 2 is a flowchart of step b of the data quality detection method of the present invention;
Fig. 3 is a flowchart of step b3 of the data quality detection method of the present invention;
Fig. 4 is a flowchart of step d of the data quality detection method of the present invention;
Fig. 5 is probability calculation schematic diagram one of the data quality detection method of the present invention;
Fig. 6 is probability calculation schematic diagram two of the data quality detection method of the present invention;
Fig. 7 is a flowchart of step e of the data quality detection method of the present invention;
Fig. 8 is a flowchart of step f of the data quality detection method of the present invention;
Fig. 9 is a flowchart of embodiment one of the data quality detection method of the present invention;
Fig. 10 is a flowchart of embodiment two of the data quality detection method of the present invention;
Fig. 11 is a flowchart of embodiment three of the data quality detection method of the present invention;
Fig. 12 is a partial table of the data to be detected in the example of the data quality detection method of the present invention;
Fig. 13 is a partial table of the record numbers corresponding to distinct values in the example of the data quality detection method of the present invention;
Fig. 14 is a partial table of the field duplication marks of combination pairs in the example of the data quality detection method of the present invention;
Fig. 15 is a partial table of the record duplication marks of combination pairs in the example of the data quality detection method of the present invention;
Fig. 16 shows the field combinations retained in the example of the data quality detection method of the present invention;
Fig. 17 is a structural diagram of the data quality detection device for duplicate data of the present invention;
Fig. 18 is a structural diagram of the training set generating unit of the data quality detection device of the present invention;
Fig. 19 is a structural diagram of the record number processing module of the data quality detection device of the present invention;
Fig. 20 is a structural diagram of the sample combination screening unit of the data quality detection device of the present invention;
Fig. 21 is a structural diagram of the detection data analyzing unit of the data quality detection device of the present invention;
Fig. 22 is a structural diagram of the detection data screening unit of the data quality detection device of the present invention;
Fig. 23 is a structural diagram of embodiment six of the data quality detection device of the present invention;
Fig. 24 is a structural diagram of embodiment seven of the data quality detection device of the present invention;
Fig. 25 is a structural diagram of embodiment eight of the data quality detection device of the present invention.
Detailed description of the embodiments
The above and other technical features and advantages of the present invention are described in more detail below with reference to the accompanying drawings.
As shown in Fig. 1, which is a flowchart of the data quality detection method for duplicate data of the present invention, the method comprises:
Step b, analyzing the data values of the training sample, and generating a model training set;
The training sample contains a plurality of records. Each record has a corresponding number, its record number; record numbers are arranged in order and increase successively. Every record is divided into a plurality of fields: field 1, field 2, field 3, field 4, and so on. Each field thus has one value in every record, so a field has as many values as there are records (some of these values are identical and some differ), and the numbering of a field's values corresponds to the numbering of the records. For example, the first value of field 1 is the value of field 1 in the first record.
The training sample may be written by a data analyst according to the specific situation, or extracted from the data to be detected.
As shown in Fig. 2, which is a flowchart of step b of the data quality detection method of the present invention, step b comprises:
Step b2, analyzing the data values of the training sample, and collecting the record numbers corresponding to each distinct value of each field;
A field has many values, some identical and some different. Identical values are merged and the record numbers of the merged values are attached, so that the field ends up with a set of distinct values, each followed by at least one record number;
All fields are processed in this way, yielding the record numbers corresponding to each distinct value of each field.
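Step b2 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `value_index`, the dictionary-based record layout and the sample field names are all assumptions made for the example.

```python
from collections import defaultdict

def value_index(records, fields):
    """Map each field to {distinct value: [record numbers]} (sketch of step b2)."""
    index = {field: defaultdict(list) for field in fields}
    # Record numbers start at 1 and increase in order, as described in the text.
    for record_no, record in enumerate(records, start=1):
        for field in fields:
            index[field][record[field]].append(record_no)
    return {field: dict(values) for field, values in index.items()}

# Hypothetical training sample; the field names are illustrative only.
sample = [
    {"name": "Li Wei", "phone": "13800000001", "city": "Guangzhou"},
    {"name": "Li Wei", "phone": "13800000001", "city": "Shenzhen"},
    {"name": "Wang Fang", "phone": "13800000002", "city": "Guangzhou"},
]
index = value_index(sample, ["name", "phone", "city"])
print(index["name"])
```

Each distinct value ends up followed by the numbers of the records that carry it, which is the structure the next step consumes.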
Step b3, processing the record numbers corresponding to each distinct value of each field, and generating the model training set;
The model training set consists of every pair of records that share at least one duplicated field, together with the marks of their duplicated fields. Using the record numbers collected above for each distinct value of each field: if a value corresponds to two records, those two records form one combination pair, and a field duplication mark is added at this field of the combination pair; if a value corresponds to three or more records, the records are combined two by two into combination pairs, and a field duplication mark is added at the corresponding field of each combination pair. Identical combination pairs are merged, the field duplication marks of the merged combination pair being all the marks carried before merging, which finally yields the model training set.
If a value corresponds to two records, one combination pair is obtained; if it corresponds to three records, combining them two by two gives three combination pairs; if it corresponds to four records, six combination pairs; in general, if a value corresponds to N records, combining them two by two gives $\binom{N}{2} = N(N-1)/2$ combination pairs.
Once the model training set has been generated, analysis of records is converted into analysis of the fields that records have in common, which can speed up subsequent processing.
As shown in Fig. 3, which is a flowchart of step b3 of the data quality detection method of the present invention, the concrete steps of step b3 may be:
Step b31, counting the values of field one that correspond to two records; the two records corresponding to each such value form one combination pair, which is recorded with a field duplication mark added at field one;
Step b32, counting the values of field one that correspond to three or more records; the records corresponding to each such value are combined two by two into combination pairs, which are recorded with a field duplication mark added at field one;
Step b33, counting the values of field two that correspond to two or more records; the records corresponding to each such value are combined two by two into combination pairs; if a combination pair is identical to one already recorded, a field duplication mark is added at field two of the recorded combination pair; if it differs from all recorded combination pairs, it is recorded with a field duplication mark added at field two;
Step b34, processing the remaining fields as in step b33; all the combination pairs thus formed constitute the model training set.
Steps b31 to b34 are only one way of generating the model training set. This way generates the model training set quickly while avoiding omitting or repeating any combination pair.
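The pair generation of steps b31 to b34 can be sketched as below. The sketch assumes the per-field value index is already available as nested dictionaries, and represents the model training set as a mapping from a record pair to the set of fields marked as duplicated; both representations, and the name `build_training_set`, are assumptions for illustration.

```python
from itertools import combinations

def build_training_set(index):
    """Return {(record_a, record_b): {duplicated fields}} from a value index."""
    pairs = {}
    for field, values in index.items():
        for record_nos in values.values():
            # Values held by a single record yield no pair; N records yield
            # N*(N-1)/2 pairs. A pair seen before only gains a new field mark.
            for pair in combinations(sorted(record_nos), 2):
                pairs.setdefault(pair, set()).add(field)
    return pairs

# Hypothetical value index: field -> distinct value -> record numbers.
index = {
    "name": {"Li Wei": [1, 2], "Wang Fang": [3]},
    "phone": {"13800000001": [1, 2], "13800000002": [3]},
    "city": {"Guangzhou": [1, 3], "Shenzhen": [2]},
}
training_set = build_training_set(index)
print(training_set)
```

Keying the mapping by the sorted record pair is what prevents a combination pair from being recorded twice when it reappears under a later field.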
Step c, examining each combination pair in the model training set, and marking the two records corresponding to the combination pair, manually or by an algorithm, as record-duplicated or record-not-duplicated; then choosing whether to continue training: if so, redefining the training sample and returning to step b; otherwise, proceeding to step d;
Each combination pair in the model training set corresponds to two records. The combination pair is output and the actual data of the two records are compared to confirm whether the records are the same: if so, the pair is marked record-duplicated; if not, record-not-duplicated. Whether the two records of a combination pair are duplicates may be judged by a quality analyst inspecting the concrete data of the two records, or determined by an algorithm that computes their similarity.
Whether to continue or repeat training can then be decided from the comparison results of the output combination pairs. If further training is needed, the training sample is redefined and the method returns to step b, after which it is determined whether the two records of each new combination pair are duplicates; the results of the several training rounds are combined in the subsequent analysis to improve the accuracy of the judgment. If not, step d is carried out.
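Where the marking in step c is done by an algorithm rather than by hand, one possible sketch is shown below. The text does not specify the similarity measure, so the use of Python's `difflib.SequenceMatcher`, the averaging over fields, and the 0.8 cut-off are all assumptions chosen only to make the idea concrete.

```python
from difflib import SequenceMatcher

def label_pair(record_a, record_b, fields, threshold=0.8):
    """Mark a combination pair as record-duplicated (True) or not (False).

    The mean per-field string similarity is compared with a hypothetical
    threshold; a real deployment would choose its own measure and cut-off.
    """
    scores = [
        SequenceMatcher(None, str(record_a[f]), str(record_b[f])).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores) >= threshold

# Hypothetical records for illustration.
a = {"name": "Li Wei", "phone": "13800000001"}
b = {"name": "Li Wei", "phone": "13800000001"}
c = {"name": "Zhang", "phone": "999"}
print(label_pair(a, b, ["name", "phone"]), label_pair(a, c, ["name", "phone"]))
```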
Step d, calculating, for one or more fields, the probability that records are duplicates when the field or fields are duplicated, and selecting the field combinations with higher probability as sample field combinations;
As shown in Fig. 4, which is a flowchart of step d of the data quality detection method of the present invention, step d comprises:
Step d1, for a given field, taking as divisor the number of combination pairs marked field-duplicated at that field, taking as dividend the number of those combination pairs that are also marked record-duplicated, and taking the quotient as the probability that records are duplicates when this field is duplicated; this probability is calculated for each field;
First, for each field, count the number x of combination pairs in which the field is marked duplicated and the number y of those pairs that are marked record-duplicated; then calculate y/x for each field, interpreted as the probability that two records with the same value in this field are duplicates.
Each combination pair in the model training set has a plurality of fields (field one, field two, field three, field four, and so on) and is duplicated in at least one of them. Conversely, each field spans a plurality of combination pairs and is duplicated in at least one of them.
Each combination pair corresponds to two records and carries either a record-duplicated mark or a record-not-duplicated mark. Each combination pair therefore has at least one duplicated field together with one record-level mark.
Each field thus has a number of combination pairs marked field-duplicated at that field, some of which are marked record-duplicated. The latter number divided by the former is the likelihood (probability) that records are duplicates when this field is duplicated.
For example, if field one is duplicated in x of all the combination pairs, and y of these x pairs are marked record-duplicated, then the probability that records are duplicates when this field is duplicated is y/x.
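The y/x calculation of step d1 can be sketched as follows, assuming the model training set maps each record pair to its set of duplicated fields and the step c marks are held in a separate dictionary. These structures, the function name and the sample data are hypothetical.

```python
def field_probabilities(training_set, labels):
    """For each field return y/x: of the x pairs duplicated at that field,
    the fraction y that were marked record-duplicated."""
    counts = {}  # field -> [x, y]
    for pair, fields in training_set.items():
        for field in fields:
            x_y = counts.setdefault(field, [0, 0])
            x_y[0] += 1            # pair is field-duplicated at this field
            if labels[pair]:
                x_y[1] += 1        # pair is also record-duplicated
    return {field: y / x for field, (x, y) in counts.items()}

# Hypothetical training set and step c marks.
training_set = {
    (1, 2): {"name", "phone"},
    (1, 3): {"city"},
    (2, 4): {"name"},
    (3, 4): {"city", "name"},
}
labels = {(1, 2): True, (1, 3): False, (2, 4): False, (3, 4): True}
probs = field_probabilities(training_set, labels)
print(probs)
```

In this sample "name" is duplicated in three pairs of which two are record duplicates, so its probability is 2/3.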
Step d2, from the single-field probabilities, calculating the probability that records are duplicates when a plurality of fields are duplicated;
The probability that records are duplicates when a plurality of fields are duplicated is calculated as:
$$p(1,2,\ldots,k) = \sum_{i} p_i + (-1)^{1} \sum_{i_1<i_2} p_{i_1} p_{i_2} + \cdots + (-1)^{k-1} \sum_{i_1<i_2<\cdots<i_k} p_{i_1} p_{i_2} \cdots p_{i_k}$$
In the formula, $p(1,2,\ldots,k)$ is the probability that records are duplicates when fields 1, 2, ..., k are duplicated; that is, if fields 1, 2, ..., k are duplicated in two records, the likelihood that the two records are duplicates is $p(1,2,\ldots,k)$. $p_i$, $p_{i_1}$, $p_{i_2}$ and $p_{i_k}$ are the probabilities that records are duplicates when field $i$, $i_1$, $i_2$ or $i_k$ respectively is duplicated; $i_1$ and $i_2$ denote the sequence numbers of any two of the fields 1, 2, ..., k; $i_k$ denotes the sequence number of the k-th of the fields 1, 2, ..., k.
The idea of the formula is as follows. Of the k fields whose probability is to be calculated, taking one gives k choices, each contributing the single probability $p_i$; taking two gives $\binom{k}{2}$ choices, each contributing the product $p_{i_1} p_{i_2}$ of two probabilities; and so on, until taking all k gives $\binom{k}{k} = 1$ choice, contributing the product $p_{i_1} p_{i_2} \cdots p_{i_k}$ of k probabilities. The coefficient applied to the sum over the choices of a given size is determined by the number of fields taken: +1 for an odd number and -1 for an even number. Adding these signed sums gives the final probability that records are duplicates when the k fields are duplicated.
When k is 2,
$$p(1,2) = p_1 + p_2 - p_1 p_2$$
Fig. 5 is probability calculation schematic diagram one of the data quality detection method of the present invention. In it, $p_1 p_2$ is the overlapping region of circles $p_1$ and $p_2$, which must be subtracted to obtain the total area $p(1,2)$.
When k is 3,
$$p(1,2,3) = p_1 + p_2 + p_3 - p_1 p_2 - p_1 p_3 - p_2 p_3 + p_1 p_2 p_3$$
Fig. 6 is probability calculation schematic diagram two of the data quality detection method of the present invention. In it, $p_1 p_2$ is the overlapping region of circles $p_1$ and $p_2$, $p_1 p_3$ that of circles $p_1$ and $p_3$, and $p_2 p_3$ that of circles $p_2$ and $p_3$; these must be subtracted. $p_1 p_2 p_3$, the region common to circles $p_1$, $p_2$ and $p_3$, has then been subtracted once too often and must be added back to obtain the total area $p(1,2,3)$.
Beneficial effect: with this formula, the probability that records are duplicates when multiple fields are duplicated can be calculated quickly, which speeds up judgment, saves time and improves the efficiency of data quality detection; the formula is also simple and saves system resources.
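The alternating-sum formula of step d2 can be evaluated directly, as in the sketch below. The function names are illustrative. The second function uses the product form $1 - \prod_i (1 - p_i)$, which is algebraically identical to the alternating sum (expanding the product yields exactly the same signed terms) and avoids enumerating all $2^k - 1$ subsets.

```python
from itertools import combinations
from math import prod

def combined_probability(probs):
    """Evaluate the alternating sum over all non-empty subsets of fields."""
    total = 0.0
    for size in range(1, len(probs) + 1):
        sign = 1 if size % 2 == 1 else -1   # +1 for odd sizes, -1 for even
        total += sign * sum(prod(c) for c in combinations(probs, size))
    return total

def combined_probability_product_form(probs):
    """Equivalent closed form: 1 - (1 - p_1)(1 - p_2)...(1 - p_k)."""
    return 1.0 - prod(1.0 - p for p in probs)

print(combined_probability([0.5, 0.5]))        # k = 2: p1 + p2 - p1*p2
print(combined_probability([0.5, 0.4, 0.2]))   # k = 3
```

For k = 2 with both probabilities equal to 0.5 the result is 0.5 + 0.5 - 0.25 = 0.75, matching the k = 2 case given in the text.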
Step d3: set a threshold, and select the field combinations whose record repetition probability is greater than or equal to the threshold as the sample field combinations.
The multi-field probabilities calculated in step d2 must be screened against a threshold. The threshold may be set manually according to actual conditions, or it may be determined by a computing device through rigorous calculation or derived from a statistical comparison of a large amount of data.
The size of the threshold affects the accuracy of the data quality detection of duplicate data according to the present invention: the larger the threshold, the higher the accuracy of the detection.
Suppose the data to be tested has n fields; a field combination then contains k fields, where 1<k≤n. After the threshold is set, the field combinations whose repetition probability is greater than or equal to the threshold are retained. These retained field combinations serve as the sample field combinations for the subsequent duplicate detection.
With the formula, judging whether different records repeat is converted into calculating a repetition probability. This avoids judging every pair of records separately: the probability only needs to be calculated for the valid combination pairs, which greatly improves the efficiency of the judgment.
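The screening in step d3 can be sketched as below. This is an illustrative sketch under stated assumptions: `screen_field_combinations` is a hypothetical name, the per-field probabilities and the threshold are the illustrative values that Embodiment four assumes, and the closed form 1 - prod(1 - p_i) is used as an algebraic equivalent of the inclusion-exclusion formula.

```python
from itertools import combinations
from math import prod

def screen_field_combinations(field_probs, threshold):
    """Return {field combination: probability} for combinations meeting the threshold."""
    fields = sorted(field_probs)
    kept = {}
    for k in range(2, len(fields) + 1):          # 1 < k <= n
        for combo in combinations(fields, k):
            # Closed form of the inclusion-exclusion sum over the chosen fields.
            p = 1.0 - prod(1.0 - field_probs[f] for f in combo)
            if p >= threshold:
                kept[combo] = p
    return kept

# Assumed values: per-field probabilities and threshold from Embodiment four.
field_probs = {"Col1": 0.4, "Col2": 0.4, "Col3": 0.4, "Col4": 0.3, "Col5": 0.3}
sample_combos = screen_field_combinations(field_probs, threshold=0.75)
min_fields = min(len(combo) for combo in sample_combos)  # the N used later in step f1
```

Because adding a field can only raise the combined probability, every superset of a retained combination is also retained; a real implementation could exploit this to skip work, which the sketch does not attempt.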
Step e: analyze the values of the data to be tested, and output the record numbers corresponding to each distinct value of each field;
This step is similar to step b; the only difference is that step b processes the training sample, whereas this step processes the data to be tested.
The data to be tested contains many records. Each record has a corresponding number, the record number; record numbers are arranged in order and increase successively. Each record is divided into multiple fields: field 1, field 2, field 3, field 4, and so on. The same field thus has one value in every record, so each field has as many values as there are records (some of these values are identical and some differ), and the numbering of a field's values corresponds to the numbering of the records. For example, the first value of field 1 is the value of field 1 in the first record.
Figure 7 is the flowchart of step e in the data quality detection method for duplicate data of the present invention; step e comprises:
Step e1: calculate the similarity between values in the same field of the data to be tested, and treat similar values whose similarity reaches or exceeds a threshold as the same value;
Here, an algorithm is used to calculate the similarity between values in each field that are very close to one another, and the data quality analyst defines a threshold that determines the similarity level at which such values are treated as the same value.
The similarity may be calculated with the Levenshtein algorithm, the longest common subsequence algorithm, or a similar algorithm; the specific algorithm can be chosen according to actual needs.
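As an illustration of step e1, the sketch below computes the Levenshtein distance and turns it into a similarity score. The normalisation by the longer string's length, the default threshold of 0.75 and the names `levenshtein`, `similarity` and `treat_as_same` are assumptions made for the example; the text above does not fix any of them.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; this normalisation is an assumption of the sketch."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def treat_as_same(a: str, b: str, threshold: float = 0.75) -> bool:
    """True when two field values are close enough to be merged into one value."""
    return similarity(a, b) >= threshold
```

With these assumptions, the values 1aaaa and 1aaab used later in Embodiment four differ by one substitution out of five characters and would be merged, while unrelated values such as 1aaaa and 1bbbb would not.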
Step e2: analyze the data values of the data to be tested, and count the record numbers corresponding to each distinct value of each field;
The same field has multiple values, some identical and some different. Identical values are merged and their record numbers are collected together, so that each field has a set of distinct values, each followed by at least one record number;
All fields are processed in this way, giving the record numbers corresponding to each distinct value of each field.
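The statistics of step e2 amount to building an index from each field value to the record numbers that hold it. A minimal sketch follows; the function name `index_record_numbers` and the sample records are hypothetical, and values judged similar in step e1 are assumed to have already been replaced by a single representative value.

```python
from collections import defaultdict

def index_record_numbers(records):
    """Map field -> value -> list of record numbers (ascending) holding that value."""
    index = defaultdict(lambda: defaultdict(list))
    for record_no in sorted(records):
        for field, value in records[record_no].items():
            index[field][value].append(record_no)
    return index

# Made-up records keyed by record number, for illustration only.
records = {
    1: {"Col1": "1aaaa", "Col2": "2aaaa"},
    2: {"Col1": "1aaaa", "Col2": "2aaaa"},
    3: {"Col1": "1bbbb", "Col2": "2bbbb"},
    4: {"Col1": "1cccc", "Col2": "2bbbb"},
    5: {"Col1": "1bbbb", "Col2": "2bbbb"},
}
index = index_record_numbers(records)
```

Only values that map to two or more record numbers can contribute to a duplicate, so later steps need to look at those entries alone.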
Step f: perform duplicate detection on the analyzed data to be tested according to the sample field combinations, and filter out all record combinations whose repeated fields match a sample field combination;
This step performs the duplicate detection. First, based on the analysis result of step d, it checks whether the repeated fields of each pair of records match a sample field combination. Then, from the two-record combinations that satisfy the condition, it generates three-record combinations and checks whether the repeated fields of the three records match a sample field combination. This process is repeated until no further record combination matching a sample field combination can be found.
Compared with general duplicate detection methods, in which every pair of records must be checked, this method calculates the repetition probability of field combinations and so turns detection between records into detection of record combinations that share the corresponding field combinations. It is unnecessary to compare the repetition probability of every two records, which shortens the time and improves detection efficiency. Moreover, the method is not limited to detecting two completely identical records; it can also detect two partially identical records, by calculating their repetition probability and deciding against the threshold whether they repeat. In this method, the data quality analyst can define the conditions for judging whether two records are the same.
In addition, through the selection of the training sample, the method automatically assigns weights to different fields, which provides a degree of flexibility.
Figure 8 is the flowchart of step f in the data quality detection method for duplicate data of the present invention; step f comprises:
Step f1: determine the minimum value N of the number of fields in any of the sample field combinations;
In general, the number of record combinations with repeated fields decreases as the number of repeated fields increases. Once the minimum number of fields N over the sample field combinations has been determined, record combinations with fewer than N repeated fields need not be searched, which reduces the number of record combinations to search and improves search efficiency.
For example, if the smallest sample field combination has 4 repeated fields, only record combinations with at least 4 repeated fields need to be searched.
Step f2: search the data to be tested for two-record combinations in which at least N fields are identical, check them, and retain those whose identical fields form one of the sample field combinations;
Because N is the minimum number of fields in any sample field combination, a record combination of the data to be tested whose number of identical fields is less than N can never match a sample field combination. Searching only for record combinations with at least N identical fields therefore reduces search time and improves search efficiency.
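One way to sketch step f2, assuming the value-to-record-number index from step e is available as a nested dictionary. The names `pairs_with_min_shared_fields` and `keep_sample_matches`, the tiny index and the sample field combinations are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def pairs_with_min_shared_fields(index, min_fields):
    """Return {(r1, r2): frozenset of identical fields} for pairs sharing >= min_fields fields."""
    shared = defaultdict(set)
    for field, values in index.items():
        for record_nos in values.values():
            # Only records holding the same value in this field are ever paired.
            for pair in combinations(sorted(record_nos), 2):
                shared[pair].add(field)
    return {pair: frozenset(f) for pair, f in shared.items() if len(f) >= min_fields}

def keep_sample_matches(candidates, sample_combos):
    """Retain pairs whose set of identical fields is one of the sample field combinations."""
    return {pair: f for pair, f in candidates.items() if f in sample_combos}

# Made-up index (field -> value -> record numbers) and sample field combinations.
index = {
    "Col1": {"a": [1, 2], "b": [3, 5]},
    "Col2": {"x": [1, 2], "y": [3, 4, 5]},
    "Col3": {"m": [1, 2, 3], "n": [4, 5]},
}
sample_combos = {frozenset({"Col1", "Col2"}), frozenset({"Col1", "Col2", "Col3"})}
candidates = pairs_with_min_shared_fields(index, min_fields=2)
kept = keep_sample_matches(candidates, sample_combos)
```

In this made-up data the pair (4, 5) shares two fields and so passes the count check, but its identical fields (Col2, Col3) are not a sample field combination, so it is dropped in the second stage.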
Step f3: from the retained record combinations in which n-1 records have at least N identical fields, search for record combinations in which n records have at least N identical fields; if none is found, the process ends;
In this step, when the known record combinations in which n-1 records have at least N identical fields are used to search for record combinations in which n records have at least N identical fields, the following conditions must be met:
1) the n-record combination is formed by joining two (n-1)-record combinations that have n-2 records in common;
2) every subset of n-1 records of the new n-record combination is among the record combinations in which n-1 records have at least N identical fields.
Condition 1) says that an n-record combination must be formed from two combinations that each contain n-1 records and share n-2 of them. For example, a 4-record combination must be formed from two 3-record combinations that have 2 records in common, so that the joined combination contains exactly 4 records.
Condition 2) says that the resulting n-record combination has n subsets of n-1 records, and each of them must be found among the previous (n-1)-record combinations. Only if all n of these subsets exist among the (n-1)-record combinations can the n-record combination be formed. For example, the 4-record combination (1,2,3,4) has four 3-record subsets, (1,2,3), (1,2,4), (1,3,4) and (2,3,4), and all four must be found among the previous 3-record combinations.
Every retained 3-record combination has at least N identical fields, so a 4-record combination formed from them may have at least N identical fields. Conversely, if one subset, say (1,2,3), of a 4-record combination such as (1,2,3,4) has only N-1 identical fields (that is, it is not among the 3-record combinations with at least N identical fields), then the combination (1,2,3,4) cannot have N identical fields: it has at most N-1 identical fields, which are necessarily among the N-1 identical fields of the subset (1,2,3). Condition 2) must therefore hold.
Only when conditions 1) and 2) both hold is it necessary to search for the record combination in which n records have at least N identical fields.
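Conditions 1) and 2) resemble the join and prune steps of level-wise candidate generation. A minimal sketch follows, assuming every input combination has the same size n-1 and is given as a tuple of record numbers; `generate_candidates` is a hypothetical name.

```python
from itertools import combinations

def generate_candidates(prev_combos):
    """Join (n-1)-record combinations into n-record candidates per conditions 1) and 2)."""
    prev = {tuple(sorted(c)) for c in prev_combos}
    candidates = set()
    for a, b in combinations(sorted(prev), 2):
        merged = tuple(sorted(set(a) | set(b)))
        # Condition 1): the two combinations share n-2 records, so the union has n records.
        if len(merged) != len(a) + 1:
            continue
        # Condition 2): every (n-1)-record subset must already be a known combination.
        if all(sub in prev for sub in combinations(merged, len(a))):
            candidates.add(merged)
    return sorted(candidates)

# Made-up 3-record combinations: only (1, 2, 3, 4) has all four subsets present.
three_record = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4), (2, 3, 5)]
four_record = generate_candidates(three_record)
```

The two conditions are necessary but not sufficient, so each candidate still has to be checked for at least N identical fields, as the step describes.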
Step f4: check the n-record combinations and retain those whose identical fields form one of the sample field combinations; at the same time, delete from the (n-1)-record combinations all (n-1)-record subsets of each retained n-record combination; return to step f3.
Each n-record combination found in step f3 is checked. If the combination of its identical fields is among the sample field combinations, the n records are regarded as duplicates and the record combination is retained; otherwise, the n records are regarded as not duplicated and the record combination is deleted.
Moreover, the fact that an n-record combination is a duplicate and the fact that each of its n subsets of n-1 records is a duplicate express the same thing: these n records repeat. Only one expression of this needs to be kept; in other words, the n record combinations of n-1 records are merged into one record combination of n records. Therefore, when an n-record combination is retained, its n corresponding (n-1)-record combinations are deleted.
For example, if the identical fields of the 4-record combination (1,2,3,4) form one of the sample field combinations, the combination (1,2,3,4) is retained and its four corresponding 3-record combinations (1,2,3), (1,2,4), (1,3,4) and (2,3,4) are deleted.
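Step f4 can be sketched as a filter followed by subset removal. The function names `identical_fields` and `retain_and_prune`, the record contents and the sample field combinations below are made up for illustration.

```python
from itertools import combinations

def identical_fields(records, combo):
    """Fields whose value is the same in every record of the combination."""
    first = records[combo[0]]
    return frozenset(f for f, v in first.items()
                     if all(records[r][f] == v for r in combo[1:]))

def retain_and_prune(records, candidates, sample_combos, prev_retained):
    """Keep n-record candidates matching a sample field combination and
    remove their (n-1)-record subsets from the previously retained set."""
    kept, remaining = [], set(prev_retained)
    for combo in candidates:
        if identical_fields(records, combo) in sample_combos:
            kept.append(combo)
            remaining -= set(combinations(combo, len(combo) - 1))
    return kept, remaining

# Hypothetical data: records 1-3 agree on A and B; records 5 and 6 agree on A and B.
records = {
    1: {"A": "x", "B": "y", "C": "z"},
    2: {"A": "x", "B": "y", "C": "z"},
    3: {"A": "x", "B": "y", "C": "q"},
    5: {"A": "m", "B": "n", "C": "o"},
    6: {"A": "m", "B": "n", "C": "p"},
}
sample_combos = {frozenset({"A", "B"}), frozenset({"A", "B", "C"})}
prev_retained = {(1, 2), (1, 3), (2, 3), (5, 6)}
kept, remaining = retain_and_prune(records, [(1, 2, 3)], sample_combos, prev_retained)
```

After the call, the three pairs contained in (1, 2, 3) are gone and only the pair (5, 6) remains at the two-record level, so each group of duplicates is reported once at its largest size.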
Through steps f1 to f4, all possible combinations are counted by step-by-step calculation, which avoids possible omission of records.
Embodiment one
This embodiment is based on the data quality detection method for duplicate data described above and differs from it as follows. As shown in Figure 9, the flowchart of embodiment one of the data quality detection method for duplicate data of the present invention, the data quality detection method further comprises:
Step g: output the retained record combinations and the probability that each record combination repeats; step g follows step f.
The output in this step can take different forms: it can be presented as a visual graphic, or the detection results can be output in a form convenient for merging records. All retained record combinations and their repetition probabilities may be output, or only some of them.
Embodiment two
This embodiment is based on the data quality detection method for duplicate data described above and differs from it as follows. As shown in Figure 10, the flowchart of embodiment two of the data quality detection method for duplicate data of the present invention, step b further comprises:
Step b1: calculate the similarity between values in the same field of the training sample, and treat similar values whose similarity reaches or exceeds a threshold as the same value; step b1 precedes step b2.
Here, an algorithm is used to calculate the similarity between values in each field that are very close to one another, and the data quality analyst defines a threshold that determines the similarity level at which such values are treated as the same value.
The similarity may be calculated with the Levenshtein algorithm, the longest common subsequence algorithm, or a similar algorithm; the specific algorithm can be chosen according to actual needs.
Because of errors, the data in the training sample may contain slight variations, so that the values of the same field in two records are very similar but not identical. Adding this step eliminates such errors and improves the accuracy of the duplicate judgment.
Embodiment three
This embodiment is based on the data quality detection method for duplicate data described above and differs from it as follows. As shown in Figure 11, the flowchart of embodiment three of the data quality detection method for duplicate data of the present invention, the data quality detection method further comprises:
Step a: extract the training sample from the source of the data to be tested; step a precedes step b;
The source of the data to be tested contains many records. Each record has a corresponding number, the record number; record numbers are arranged in order and increase successively. Each record is divided into multiple fields: field 1, field 2, field 3, field 4, and so on. The same field thus has one value in every record, so each field has as many values as there are records (some of these values are identical and some differ), and the numbering of a field's values corresponds to the numbering of the records. For example, the first value of field 1 is the value of field 1 in the first record.
The number of records extracted can be decided by the data quality analyst or determined according to actual needs.
Because the training sample is extracted from the source of the data to be tested, the training sample and the data to be tested come from the same source, which can improve the accuracy of the duplicate judgment.
Embodiment four
This embodiment applies the data quality detection method for duplicate data described above to concrete example data, as follows:
S1: Part of the data to be tested in this example is shown in Figure 12. The training sample is extracted from the source of the data to be tested; the number of records in the sample is predefined by the data quality analyst and is assumed here to be 1000.
S2: Analyze the data values of the training sample and output the record numbers corresponding to each distinct value of each field; partial results are shown in Figure 13.
S2.1: Some values within a field may be very close, differing only in individual characters, such as 1aaaa and 1aaab in Col1. A suitable method can be used to calculate the similarity of these values, and the data quality analyst sets a threshold to judge whether they are the same; here 1aaaa and 1aaab are assumed to be judged the same.
S2.2: Analyze the above results: for each value that has 2 or more record numbers, output combination pairs; partial results are shown in Figure 14. The process is as follows:
S2.2.1: In Col1, the three values 1aaaa/1aaab, 1bbbb and 1eeee each have 2 records and form the 3 combination pairs (1,2), (3,5) and (6,7). Each combination pair is stored as one entry, with its Col1 repetition mark set to 1 and all other marks set to 0.
S2.2.2: In Col2, 2aaaa and 2eeee form the two combination pairs (1,2) and (6,7). These two pairs are already among the pairs formed previously, so the Col2 repetition mark of the existing entries is set to 1, indicating that these pairs also repeat in Col2. 2bbbb forms the three combination pairs (3,4), (3,5) and (4,5). Of these, (3,5) is among the pairs formed previously and is handled as above; (3,4) and (4,5) are new pairs, so new entries are created for them with the Col2 repetition mark set to 1 and all other marks set to 0.
S2.2.3: Col3 to Col5 are processed in the same way. All combination pairs formed make up the duplication model training set.
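The pair-building procedure of S2.2 can be sketched as follows. The function name `build_training_pairs` is hypothetical; the index reproduces the Col1 and Col2 values quoted in S2.2.1 and S2.2.2, except that the Col1 value of record 4 ("1dddd") is invented because it is not given in the text.

```python
from itertools import combinations

def build_training_pairs(index, fields):
    """Return {(r1, r2): {field: 0 or 1}}, with 1 where the pair shares that field's value."""
    pairs = {}
    for field in fields:
        for record_nos in index[field].values():
            if len(record_nos) < 2:
                continue  # a value held by a single record forms no pair
            for pair in combinations(sorted(record_nos), 2):
                # Reuse the existing entry if the pair was formed by an earlier field.
                marks = pairs.setdefault(pair, {f: 0 for f in fields})
                marks[field] = 1
    return pairs

fields = ["Col1", "Col2"]
index = {
    "Col1": {"1aaaa": [1, 2], "1bbbb": [3, 5], "1eeee": [6, 7], "1dddd": [4]},
    "Col2": {"2aaaa": [1, 2], "2bbbb": [3, 4, 5], "2eeee": [6, 7]},
}
training_pairs = build_training_pairs(index, fields)
```

Keying the entries by the ordered pair of record numbers is what prevents a pair from being recorded twice when it repeats in more than one field.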
S3: Output entries of the above training set at random while the data quality analyst starts the model training process. The training proceeds as follows: a certain number of combination pairs are output together with their corresponding data, and the data quality analyst labels each pair as repeated or not repeated according to the content of the records in the pair.
S4: After labeling the output combination pairs, the data quality analyst can choose whether to continue training the model. If so, the process returns to S1 and repeats; if not, the following steps are carried out.
S5: Process the combination pairs labeled over the several rounds of model training. Suppose part of the labeled combination pairs are as shown in Figure 15, where the "whether repeated" column indicates whether the pair was finally labeled as duplicate records, and the remaining fields have the same meaning as in Figure 14.
S5.1: First, over all labeled combination pairs, count for each field the number x of pairs in which the field repeats and the number y of those pairs that are labeled as repeated. In Figure 15, for example, Col1 repeats in 3 pairs, and 3 of them are labeled as repeated; Col4 repeats in 7 pairs, of which 3 are labeled as repeated; and so on. Then calculate y/x for each field, which is interpreted as the probability that records repeat when field k is identical. Suppose the final y/x values for Col1 to Col5 are 0.4, 0.4, 0.4, 0.3 and 0.3 respectively. (Because the data in the figure are only part of the training sample, exact values cannot be calculated from the figure; assumed values are used so that the subsequent steps can proceed, although this may make the final result differ considerably from the correct one.)
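The y/x calculation of S5.1 can be sketched as below. The function name `field_repeat_probabilities` and the labeled pairs are hypothetical; the data are constructed only to reproduce the two counts quoted from Figure 15 (Col1: 3 of 3, Col4: 3 of 7). Returning 0.0 for a field that never repeats is an assumption of the sketch.

```python
def field_repeat_probabilities(labelled_pairs, fields):
    """labelled_pairs: list of (marks, is_duplicate); returns {field: y / x}."""
    probs = {}
    for field in fields:
        # x: pairs in which this field repeats; y: those also labeled as duplicates.
        x = sum(1 for marks, _ in labelled_pairs if marks[field] == 1)
        y = sum(1 for marks, dup in labelled_pairs if marks[field] == 1 and dup)
        probs[field] = y / x if x else 0.0
    return probs

# Seven made-up labeled pairs: Col4 repeats in all of them, Col1 only in the duplicates.
labelled_pairs = (
    [({"Col1": 1, "Col4": 1}, True)] * 3
    + [({"Col1": 0, "Col4": 1}, False)] * 4
)
probs = field_repeat_probabilities(labelled_pairs, ["Col1", "Col4"])
```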
S5.2: The data quality analyst sets a threshold that defines how high the repetition probability must be for records to be judged duplicates; assume this threshold is 0.75. Then, for records with k identical fields, calculate the probability that they are the same record, compare this value with the threshold, and keep the field combinations that reach the threshold, as shown in Figure 16.
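As a worked check under the assumed per-field values from S5.1 (0.4, 0.4, 0.4, 0.3, 0.3) and the threshold 0.75, the inclusion-exclusion formula gives the following for three illustrative field combinations. The combinations are chosen here for illustration; the contents of Figure 16 are not reproduced.

```latex
\begin{align*}
p(\mathrm{Col1},\mathrm{Col2}) &= 0.4 + 0.4 - 0.4 \cdot 0.4 = 0.64 < 0.75 \\
p(\mathrm{Col1},\mathrm{Col2},\mathrm{Col3}) &= 3(0.4) - 3(0.4)^2 + (0.4)^3 = 0.784 \ge 0.75 \\
p(\mathrm{Col1},\mathrm{Col2},\mathrm{Col4}) &= (0.4 + 0.4 + 0.3) - (0.16 + 0.12 + 0.12) + 0.048 = 0.748 < 0.75
\end{align*}
```

Under these assumed values the largest two-field probability is 0.64, so no two-field combination reaches the threshold, which is consistent with the later remark in S7.1 that the retained field combinations contain at least three fields.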
The above is the duplication model training process. The trained model is then used to perform duplicate detection.
S6: Receive the data to be tested and the field combinations finally retained. Then analyze the values of the data to be tested and output the record numbers corresponding to each distinct value of each field; partial results are shown in Figure 13. Some values within a field may be very close, differing only in individual characters, such as 1aaaa and 1aaab in Col1. A suitable method can be used to calculate the similarity of these values, and the data quality analyst sets a threshold to judge whether they are the same; here 1aaaa and 1aaab are assumed to be judged the same.
S7: Perform duplicate detection. The detailed process is as follows:
S7.1: Because the field combinations finally retained contain at least three fields, records finally detected as duplicates must also have at least three identical fields. First search for two-record combinations with at least three identical fields. The result is {(1,2),3}, {(3,5),4}, {(6,7),5}, {(3,4),4}, {(4,5),3}, where the number inside the braces but outside the parentheses indicates how many fields repeat.
S7.2: Check whether the combination of identical fields of each record combination above is among the field combinations finally retained by the duplicate decision condition generation unit 14; if not, delete the record combination. As a result, {(4,5),3} is deleted.
S7.3: Among the remaining record combinations, search for combinations in which n records have at least three identical fields, given the combinations from the previous step in which n-1 records have at least three identical fields. Then check whether these new combinations have at least three identical fields.
S7.4: Check whether the combination of identical fields of each n-record combination above is among the field combinations finally retained; if not, delete the record combination. If it is, not only retain the combination but also delete each of its (n-1)-record subsets from the previous step's combinations in which n-1 records have at least three identical fields.
S7.5: When no combination in which n records have at least three identical fields can be found, the detection process ends; otherwise, return to S7.3.
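In this embodiment, the detection ends after S7.2: the only three-record combination that could be joined from the remaining pairs, (3,4,5), would require the pair (4,5), which has been deleted.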
S8: output detections result, can represent with visual pattern, also output detections result can be convenient to merge record.
S8.1: all more than 3 that can export that S7 step retains to be recorded to the identical combination of rare three fields and these combinations may be the probability repeating to record, and the probability that the record in combination may repeat between any two.
S8.2: 2 that can export the S7 step reservation that S8.1 does not export are recorded to the identical combination of rare three fields, and these combinations may be the probability repeating to record.
As the present embodiment will export (1,2), (3,5), (6,7), (3,4) right record content (this result is realized by the hypothesis of intermediate data, and therefore this result and the right result of reality differ greatly) is combined, and the possibility that these records repeat.
Embodiment five
This embodiment is the data quality detection device for duplicate data that corresponds to the data quality detection method for duplicate data described above.
Figure 17 is the structural diagram of the data quality detection device for duplicate data of the present invention; the device comprises:
Training set generation unit 2, which analyzes the data values of the training sample and generates the model training set;
Figure 18 is the structural diagram of the training set generation unit of the data quality detection device for duplicate data of the present invention; training set generation unit 2 comprises:
Record number statistics module 22, which analyzes the data values of the training sample and counts the record numbers corresponding to each distinct value of each field;
Record number processing module 23, which processes the record numbers corresponding to each distinct value of each field and generates the model training set;
Generating the model training set converts the analysis of records into an analysis of the identical fields of records, which speeds up subsequent processing.
Figure 19 is the structural diagram of the record number processing module of the data quality detection device for duplicate data of the present invention; record number processing module 23 comprises:
Field-one two-record marking submodule 231, which tallies the values of field one that correspond to exactly two records; the two records of each such value form a combination pair, and the pair is recorded with a field repetition mark added in field one;
Field-one multi-record marking submodule 232, which tallies the values of field one that correspond to three or more records; every two of the records of each such value form a combination pair, and each pair is recorded with a field repetition mark added in field one;
Field-two marking submodule 233, which tallies the values of field two that correspond to two or more records; every two of the records of each such value form a combination pair. If the pair is identical to an already recorded pair, a field repetition mark is added in field two of the recorded pair; if it differs from all recorded pairs, the pair is recorded with a field repetition mark added in field two;
Multi-field marking submodule 234, which processes the remaining fields in the same manner as field-two marking submodule 233; all combination pairs formed make up the model training set.
Submodules 231 to 234 are only one possible arrangement for generating the model training set; this arrangement generates the model training set quickly while avoiding omitted or duplicated combination pairs.
Sample record marking unit 3, which analyzes each combination pair of the model training set and, manually or by an algorithm, labels the two records of the pair as repeated or not repeated; it then selects whether to continue training: if so, the training sample is redetermined and control returns to training set generation unit 2; otherwise control passes to sample combination screening unit 4.
Each combination pair in the model training set corresponds to two records. The pair is output and the actual data of the two records are compared to confirm whether they are the same: if so, the pair is labeled as repeated; if not, as not repeated. Whether the two records of a pair repeat can be judged by the quality analyst from the concrete data of the two records, or determined from a similarity calculated by an algorithm.
Whether to continue or repeat training can then be decided from the comparison results of the output pairs. If training is to continue, the training sample is redetermined and control returns to training set generation unit 2, after which it is determined whether the two records of each new combination pair repeat; the results of the several training rounds are combined in the subsequent analysis to improve the accuracy of the judgment. If not, control passes to sample combination screening unit 4.
Sample combination screening unit 4, which calculates the probability that records repeat when one or more fields repeat, and selects the field combinations with higher probability as the sample field combinations;
Figure 20 is the structural diagram of the sample combination screening unit of the data quality detection device for duplicate data of the present invention; sample combination screening unit 4 comprises:
Single-field repetition calculation module 41, which, for each field, takes as the divisor the number of combination pairs marked as repeating in that field and as the dividend the number of those pairs that are also labeled as repeated records; the quotient is the probability that records repeat when that field repeats;
First, over all labeled combination pairs, count for each field the number x of pairs in which the field repeats and the number y of those pairs that are labeled as repeated; then calculate y/x for each field, which is interpreted as the probability that records repeat when that field is identical.
Multi-field repetition calculation module 42, which calculates, from the single-field probabilities, the probability that records repeat when multiple fields repeat;
The probability that records repeat when multiple fields repeat is calculated by the formula:
p(1,2,\ldots,k) = \sum p_i + (-1)^{1} \sum_{i_1<i_2} p_{i_1} p_{i_2} + \cdots + (-1)^{k-1} \sum_{i_1<i_2<\cdots<i_k} p_{i_1} p_{i_2} \cdots p_{i_k}
In the formula, p(1,2,...,k) is the probability that records repeat when fields 1, 2, ..., k repeat; that is, if fields 1, 2, ..., k are identical in two records, the probability that the two records repeat is p(1,2,...,k). p_i, p_{i_1}, p_{i_2} and p_{i_k} are the probabilities that records repeat when field i, i_1, i_2 and i_k repeats, respectively. i_1 and i_2 denote the sequence numbers of any two of the fields 1, 2, ..., k; i_k denotes the sequence number of the k-th field among the fields 1, 2, ..., k, that is, the sequence number of field k.
The idea behind the formula is as follows. From the k fields whose probability is to be calculated, take one field: there are k ways to do so, and each corresponds to a single probability p_i. Take two fields: there are C(k,2) ways, and each corresponds to the product of two probabilities p_{i_1} p_{i_2}. Take k fields: there is C(k,k) = 1 way, corresponding to the product of k probabilities p_{i_1} p_{i_2} ... p_{i_k}. The coefficient applied to the sum over each group of selections is determined by the number of fields taken: +1 when an odd number is taken and -1 when an even number is taken. Adding these signed sums gives the probability that records repeat when the k fields repeat.
Beneficial effect: the formula allows the probability that records repeat when multiple fields repeat to be calculated quickly, which speeds up the judgment, saves time and improves the efficiency of data quality detection; the formula is also simple and saves system resources.
Threshold screening module 43, which sets a threshold and selects the field combinations whose record repetition probability is greater than or equal to the threshold as the sample field combinations.
The multi-field probabilities calculated by multi-field repetition calculation module 42 must be screened against a threshold. The threshold may be set manually according to actual conditions, or it may be determined by a computing device through rigorous calculation or derived from a statistical comparison of a large amount of data.
The size of the threshold affects the accuracy of the data quality detection of duplicate data according to the present invention: the larger the threshold, the higher the accuracy of the detection.
With the formula, judging whether different records repeat is converted into calculating a repetition probability. This avoids judging every pair of records separately: the probability only needs to be calculated for the valid combination pairs, which greatly improves the efficiency of the judgment.
Detect data analysis unit 5, the value of data to be tested is analyzed, exports the record number that each different value of each field is corresponding;
This element is similar to training set generation unit 2, and what difference was only that training set generation unit 2 processes is training sample, this cell processing be data to be tested.
As shown in figure 21, it is the structural drawing of the data quality checking device detection data analysis unit of repeating data of the present invention; Wherein, described detection data analysis unit 5 comprises:
Data similarity calculation module 51, calculates similarity to the value in the same field of data to be tested, and similar value similarity being met or exceeded threshold value is as identical value.
Data record statistical module 52, analyzes the data value of data to be tested, adds up the record number that each different value of each field is corresponding;
Detection data screening unit 6 performs duplicate detection on the analyzed data to be tested according to the sample field combinations, and selects the record combinations whose repeated fields all satisfy the sample field combinations;
This unit performs the duplicate detection. Based on the analysis result of sample combination screening unit 4, it first checks whether the repeated fields of two records satisfy the sample field combinations. It then generates three-record combinations from the qualifying two-record combinations and checks whether the repeated fields of the three records satisfy the sample field combinations. This process is repeated until no further record combination satisfying the sample field combinations can be found.
In this way, compared with general duplicate detection in which every pair of records must be checked, this device calculates the repetition likelihood of field combinations and thereby converts detection between records into detection of record combinations that are identical on the corresponding field combinations. There is no need to compare the repetition likelihood of arbitrary pairs of records, which shortens the time required and improves detection efficiency. Moreover, the device is not limited to detecting two identical records; it can also detect two records that are only partially identical, by calculating their repetition probability and deciding according to the threshold whether they are duplicates. With this device, the data quality analyst can define the conditions for judging whether two records are identical.
In addition, through the selection of training samples the device automatically assigns weights to different fields, which provides a degree of flexibility.
As shown in Figure 22, which is a structural diagram of the detection data screening unit of the data quality detection device for duplicate data of the present invention, the detection data screening unit 6 comprises:
Field number confirmation module 61, which determines the minimum value N of the number of fields in each combination of the sample field combinations;
In general, the number of record combinations having repeated fields decreases as the number of repeated fields increases. The minimum value N of the number of fields in each combination of the sample field combinations is therefore determined, so that record combinations with fewer than N repeated fields need not be searched. This reduces the number of record combinations to be searched and improves search efficiency.
For example, if the smallest sample field combination has 4 repeated fields, only record combinations with at least 4 repeated fields need to be searched.
Two-record combination detection module 62, which searches the data to be tested for two-record combinations having at least N identical fields, and detects and retains the record combinations that fall within the sample field combinations;
N is the minimum number of fields in any combination of the sample field combinations. If a record combination of the data to be tested has fewer than N identical fields, it certainly does not fall within the sample field combinations. Searching only for record combinations with at least N identical fields therefore reduces search time and improves search efficiency.
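A minimal Python sketch of the two-record search is given below. It assumes records are dictionaries of field names to values, and it interprets "falling within the sample field combinations" as meaning that the set of identical fields contains at least one sample field combination; the original text does not spell out this test, so that interpretation is an assumption. The function name `find_duplicate_pairs` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def find_duplicate_pairs(records, sample_combos):
    """Return {(a, b): identical fields} for the retained two-record combinations."""
    sample_sets = [frozenset(c) for c in sample_combos]
    if not sample_sets:
        return {}
    n_min = min(len(s) for s in sample_sets)  # the minimum field count N

    # Group record numbers by (field, value) so that only records sharing
    # a value are ever paired; unrelated records are never compared.
    groups = defaultdict(list)
    for record_no, record in enumerate(records):
        for field, value in record.items():
            groups[(field, value)].append(record_no)

    # Collect, for every pair that shares something, the set of identical fields.
    shared = defaultdict(set)
    for (field, _value), record_nos in groups.items():
        for a, b in combinations(record_nos, 2):
            shared[(a, b)].add(field)

    # Keep pairs with at least N identical fields that cover a sample combination
    # (assumed interpretation of "falling within the sample field combinations").
    return {
        pair: fields
        for pair, fields in shared.items()
        if len(fields) >= n_min and any(s <= fields for s in sample_sets)
    }
```

Because pairs are generated only inside groups of records that share a field value, the cost depends on how many records share values rather than on the total number of record pairs.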
Multi-record combination search module 63, which searches, among the retained record combinations, for combinations of n records having at least N identical fields on the basis of the known combinations of n-1 records having at least N identical fields; if none is found, the process ends;
In this module, combinations of n records having at least N identical fields are searched for on the basis of the known combinations of n-1 records having at least N identical fields, subject to the following conditions:
1) a combination of n records is formed by joining two combinations of n-1 records that have n-2 records in common;
2) every (n-1)-record subset of the new n-record combination is among the combinations of n-1 records having at least N identical fields.
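The two conditions above resemble a join-and-prune candidate generation step, and can be sketched in Python as follows. The sketch assumes every input combination has the same size n-1 and is given as a collection of record numbers; the function name `join_record_combinations` is hypothetical. It only produces candidates: each candidate would still have to be checked for at least N identical fields and against the sample field combinations.

```python
from itertools import combinations


def join_record_combinations(prev_combos):
    """Build candidate n-record combinations from known (n-1)-record combinations."""
    prev = {frozenset(c) for c in prev_combos}
    candidates = set()
    for a, b in combinations(prev, 2):
        union = a | b
        # Condition 1: the two combinations share n-2 records,
        # so their union holds exactly n records.
        if len(union) != len(a) + 1:
            continue
        # Condition 2: every (n-1)-record subset of the candidate is already known.
        if all(frozenset(sub) in prev for sub in combinations(union, len(a))):
            candidates.add(union)
    return candidates
```

For example, from the pairs (0, 1), (0, 2), (1, 2) and (2, 3), the only candidate triple is {0, 1, 2}; the triples {0, 2, 3} and {1, 2, 3} are rejected because the pairs (0, 3) and (1, 3) are not known.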
Multi-record combination detection module 64, which detects and retains the combinations of n records that fall within the sample field combinations, and at the same time deletes, from the combinations of n-1 records, all (n-1)-record subsets of the retained n-record combinations; control then returns to multi-record combination search module 63.
Through field number confirmation module 61 to multi-record combination detection module 64, all possible field combinations can be accounted for by step-by-step calculation, which avoids possible omission of records.
Embodiment six
The data quality detection device for duplicate data is as described above. The present embodiment differs in that, as shown in Figure 23, which is a structural diagram of embodiment six of the data quality detection device for duplicate data of the present invention, the data quality detection device further comprises:
Detection result output unit 7, which outputs the retained record combinations and the probability that each record combination is a duplicate; detection result output unit 7 follows detection data screening unit 6.
The output of this unit may take different forms: it may be presented as a visual graphic, or the detection results may be output in a form that facilitates merging records. The unit may output all retained record combinations and their repetition probabilities, or only some of the retained record combinations and their repetition probabilities.
Embodiment seven
The data quality detection device for duplicate data is as described above. The present embodiment differs in that, as shown in Figure 24, which is a structural diagram of embodiment seven of the data quality detection device for duplicate data of the present invention, the training set generation unit 2 further comprises:
Sample similarity calculation module 21, which calculates the similarity of values within the same field of the training samples and treats similar values whose similarity meets or exceeds a threshold as identical values; sample similarity calculation module 21 precedes record number statistics module 22.
Data in the training samples may vary slightly because of errors, so that the values of the same field in two records are very similar but not identical. Adding this module eliminates such errors and improves the accuracy of the duplicate judgment.
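The similarity step can be illustrated with a short Python sketch. The original text does not name a similarity measure, so the use of `difflib.SequenceMatcher` from the Python standard library is an assumption made purely for illustration, as are the function name `normalize_similar_values` and the default threshold of 0.8.

```python
from difflib import SequenceMatcher


def normalize_similar_values(values, threshold=0.8):
    """Replace each value by the first earlier value it is sufficiently similar to."""
    representatives = []  # distinct values seen so far
    normalized = []
    for value in values:
        for rep in representatives:
            # ratio() returns a similarity score between 0.0 and 1.0.
            if SequenceMatcher(None, rep, value).ratio() >= threshold:
                normalized.append(rep)
                break
        else:
            # No sufficiently similar representative: treat as a new distinct value.
            representatives.append(value)
            normalized.append(value)
    return normalized
```

Applied to one field's values before the record statistics are gathered, such a step lets near-identical spellings count as the same value. The result depends on the order of the input, because each value is compared only with the representatives seen before it.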
Embodiment eight
The data quality detection device for duplicate data is as described above. The present embodiment differs in that, as shown in Figure 25, which is a structural diagram of embodiment eight of the data quality detection device for duplicate data of the present invention, the data quality detection device further comprises:
Training sample extraction unit 1, which extracts the training samples from the source of the data to be tested;
Because the training samples are extracted from the source of the data to be tested, the training samples and the data to be tested share the same origin, which can improve the accuracy of the duplicate judgment.
The foregoing describes only preferred embodiments of the present invention and is illustrative rather than restrictive. Those skilled in the art will understand that many changes, modifications, and even equivalents may be made within the spirit and scope defined by the claims of the present invention, all of which fall within the scope of protection of the present invention.

Claims (10)

1. A data quality detection method for duplicate data, characterized by comprising:
Step b, analyzing the data values of training samples comprising a plurality of records, and generating a model training set;
Step c, analyzing each combination pair in the model training set, and marking, manually or by an algorithm, the two records corresponding to the combination pair as record repetition or record non-repetition; then selecting whether to continue training; if so, re-determining the training samples and returning to step b; otherwise proceeding to step d;
Step d, calculating the probability of record repetition given that one or more fields are repeated, and selecting the field combinations with higher probability as sample field combinations;
Step e, analyzing the values of the data to be tested, and outputting the record numbers corresponding to each distinct value of each field;
Step f, performing duplicate detection on the analyzed data to be tested according to the sample field combinations, and selecting the record combinations whose repeated fields all satisfy the sample field combinations.
2. The data quality detection method according to claim 1, characterized in that the data quality detection method further comprises:
Step a, extracting the training samples from the source of the data to be tested; step a precedes step b.
3. The data quality detection method according to claim 2, characterized in that the data quality detection method further comprises:
Step g, outputting the retained record combinations and the probability that each record combination is a duplicate; step g follows step f.
4. The data quality detection method according to claim 1, 2 or 3, characterized in that step b comprises:
Step b2, analyzing the data values of the training samples, and tallying the record numbers corresponding to each distinct value of each field;
Step b3, processing the record numbers corresponding to each distinct value of each field, and generating the model training set.
5. The data quality detection method according to claim 4, characterized in that step b3 comprises:
Step b31, counting the values of field one that correspond to two records; the two records corresponding to each such value form one combination pair, and the combination pair is recorded with a field-repetition mark added in field one;
Step b32, counting the values of field one that correspond to three or more records; each pairwise combination of the records corresponding to each such value forms one combination pair, and the combination pair is recorded with the field-repetition mark added in field one;
Step b33, counting the values of field two that correspond to two or more records; each pairwise combination of the records corresponding to each such value forms one combination pair; if the combination pair is identical to an already recorded combination pair, the field-repetition mark is added in field two of the already recorded combination pair; if the combination pair differs from the already recorded combination pairs, the combination pair is recorded with the field-repetition mark added in field two;
Step b34, processing the other fields according to step b33, all the combination pairs thus formed constituting the model training set.
6. The data quality detection method according to claim 1, 2 or 3, characterized in that step d comprises:
Step d1, taking as the divisor the number of combination pairs marked with field repetition for a given field, and as the dividend the number of those combination pairs that are also marked with record repetition, the quotient being the probability of record repetition given that this field is repeated; and thereby calculating, for each field, the probability of record repetition given field repetition;
Step d2, calculating, from the probabilities of record repetition given single-field repetition, the probability of record repetition given multi-field repetition;
Step d3, setting a threshold, and selecting, as sample field combinations, the field combinations whose record-repetition probability is greater than or equal to the threshold.
7. The data quality detection method according to claim 6, characterized in that the formula for calculating the probability of record repetition given multi-field repetition is:
$$p(1,2,\ldots,k) = \sum_{i} p_i + (-1)^{1} \sum_{i_1 < i_2} p_{i_1} p_{i_2} + \cdots + (-1)^{k-1} \sum_{i_1 < i_2 < \cdots < i_k} p_{i_1} p_{i_2} \cdots p_{i_k}$$
In the formula, $p(1,2,\ldots,k)$ is the probability of record repetition given that fields 1, 2, ..., k are repeated; $p_i$, $p_{i_1}$, $p_{i_2}$ and $p_{i_k}$ are the probabilities of record repetition given that fields $i$, $i_1$, $i_2$ and $i_k$, respectively, are repeated; $i_1$ and $i_2$ denote the indices of any two of the fields 1, 2, ..., k; and $i_k$ denotes the index of field k.
8. The data quality detection method according to claim 1, 2 or 3, characterized in that
step f comprises: Step f1, determining the minimum value N of the number of fields in each combination of the sample field combinations;
Step f2, searching the data to be tested for two-record combinations having at least N identical fields, and detecting and retaining the record combinations that fall within the sample field combinations;
Step f3, among the retained record combinations, searching for record combinations of n records having at least N identical fields on the basis of the known record combinations of n-1 records having at least N identical fields; ending if none is found;
Step f4, detecting and retaining the record combinations of n records that fall within the sample field combinations, and at the same time deleting, from the record combinations of n-1 records, all (n-1)-record subsets of the retained record combinations of n records; returning to step f3.
9. The data quality detection method according to claim 8, characterized in that, in step f3, the search must satisfy the following conditions:
a record combination of n records is formed by joining two record combinations of n-1 records, the two record combinations of n-1 records having n-2 records in common;
every (n-1)-record subset of the new record combination of n records is among the record combinations of n-1 records having at least N identical fields.
10. A data quality detection device for duplicate data corresponding to the data quality detection method of any one of claims 1-9, characterized by comprising:
a training set generation unit, which analyzes the data values of training samples comprising a plurality of records and generates a model training set;
a sample record duplicate-marking unit, which analyzes each combination pair in the model training set, the two records corresponding to the combination pair being marked, manually or by an algorithm, as record repetition or record non-repetition; it is then selected whether to continue training; if so, the training samples are re-determined and control returns to the training set generation unit; otherwise control passes to a sample combination screening unit;
the sample combination screening unit, which calculates the probability of record repetition given that one or more fields are repeated, and selects the field combinations with higher probability as sample field combinations;
a detection data analysis unit, which analyzes the values of the data to be tested and outputs the record numbers corresponding to each distinct value of each field;
a detection data screening unit, which performs duplicate detection on the analyzed data to be tested according to the sample field combinations, and selects the record combinations whose repeated fields all satisfy the sample field combinations.
CN201510925893.5A 2015-12-11 2015-12-11 A kind of data quality checking method and device of repeated data Active CN105488212B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510925893.5A CN105488212B (en) 2015-12-11 2015-12-11 A kind of data quality checking method and device of repeated data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510925893.5A CN105488212B (en) 2015-12-11 2015-12-11 A kind of data quality checking method and device of repeated data

Publications (2)

Publication Number Publication Date
CN105488212A true CN105488212A (en) 2016-04-13
CN105488212B CN105488212B (en) 2019-06-14

Family

ID=55675187

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510925893.5A Active CN105488212B (en) 2015-12-11 2015-12-11 A kind of data quality checking method and device of repeated data

Country Status (1)

Country Link
CN (1) CN105488212B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040019593A1 (en) * 2002-04-11 2004-01-29 Borthwick Andrew E. Automated database blocking and record matching
US20040107189A1 (en) * 2002-12-03 2004-06-03 Lockheed Martin Corporation System for identifying similarities in record fields
CN104424202A (en) * 2013-08-21 2015-03-18 北大方正集团有限公司 Method and system for performing duplication checking on customer information in customer relationship management (CRM) system
CN104317801A (en) * 2014-09-19 2015-01-28 东北大学 Data cleaning system and method for aiming at big data
CN104850624A (en) * 2015-05-20 2015-08-19 华东师范大学 Similarity evaluation method of approximately duplicate records

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156315A (en) * 2016-07-01 2016-11-23 中国人民解放军装备学院 A kind of data quality monitoring method judged based on disaggregated model
CN106156315B (en) * 2016-07-01 2019-05-17 中国人民解放军装备学院 A kind of data quality monitoring method based on disaggregated model judgement
CN106357275A (en) * 2016-08-30 2017-01-25 国网冀北电力有限公司信息通信分公司 Huffman compression method and device
CN106357275B (en) * 2016-08-30 2019-12-17 国网冀北电力有限公司信息通信分公司 Huffman compression method and device
CN107577549A (en) * 2017-08-24 2018-01-12 郑州云海信息技术有限公司 It is a kind of to store the method for testing for deleting function again
CN109257694A (en) * 2018-08-23 2019-01-22 东南大学 A kind of vehicle OD matrix division methods based on RFID data
CN111949641A (en) * 2020-08-06 2020-11-17 武汉理工光科股份有限公司 Method and system for cleaning and synchronizing data between multi-stage platforms
CN111949641B (en) * 2020-08-06 2023-07-14 武汉理工光科股份有限公司 Method and system for cleaning and synchronizing data among multiple stages of platforms
EP4197131A4 (en) * 2020-08-12 2024-10-23 D Fend Solutions Ad Ltd Detection of repetitive data signals
CN112347320A (en) * 2020-11-05 2021-02-09 杭州数梦工场科技有限公司 Associated field recommendation method and device for data table field
CN112347320B (en) * 2020-11-05 2024-08-06 杭州数梦工场科技有限公司 Associated field recommendation method and device for data table field

Also Published As

Publication number Publication date
CN105488212B (en) 2019-06-14

Similar Documents

Publication Publication Date Title
CN105488212A (en) Data quality detection method and device of duplicated data
CN106708966B (en) Junk comment detection method based on similarity calculation
CN103365998B (en) A kind of similar character string search method
CN104216349B (en) Utilize the yield analysis system and method for the sensing data of manufacturing equipment
CN106021545B (en) Method for vehicle remote diagnosis and spare part retrieval
CN103309984B (en) The method and apparatus that data process
CN103823896A (en) Subject characteristic value algorithm and subject characteristic value algorithm-based project evaluation expert recommendation algorithm
CN102982416A (en) Universal implementation model for performance assessment
CN106447075B (en) Industrial electricity demand prediction method and system
CN106527757A (en) Input error correction method and apparatus
CN103226714B (en) Based on the sparse coding method strengthened compared with unitary Item coefficient
CN109325510B (en) Image feature point matching method based on grid statistics
CN107301210A (en) A kind of data processing method
CN107797147A (en) A kind of quick elimination method of earthquake first arrival exceptional value
CN103077228B (en) A kind of Fast Speed Clustering based on set feature vector and device
CN104463601A (en) Method for detecting users who score maliciously in online social media system
CN111507260A (en) Video similarity rapid detection method and detection device
CN108038211A (en) A kind of unsupervised relation data method for detecting abnormality based on context
KR20200019741A (en) Data Analysis Support System and Data Analysis Support Method
CN105740434A (en) Network information scoring method and device
CN104133836B (en) A kind of method and device realizing change Data Detection
CN103902798A (en) Data preprocessing method
CN106815209A (en) A kind of Uighur agricultural technology term recognition methods
CN107944946A (en) Commercial goods labels generation method and device
CN106250917A (en) A kind of based on the time-sequence rating rejecting outliers method accelerating near-end gradient PCA

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 510630 A-701, 906 hi tech building, Tianhe North Road, Guangzhou, Guangdong

Applicant after: GUANGDONG KINGPOINT DATA SCIENCE AND TECHNOLOGY Co.,Ltd.

Address before: 510630 A-701, 906 hi tech building, Tianhe North Road, Guangzhou, Guangdong

Applicant before: GUANGZHOU KINGPOINT COMPUTER TECHNOLOGY CO.,LTD.

GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A data quality detection method and device for repeated data

Effective date of registration: 20211022

Granted publication date: 20190614

Pledgee: Agricultural Bank of China Limited Dongcheng Branch of Guangzhou

Pledgor: GUANGDONG KINGPOINT DATA SCIENCE AND TECHNOLOGY Co.,Ltd.

Registration number: Y2021440000320

PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20221230

Granted publication date: 20190614

Pledgee: Agricultural Bank of China Limited Dongcheng Branch of Guangzhou

Pledgor: GUANGDONG KINGPOINT DATA SCIENCE AND TECHNOLOGY Co.,Ltd.

Registration number: Y2021440000320

PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A data quality detection method and device for duplicate data

Effective date of registration: 20230131

Granted publication date: 20190614

Pledgee: Agricultural Bank of China Limited Dongcheng Branch of Guangzhou

Pledgor: GUANGDONG KINGPOINT DATA SCIENCE AND TECHNOLOGY Co.,Ltd.

Registration number: Y2023440020017