CN109947752A - An automatic data cleaning method based on DeepDive - Google Patents


Info

Publication number
CN109947752A
CN109947752A
Authority
CN
China
Prior art keywords
data
attribute
rule
network
deepdive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910077102.6A
Other languages
Chinese (zh)
Inventor
李卫榜
李玲
谈文蓉
崔梦天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Minzu University
Original Assignee
Southwest Minzu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Minzu University filed Critical Southwest Minzu University
Priority to CN201910077102.6A priority Critical patent/CN109947752A/en
Publication of CN109947752A publication Critical patent/CN109947752A/en
Pending legal-status Critical Current


Abstract

The invention discloses an automatic data cleaning method based on DeepDive, comprising: (1) comparing the scale of the raw data against a given threshold and, if the scale exceeds the threshold, obtaining sampled data by random sampling; (2) learning a Bayesian network over the attributes from the raw or sampled data; (3) converting the learned Bayesian network into first-order predicate logic rules; (4) computing the weight of each first-order predicate logic rule using mutual information theory and converting the weighted rules into a Markov logic network; (5) generating DeepDive rules from the Markov logic network; (6) performing probabilistic inference over erroneous/missing data with DeepDive to obtain, for each tuple attribute, the probability of taking each candidate value; (7) cleaning the original dirty data according to the inference results. The invention can clean data automatically without ready-made data-quality patterns/rules and without human intervention, and can effectively improve the efficiency and quality of data cleaning.

Description

An automatic data cleaning method based on DeepDive
Technical field
The present invention relates to the technical field of data processing, and in particular to an automatic data cleaning method based on DeepDive.
Background technique
Dirty data is pervasive in the real world, and cleaning it is a long-standing problem. In the big-data era, the importance of data cleaning is even more pronounced. Detecting and repairing errors in dirty data is one of the major challenges facing data analysis: low data quality can greatly degrade the accuracy of analysis results. In general, data cleaning comprises two stages: the first is error detection, which finds the erroneous or abnormal data contained in a dataset; the second is error repair, which fixes those errors. Most existing data cleaning methods rely on existing constraint rules or patterns for detection and repair, and suffer from the following limitations. (1) Existing methods usually require manual participation or additional information to detect and repair errors. This is workable when data are small, but data nowadays grow exponentially, so these methods are not suitable for massive data; moreover, their labor cost is considerable and grows with data scale. (2) Other methods depend on off-the-shelf patterns/rules for repair, yet for many real datasets and application scenarios obtaining such patterns/rules is not necessarily feasible. (3) Some methods learn patterns/rules from clean data, or directly from data containing dirty records. For particular scenarios, building a data cleaning corpus usually requires user involvement or applies only to a specific application; and although such methods can learn patterns/rules, the resulting cleaning quality is often unsatisfactory.
Summary of the invention
Technical problem to be solved
To achieve automatic data cleaning in the absence of ready-made patterns/rules and human participation, the present invention proposes an automatic data cleaning method based on DeepDive, so as to solve the aforementioned problems in the prior art.
Technical solution
An automatic data cleaning method based on DeepDive, characterized in that the steps are as follows:
Step 1: data preprocessing. Judge the scale of the raw data containing dirty data; if the scale exceeds a threshold, sample the raw data, otherwise keep the raw data as the working data.
Step 2: data model learning. Learn the dependencies between attributes from the data obtained in Step 1, capturing the implicit, non-absolute or relatively weak dependencies, and represent them as a Bayesian network.
Step 3: data model translation. Define first-order predicates, including the "equivalence" and "matching" predicates, and automatically convert the Bayesian network over attributes obtained in Step 2 into first-order predicate logic rules, obtaining a first-order predicate logic rule set based on the Bayesian network between data attributes.
Step 4: use mutual information theory to compute the weight of each rule in the first-order predicate logic rule set obtained in Step 3, and convert the weighted first-order predicate logic rules into a Markov logic network.
Step 5: generate DeepDive rules from the Markov logic network; the Boolean queries in the rules are obtained from the first-order predicate logic in the Markov logic network, and the rule weights are the weights of the Markov logic network.
Step 6: perform probabilistic inference of erroneous/missing data based on DeepDive, constructing a factor graph as the inference model and estimating, by Gibbs sampling, the probability of each tuple attribute taking each candidate value.
Step 7: according to the inference results, detect the dirty data contained in the data and compare the maximum probability among the candidate values with a set threshold; if the maximum probability exceeds the threshold, repair automatically, otherwise leave the data unchanged.
Beneficial effects
The automatic data cleaning method based on DeepDive proposed by the present invention learns, depending on the data scale, a Bayesian network over the data attributes from the raw data or from a sample of it; this network reflects the dependencies between different attributes in the data. By defining first-order predicates, the learned Bayesian network over the attributes is converted into first-order predicate logic rules; the weight of each rule is computed from the raw data via mutual information, yielding a Markov logic network with one weight per first-order predicate logic rule; DeepDive rules are generated from the Markov logic network; probabilistic inference of erroneous/missing data is performed with DeepDive over the raw or sampled data, estimating by Gibbs sampling the probability of each tuple attribute taking each candidate value; according to the inference results, the most probable value is selected as the repair value for an attribute, erroneous data are repaired and missing data are filled, thereby completing the automatic cleaning of the data. The present invention therefore requires neither ready-made data cleaning models/rules nor human intervention when cleaning data. On one hand, the relationships between data attributes are modeled probabilistically, so a certain proportion of dirty data in the training data can be tolerated; on the other hand, the application of statistical relational learning and inference helps uncover latent errors in the data and improves the cleaning result. The present invention solves the prior-art problems of requiring ready-made data-quality patterns/rules or human intervention and, because the cleaning is fully automatic, it is better suited to cleaning dirty data at massive scale, effectively improving the efficiency and quality of data cleaning.
Brief description of the drawings
Fig. 1: flow chart of the automatic data cleaning method based on DeepDive
Fig. 2: a factor graph containing three variable nodes and one factor node
Specific embodiment
The invention will now be further described with reference to the embodiments and the accompanying drawings:
The invention proposes an automatic data cleaning method based on DeepDive; its flow chart is shown in Fig. 1. The technical solution adopted to solve the technical problem comprises the following contents:
1. Data preprocessing
Set a threshold τ on the scale of the data to be cleaned and compute the scale n of that data, i.e. the number of tuples it contains. If n > τ, randomly sample the raw data; otherwise keep the raw data as the working data.
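The preprocessing decision above can be sketched as follows; `tau` and `sample_size` are illustrative parameter names, since the patent does not fix concrete values:

```python
import random

def preprocess(tuples, tau, sample_size):
    """Step 1 sketch: if the dataset's scale exceeds the threshold tau,
    draw a uniform random sample of sample_size tuples; otherwise keep
    the raw data unchanged."""
    if len(tuples) > tau:
        return random.sample(tuples, sample_size)
    return tuples
```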
2. Data model learning
In the absence of ready-made data cleaning patterns/rules, data model learning is performed on the data obtained after preprocessing to discover the non-absolute or relatively weak dependencies implicit in the data, represented in the form of a Bayesian network. In the structure-learning stage, the present invention builds a fully connected graph over the attributes of the data table, then computes the mutual information I(X;Y) between any pair of attributes X and Y. Given a threshold λ, if I(X;Y) > λ the edge between X and Y is kept, otherwise it is deleted. This yields a simplified network, on which a complete search is performed to find the Bayesian network structure representing the dependencies between the table's attributes. This ensures the result achieves global optimality and facilitates the construction of high-quality data-quality rules.
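A minimal sketch of the mutual-information pruning step, assuming the table is held in memory as columns of discrete values; the in-memory layout and the `prune_edges` helper are illustrative, and the subsequent complete structure search is not shown:

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in nats from two parallel columns."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def prune_edges(table, lam):
    """Keep only attribute pairs whose mutual information exceeds lambda.
    `table` maps attribute name -> column of values."""
    attrs = list(table)
    return [(a, b) for a, b in combinations(attrs, 2)
            if mutual_information(table[a], table[b]) > lam]
```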
3. Data model translation
The present invention further converts the Bayesian network of dependencies between table attributes, learned during data model learning, into first-order predicate logic. In first-order predicate logic, relation constants capture relationships among multiple elements. The present invention defines the following relation constants:
Equivalence: equal-A(key1, key2) indicates that, in relation table R, the tuple whose primary key is key1 and the tuple whose primary key is key2 have identical values on attribute A.
Matching: match-A(key1, key2) indicates that, in relation table R, the tuple whose primary key is key1 and the tuple whose primary key is key2 have matching values on attribute A.
For each attribute A in relation R, the present invention writes A(key, v) to denote that the tuple with primary key `key` has value v on attribute A.
The present invention builds first-order predicate logic from the learned Bayesian network. Suppose there is a directed edge between table attributes A1 and A2, pointing from A1 to A2; the dependency between A1 and A2 can then be formalized as the following first-order predicate logic:

A1(key1, v) ∧ A1(key2, v) ⇒ equal-A2(key1, key2)

where v is the value of tuples key1 and key2 on attribute A1. If multiple attributes point to one attribute, e.g. attributes A1, A2, …, Ai all point to Aj, then the dependency between them can be formalized as the following first-order predicate logic:

A1(key1, v1) ∧ A1(key2, v1) ∧ … ∧ Ai(key1, vi) ∧ Ai(key2, vi) ⇒ equal-Aj(key1, key2)

where v1, v2, …, vi are the values of tuples key1 and key2 on attributes A1, A2, …, Ai.
Through the above approach, the present invention converts the learned Bayesian network into a corresponding set of first-order predicate logic rules. The conversion is fully automatic: the input is the set of Bayesian-network dependencies and the attribute set Attrs(R) = (A1, …, Am); the output is the predicate set and the rule set. For each dependency, with left part the parent attributes and right part the child attribute, a rule f_j, j ∈ [1, n], is produced: when the left part of the dependency is non-empty, the body LHS(f_j) is obtained as the conjunction of predicates B_k(key1, b_k) ∧ B_k(key2, b_k) over the attributes B_k of the left part, and the head RHS(f_j) is obtained from the right part of the dependency.
Two types of rules are obtained after the conversion: absolute rules, and approximately satisfied, non-absolute rules.
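The edge-to-rule conversion can be illustrated as below; the textual rule syntax and the `edges_to_rules` helper are hypothetical, standing in for the patent's first-order predicate logic output:

```python
def edges_to_rules(edges):
    """Step 3 sketch: turn each Bayesian-network edge (parents -> child)
    into a textual first-order rule: if two tuples agree on every parent
    attribute, their child-attribute values should be equal."""
    rules = []
    for parents, child in edges:
        lhs = " AND ".join(f"{p}(key1, v{i}) AND {p}(key2, v{i})"
                           for i, p in enumerate(parents, 1))
        rules.append(f"{lhs} => equal-{child}(key1, key2)")
    return rules
```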
4. First-order predicate logic weight calculation
For each rule F in the first-order predicate logic rule set, the present invention converts it into a Markov logic network L, where L is a set of pairs (F, w) and w is the weight of F. For an absolute rule, the weight is assigned positive infinity; for an approximately satisfied, non-absolute rule, the weight is computed from the mutual information between the attributes the rule involves. Suppose a first-order logic rule involves two attributes X and Y; the rule weight W_F is computed by the following formula:

W_F = exp{I(X;Y)} − 1    (1)

For a first-order predicate logic rule involving multiple attributes, suppose the rule involves n+1 attributes with mutual information I(X;Y1,Y2,…,Yn); the rule weight W_F is then computed by:

W_F = exp{I(X;Y1,Y2,…,Yn)} − 1    (2)

Here exp is the exponential function, the mutual information between attributes is its exponent, and W_F represents the confidence of the rule. Since I(X;Y) ≥ 0 and I(X;Y1,Y2,…,Yn) ≥ 0, both exp{I(X;Y)} − 1 ≥ 0 and exp{I(X;Y1,Y2,…,Yn)} − 1 ≥ 0 hold, so the computed weight W_F is always nonnegative; meanwhile, the exponential strengthens the rule weight relative to the corresponding mutual information.
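Formulas (1) and (2) reduce to a single mapping from a mutual-information value to a rule weight; a one-line sketch:

```python
import math

def rule_weight(mi):
    """W_F = exp(I) - 1: map a nonnegative mutual information to a
    nonnegative rule weight, amplifying stronger dependencies."""
    return math.exp(mi) - 1.0
```

Note that an independent attribute pair (mutual information 0) yields weight 0, so the corresponding rule exerts no force during inference.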
5. DeepDive rule generation
The present invention converts the Markov logic network into DeepDive rules. In the conversion, the Boolean query of a DeepDive rule is obtained from the first-order predicate logic rules in the Markov logic network together with the relation constants equal and match defined during data model translation; the weight of a DeepDive rule is obtained from the corresponding Markov-logic-network rule weight and represents the confidence of the DeepDive rule.
6. Probabilistic inference of repair values for erroneous/missing data based on DeepDive
In the present invention, the preprocessed data are loaded into a PostgreSQL database and read by DeepDive, which executes the generated DeepDive rules. By traversing the raw data (or the sample drawn from it), the possible worlds of the attributes involved in the Markov logic network L are obtained, and from the possible worlds a finite constant set C is constructed. Combining the Markov logic network L with the finite constant set C yields the corresponding Markov network M_{L,C}.
The Markov logic network L is converted into a factor graph consisting of variable nodes and factor nodes. The variable nodes of the factor graph are obtained from the attributes involved in the rules of M_{L,C}; the factor nodes are functions of the variables connected to them and, in the present invention, are obtained from the first-order predicates involved in the rules of M_{L,C}. Suppose the factor graph contains three variable nodes x1, x2 and x3 and one factor node f1, with connections between the factor node and the variable nodes, as shown in Fig. 2.
After the converted factor graph is obtained, statistical inference is performed on it by Gibbs sampling, a Markov chain Monte Carlo method, yielding the possible values of each tuple attribute and their corresponding probabilities.
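A toy Gibbs sampler over binary variables illustrates the inference step; the real system delegates this to DeepDive's factor-graph engine, so the `factor` interface here is purely illustrative:

```python
import random

def gibbs_marginals(factor, n_vars, iters=5000, seed=0):
    """Minimal Gibbs sampler over binary variables (Step 6 sketch).
    `factor(state)` returns an unnormalized potential for a full
    assignment; we estimate the marginal P(x_i = 1) for each variable."""
    rng = random.Random(seed)
    state = [0] * n_vars
    counts = [0] * n_vars
    for _ in range(iters):
        for i in range(n_vars):
            weights = []
            for v in (0, 1):          # conditional of x_i given the rest
                state[i] = v
                weights.append(factor(state))
            p1 = weights[1] / (weights[0] + weights[1])
            state[i] = 1 if rng.random() < p1 else 0
        for i, v in enumerate(state):  # accumulate marginal counts
            counts[i] += v
    return [c / iters for c in counts]
```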
7. Cleaning of the original dirty data
According to the results of DeepDive's probabilistic inference, the present invention detects the dirty data contained in the dataset. Given the possible values of each tuple attribute and their corresponding probabilities, the value with the highest probability is chosen as the basis for repairing an erroneous value or filling a missing one; the raw data containing dirty data are traversed, erroneous values are replaced and missing values are filled, thereby cleaning the dirty data.
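The repair decision of Step 7 (take the most probable candidate, but only when its probability clears the confidence threshold) can be sketched as follows; `candidates` and `tau` are hypothetical names:

```python
def repair_value(candidates, tau):
    """Step 7 sketch: `candidates` maps candidate attribute values to
    inferred probabilities. Repair with the most probable value only if
    its probability exceeds the threshold tau; otherwise leave the cell
    untouched (return None)."""
    value, prob = max(candidates.items(), key=lambda kv: kv[1])
    return value if prob > tau else None
```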

Claims (7)

1. An automatic data cleaning method based on DeepDive, characterized in that the steps are as follows:
Step 1: data preprocessing: judge the scale of the raw data containing dirty data; if the scale exceeds a threshold, sample the raw data, otherwise keep the raw data as the working data;
Step 2: data model learning: learn the dependencies between attributes from the data obtained in Step 1, capturing the implicit, non-absolute or relatively weak dependencies, and represent them as a Bayesian network;
Step 3: data model translation: define first-order predicates, including the "equivalence" and "matching" predicates, and automatically convert the Bayesian network over attributes obtained in Step 2 into first-order predicate logic rules, obtaining a first-order predicate logic rule set based on the Bayesian network between data attributes;
Step 4: use mutual information theory to compute the weight of each first-order predicate logic rule in the rule set obtained in Step 3, and convert the weighted first-order predicate logic rules into a Markov logic network;
Step 5: generate DeepDive rules from the Markov logic network, the Boolean queries in the rules being obtained from the first-order predicate logic in the Markov logic network, and the rule weights being the weights of the Markov logic network;
Step 6: perform probabilistic inference of erroneous/missing data based on DeepDive, constructing a factor graph as the inference model and estimating by Gibbs sampling the probability of each tuple attribute taking each candidate value;
Step 7: according to the inference results, detect the dirty data contained in the data and compare the maximum probability among the candidate values with a set threshold; if the maximum probability exceeds the threshold, repair automatically, otherwise leave the data unchanged.
2. The data model learning method according to claim 1, characterized in that a fully connected graph over the attributes is built from the data table, the mutual information I(X;Y) between any pair of attributes X and Y is computed, and, given a threshold λ, the edge between X and Y is kept if I(X;Y) > λ and deleted otherwise; a simplified network is thus obtained, on which a complete search is performed to find the Bayesian network structure representing the dependencies between the table's attributes.
3. The data model translation method according to claim 1, characterized in that first-order predicate logic is built from the learned Bayesian network: supposing there is a directed edge between table attributes A1 and A2 pointing from A1 to A2, the dependency between A1 and A2 can be formalized as the first-order predicate logic

A1(key1, v) ∧ A1(key2, v) ⇒ equal-A2(key1, key2)

where v is the value of tuples key1 and key2 on attribute A1;
if multiple attributes point to one attribute, e.g. attributes A1, A2, …, Ai all point to Aj, the dependency between them can be formalized as the first-order predicate logic

A1(key1, v1) ∧ A1(key2, v1) ∧ … ∧ Ai(key1, vi) ∧ Ai(key2, vi) ⇒ equal-Aj(key1, key2)

where v1, v2, …, vi are the values of tuples key1 and key2 on attributes A1, A2, …, Ai;
the learned Bayesian network is thereby converted into a corresponding first-order predicate logic rule set containing two types of rules: absolute rules, and approximately satisfied, non-absolute rules.
4. The first-order predicate logic weight calculation method according to claim 1, characterized in that, supposing a first-order logic rule involves two attributes X and Y, the rule weight W_F is computed by the formula

W_F = exp{I(X;Y)} − 1

and, for a first-order predicate logic rule involving multiple attributes, supposing the rule involves n+1 attributes with mutual information I(X;Y1,Y2,…,Yn), the rule weight W_F is computed by

W_F = exp{I(X;Y1,Y2,…,Yn)} − 1

where exp is the exponential function, the mutual information between attributes is its exponent, and W_F represents the confidence of the rule.
5. The DeepDive rule generation method according to claim 1, characterized in that the Markov logic network is converted into DeepDive rules; in the conversion, the Boolean query of a DeepDive rule is obtained from the first-order predicate logic rules of the Markov logic network and the relation constants equal and match defined during data model translation, and the weight of a DeepDive rule is obtained from the corresponding Markov-logic-network rule weight.
6. The method for probabilistic inference of repair values for erroneous/missing data based on DeepDive according to claim 1, characterized in that, by traversing the raw data or the sample drawn from it, the possible worlds of the attributes involved in the Markov logic network L are obtained; from the possible worlds a finite constant set C is constructed; and combining the Markov logic network L with the finite constant set C yields the corresponding Markov network M_{L,C};
the Markov logic network L is converted into a factor graph consisting of variable nodes and factor nodes, the variable nodes being obtained from the attributes involved in the rules of M_{L,C} and the factor nodes from the first-order predicates involved in the rules of M_{L,C};
after the converted factor graph is obtained, statistical inference is performed on it by Gibbs sampling, a Markov chain Monte Carlo method, yielding the possible values of each tuple attribute and their corresponding probabilities.
7. The original dirty data cleaning method according to claim 1, characterized in that, according to the results of DeepDive's probabilistic inference, the dirty data contained in the dataset are detected; given the possible values of each tuple attribute and their corresponding probabilities, the value with the highest probability is chosen as the basis for repairing an erroneous value or filling a missing one; the raw data containing dirty data are traversed, erroneous values are replaced and missing values are filled, thereby cleaning the dirty data.
CN201910077102.6A 2019-01-28 2019-01-28 An automatic data cleaning method based on DeepDive Pending CN109947752A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910077102.6A CN109947752A (en) 2019-01-28 2019-01-28 An automatic data cleaning method based on DeepDive


Publications (1)

Publication Number Publication Date
CN109947752A 2019-06-28

Family

ID=67006553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910077102.6A Pending CN109947752A (en) An automatic data cleaning method based on DeepDive

Country Status (1)

Country Link
CN (1) CN109947752A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111352928A (en) * 2020-02-27 2020-06-30 哈尔滨工业大学 Data cleaning method using CFDs, computer device and readable storage medium
WO2021098214A1 (en) * 2019-11-19 2021-05-27 平安科技(深圳)有限公司 Data sample obtaining method and apparatus, and electronic device and storage medium
CN114968827A (en) * 2022-08-01 2022-08-30 江铃汽车股份有限公司 Vehicle bus signal information checking method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105447079A (en) * 2015-11-04 2016-03-30 华中科技大学 Data cleaning method based on functional dependency
CN107103000A (en) * 2016-02-23 2017-08-29 广州启法信息科技有限公司 It is a kind of based on correlation rule and the integrated recommended technology of Bayesian network
CN108399226A (en) * 2018-02-12 2018-08-14 安徽千云度信息技术有限公司 A kind of big data cleaning method for digital library
CN109213755A (en) * 2018-09-30 2019-01-15 长安大学 A kind of traffic flow data cleaning and restorative procedure based on Time-space serial


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李卫榜: "Research on Key Technologies of Consistency Management for Distributed Big Data", China Doctoral Dissertations Full-text Database, Science and Technology Information series *


Similar Documents

Publication Publication Date Title
CN109947752A (en) An automatic data cleaning method based on DeepDive
CN112966954B (en) Flood control scheduling scheme optimization method based on time convolution network
CN111310438A (en) Chinese sentence semantic intelligent matching method and device based on multi-granularity fusion model
CN110335168B (en) Method and system for optimizing power utilization information acquisition terminal fault prediction model based on GRU
CN111914550B (en) Knowledge graph updating method and system oriented to limited field
CN113112164A (en) Transformer fault diagnosis method and device based on knowledge graph and electronic equipment
TWI590095B (en) Verification system for software function and verification mathod therefor
Nie et al. 2-tuple linguistic intuitionistic preference relation and its application in sustainable location planning voting system
CN104331523A (en) Conceptual object model-based question searching method
CN107067033A (en) The local route repair method of machine learning model
CN101901251A (en) Method for analyzing and recognizing complex network cluster structure based on markov process metastability
CN113312494A (en) Vertical domain knowledge graph construction method, system, equipment and storage medium
Crook et al. Lossless value directed compression of complex user goal states for statistical spoken dialogue systems
CN110796169A (en) Attribute reduction method for neighborhood decision error rate integration
CN109491991B (en) Unsupervised automatic data cleaning method
CN108536796A (en) A kind of isomery Ontology Matching method and system based on figure
CN113239272B (en) Intention prediction method and intention prediction device of network management and control system
CN113034033B (en) Method for determining variety of newly-researched equipment spare parts
CN110570093B (en) Method and device for automatically managing business expansion channel
CN110990426B (en) RDF query method based on tree search
CN102654865A (en) Method and system for digital object classification
CN117057522B (en) Intelligent construction method and system of cost database
CN115795131B (en) Electronic file classification method and device based on artificial intelligence and electronic equipment
CN108615056A (en) A kind of tree enhancing Naive Bayes Classification method based on decomposable asymmetric choice net score function
Wright et al. Bayesian networks, total variation and robustness

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20190628)