CN109947752A - An automatic data cleaning method based on DeepDive - Google Patents


Info

Publication number
CN109947752A
CN109947752A
Authority
CN
China
Prior art keywords
data
attribute
rule
network
deepdive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910077102.6A
Other languages
Chinese (zh)
Inventor
李卫榜
李玲
谈文蓉
崔梦天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Minzu University
Original Assignee
Southwest Minzu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Minzu University filed Critical Southwest Minzu University
Priority to CN201910077102.6A priority Critical patent/CN109947752A/en
Publication of CN109947752A publication Critical patent/CN109947752A/en
Pending legal-status Critical Current


Abstract

The invention discloses an automatic data cleaning method based on DeepDive, comprising: (1) comparing the scale of the raw data against a given threshold and, if the scale exceeds the threshold, obtaining sampled data by random sampling; (2) learning a Bayesian network over the attributes from the raw or sampled data; (3) converting the learned Bayesian network into first-order predicate logic rules; (4) computing the weight of each first-order predicate logic rule using mutual information theory and converting the weighted rules into a Markov logic network; (5) generating DeepDive rules from the Markov logic network; (6) performing probabilistic inference over erroneous/missing data with DeepDive to obtain, for each tuple attribute, the probability of taking each candidate value; (7) cleaning the original dirty data according to the inference results. The invention can clean data automatically without ready-made data-quality patterns/rules and without human intervention, and can effectively improve the efficiency and quality of data cleaning.

Description

An automatic data cleaning method based on DeepDive
Technical field
The present invention relates to the technical field of data processing, and in particular to an automatic data cleaning method based on DeepDive.
Background technique
Dirty data is pervasive in the real world, and cleaning it is a long-standing problem. In the big-data era, the importance of data cleaning is even more pronounced. Detecting and repairing errors in dirty data is one of the major challenges facing data analysis: low data quality can greatly degrade the accuracy of analysis results. In general, data cleaning comprises two stages: the first is error detection, which finds the erroneous or abnormal data contained in a dataset; the second is error repair, which fixes those errors. Most existing data cleaning methods rely on existing constraint rules or patterns for detection and repair, and suffer from the following limitations. (1) Existing methods usually require manual participation or additional information to detect and repair errors. This is workable when data are small, but data nowadays grow exponentially, so these methods are not suitable for massive data; moreover, their labor cost is considerable and grows with data scale. (2) Other methods depend on off-the-shelf patterns/rules for repair, yet for many real datasets and application scenarios obtaining such patterns/rules is not necessarily feasible. (3) Some methods learn patterns/rules from clean data, or directly from data containing dirty records. For particular scenarios, building a data cleaning corpus usually requires user involvement or applies only to a specific application; and although such methods can learn patterns/rules, the resulting cleaning quality is often unsatisfactory.
Summary of the invention
Technical problem to be solved
To achieve automatic data cleaning in the absence of ready-made patterns/rules and human participation, the present invention proposes an automatic data cleaning method based on DeepDive, so as to solve the aforementioned problems in the prior art.
Technical solution
An automatic data cleaning method based on DeepDive, characterized in that the steps are as follows:
Step 1: data preprocessing. Judge the scale of the raw data containing dirty data; if the scale exceeds a threshold, sample the raw data, otherwise keep the raw data as the working data.
Step 2: data model learning. Learn the dependencies between attributes from the data obtained in Step 1, capturing the implicit, non-absolute or relatively weak dependencies, and represent them as a Bayesian network.
Step 3: data model translation. Define first-order predicates, including the "equivalence" and "matching" predicates, and automatically convert the Bayesian network over attributes obtained in Step 2 into first-order predicate logic rules, obtaining a first-order predicate logic rule set based on the Bayesian network between data attributes.
Step 4: use mutual information theory to compute the weight of each rule in the first-order predicate logic rule set obtained in Step 3, and convert the weighted first-order predicate logic rules into a Markov logic network.
Step 5: generate DeepDive rules from the Markov logic network; the Boolean queries in the rules are obtained from the first-order predicate logic in the Markov logic network, and the rule weights are the weights of the Markov logic network.
Step 6: perform probabilistic inference of erroneous/missing data based on DeepDive, constructing a factor graph as the inference model and estimating, by Gibbs sampling, the probability of each tuple attribute taking each candidate value.
Step 7: according to the inference results, detect the dirty data contained in the data and compare the maximum probability among the candidate values with a set threshold; if the maximum probability exceeds the threshold, repair automatically, otherwise leave the data unchanged.
Beneficial effects
The automatic data cleaning method based on DeepDive proposed by the present invention learns, depending on the data scale, a Bayesian network over the data attributes from the raw data or from a sample of it; this network reflects the dependencies between different attributes in the data. By defining first-order predicates, the learned Bayesian network over the attributes is converted into first-order predicate logic rules; the weight of each rule is computed from the raw data via mutual information, yielding a Markov logic network with one weight per first-order predicate logic rule; DeepDive rules are generated from the Markov logic network; probabilistic inference of erroneous/missing data is performed with DeepDive over the raw or sampled data, estimating by Gibbs sampling the probability of each tuple attribute taking each candidate value; according to the inference results, the most probable value is selected as the repair value for an attribute, erroneous data are repaired and missing data are filled, thereby completing the automatic cleaning of the data. The present invention therefore requires neither ready-made data cleaning models/rules nor human intervention when cleaning data. On one hand, the relationships between data attributes are modeled probabilistically, so a certain proportion of dirty data in the training data can be tolerated; on the other hand, the application of statistical relational learning and inference helps uncover latent errors in the data and improves the cleaning result. The present invention solves the prior-art problems of requiring ready-made data-quality patterns/rules or human intervention and, because the cleaning is fully automatic, it is better suited to cleaning dirty data at massive scale, effectively improving the efficiency and quality of data cleaning.
Brief description of the drawings
Fig. 1: flow chart of the automatic data cleaning method based on DeepDive
Fig. 2: a factor graph containing three variable nodes and one factor node
Specific embodiment
The invention will now be further described with reference to the embodiments and the accompanying drawings:
The invention proposes an automatic data cleaning method based on DeepDive; its flow chart is shown in Fig. 1. The technical solution adopted to solve the technical problem comprises the following contents:
1. Data preprocessing
Set a threshold τ on the scale of the data to be cleaned and compute the scale n of that data, i.e. the number of tuples it contains. If n > τ, randomly sample the raw data; otherwise keep the raw data as the working data.
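The preprocessing decision above can be sketched as follows; `tau` and `sample_size` are illustrative parameter names, since the patent does not fix concrete values:

```python
import random

def preprocess(tuples, tau, sample_size):
    """Step 1 sketch: if the dataset's scale exceeds the threshold tau,
    draw a uniform random sample of sample_size tuples; otherwise keep
    the raw data unchanged."""
    if len(tuples) > tau:
        return random.sample(tuples, sample_size)
    return tuples
```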
2. Data model learning
In the absence of ready-made data cleaning patterns/rules, data model learning is performed on the data obtained after preprocessing to discover the non-absolute or relatively weak dependencies implicit in the data, represented in the form of a Bayesian network. In the structure-learning stage, the present invention builds a fully connected graph over the attributes of the data table, then computes the mutual information I(X;Y) between any pair of attributes X and Y. Given a threshold λ, if I(X;Y) > λ the edge between X and Y is kept, otherwise it is deleted. This yields a simplified network, on which a complete search is performed to find the Bayesian network structure representing the dependencies between the table's attributes. This ensures the result achieves global optimality and facilitates the construction of high-quality data-quality rules.
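A minimal sketch of the mutual-information pruning step, assuming the table is held in memory as columns of discrete values; the in-memory layout and the `prune_edges` helper are illustrative, and the subsequent complete structure search is not shown:

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in nats from two parallel columns."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def prune_edges(table, lam):
    """Keep only attribute pairs whose mutual information exceeds lambda.
    `table` maps attribute name -> column of values."""
    attrs = list(table)
    return [(a, b) for a, b in combinations(attrs, 2)
            if mutual_information(table[a], table[b]) > lam]
```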
3. Data model translation
The present invention further converts the Bayesian network of dependencies between table attributes, learned during data model learning, into first-order predicate logic. In first-order predicate logic, relation constants capture relationships among multiple elements. The present invention defines the following relation constants:
Equivalence: equal-A(key1, key2) indicates that, in relation table R, the tuple whose primary key is key1 and the tuple whose primary key is key2 have identical values on attribute A.
Matching: match-A(key1, key2) indicates that, in relation table R, the tuple whose primary key is key1 and the tuple whose primary key is key2 have matching values on attribute A.
For each attribute A in relation R, the present invention writes A(key, v) to denote that the tuple with primary key `key` has value v on attribute A.
The present invention builds first-order predicate logic from the learned Bayesian network. Suppose there is a directed edge between table attributes A1 and A2, pointing from A1 to A2; the dependency between A1 and A2 can then be formalized as the following first-order predicate logic:

A1(key1, v) ∧ A1(key2, v) ⇒ equal-A2(key1, key2)

where v is the value of tuples key1 and key2 on attribute A1. If multiple attributes point to one attribute, e.g. attributes A1, A2, …, Ai all point to Aj, then the dependency between them can be formalized as the following first-order predicate logic:

A1(key1, v1) ∧ A1(key2, v1) ∧ … ∧ Ai(key1, vi) ∧ Ai(key2, vi) ⇒ equal-Aj(key1, key2)

where v1, v2, …, vi are the values of tuples key1 and key2 on attributes A1, A2, …, Ai.
Through the above approach, the present invention converts the learned Bayesian network into a corresponding set of first-order predicate logic rules. The conversion is fully automatic: the input is the set of Bayesian-network dependencies and the attribute set Attrs(R) = (A1, …, Am); the output is the predicate set and the rule set. For each dependency, with left part the parent attributes and right part the child attribute, a rule f_j, j ∈ [1, n], is produced: when the left part of the dependency is non-empty, the body LHS(f_j) is obtained as the conjunction of predicates B_k(key1, b_k) ∧ B_k(key2, b_k) over the attributes B_k of the left part, and the head RHS(f_j) is obtained from the right part of the dependency.
Two types of rules are obtained after the conversion: absolute rules, and approximately satisfied, non-absolute rules.
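The edge-to-rule conversion can be illustrated as below; the textual rule syntax and the `edges_to_rules` helper are hypothetical, standing in for the patent's first-order predicate logic output:

```python
def edges_to_rules(edges):
    """Step 3 sketch: turn each Bayesian-network edge (parents -> child)
    into a textual first-order rule: if two tuples agree on every parent
    attribute, their child-attribute values should be equal."""
    rules = []
    for parents, child in edges:
        lhs = " AND ".join(f"{p}(key1, v{i}) AND {p}(key2, v{i})"
                           for i, p in enumerate(parents, 1))
        rules.append(f"{lhs} => equal-{child}(key1, key2)")
    return rules
```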
4. First-order predicate logic weight calculation
For each rule F in the first-order predicate logic rule set, the present invention converts it into a Markov logic network L, where L is a set of pairs (F, w) and w is the weight of F. For an absolute rule, the weight is assigned positive infinity; for an approximately satisfied, non-absolute rule, the weight is computed from the mutual information between the attributes the rule involves. Suppose a first-order logic rule involves two attributes X and Y; the rule weight W_F is computed by the following formula:

W_F = exp{I(X;Y)} − 1    (1)

For a first-order predicate logic rule involving multiple attributes, suppose the rule involves n+1 attributes with mutual information I(X;Y1,Y2,…,Yn); the rule weight W_F is then computed by:

W_F = exp{I(X;Y1,Y2,…,Yn)} − 1    (2)

Here exp is the exponential function, the mutual information between attributes is its exponent, and W_F represents the confidence of the rule. Since I(X;Y) ≥ 0 and I(X;Y1,Y2,…,Yn) ≥ 0, both exp{I(X;Y)} − 1 ≥ 0 and exp{I(X;Y1,Y2,…,Yn)} − 1 ≥ 0 hold, so the computed weight W_F is always nonnegative; meanwhile, the exponential strengthens the rule weight relative to the corresponding mutual information.
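Formulas (1) and (2) reduce to a single mapping from a mutual-information value to a rule weight; a one-line sketch:

```python
import math

def rule_weight(mi):
    """W_F = exp(I) - 1: map a nonnegative mutual information to a
    nonnegative rule weight, amplifying stronger dependencies."""
    return math.exp(mi) - 1.0
```

Note that an independent attribute pair (mutual information 0) yields weight 0, so the corresponding rule exerts no force during inference.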
5. DeepDive rule generation
The present invention converts the Markov logic network into DeepDive rules. In the conversion, the Boolean query of a DeepDive rule is obtained from the first-order predicate logic rules in the Markov logic network together with the relation constants equal and match defined during data model translation; the weight of a DeepDive rule is obtained from the corresponding Markov-logic-network rule weight and represents the confidence of the DeepDive rule.
6. Probabilistic inference of repair values for erroneous/missing data based on DeepDive
In the present invention, the preprocessed data are loaded into a PostgreSQL database and read by DeepDive, which executes the generated DeepDive rules. By traversing the raw data (or the sample drawn from it), the possible worlds of the attributes involved in the Markov logic network L are obtained, and from the possible worlds a finite constant set C is constructed. Combining the Markov logic network L with the finite constant set C yields the corresponding Markov network M_{L,C}.
The Markov logic network L is converted into a factor graph consisting of variable nodes and factor nodes. The variable nodes of the factor graph are obtained from the attributes involved in the rules of M_{L,C}; the factor nodes are functions of the variables connected to them and, in the present invention, are obtained from the first-order predicates involved in the rules of M_{L,C}. Suppose the factor graph contains three variable nodes x1, x2 and x3 and one factor node f1, with connections between the factor node and the variable nodes, as shown in Fig. 2.
After the converted factor graph is obtained, statistical inference is performed on it by Gibbs sampling, a Markov chain Monte Carlo method, yielding the possible values of each tuple attribute and their corresponding probabilities.
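A toy Gibbs sampler over binary variables illustrates the inference step; the real system delegates this to DeepDive's factor-graph engine, so the `factor` interface here is purely illustrative:

```python
import random

def gibbs_marginals(factor, n_vars, iters=5000, seed=0):
    """Minimal Gibbs sampler over binary variables (Step 6 sketch).
    `factor(state)` returns an unnormalized potential for a full
    assignment; we estimate the marginal P(x_i = 1) for each variable."""
    rng = random.Random(seed)
    state = [0] * n_vars
    counts = [0] * n_vars
    for _ in range(iters):
        for i in range(n_vars):
            weights = []
            for v in (0, 1):          # conditional of x_i given the rest
                state[i] = v
                weights.append(factor(state))
            p1 = weights[1] / (weights[0] + weights[1])
            state[i] = 1 if rng.random() < p1 else 0
        for i, v in enumerate(state):  # accumulate marginal counts
            counts[i] += v
    return [c / iters for c in counts]
```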
7. Cleaning of the original dirty data
According to the results of DeepDive's probabilistic inference, the present invention detects the dirty data contained in the dataset. Given the possible values of each tuple attribute and their corresponding probabilities, the value with the highest probability is chosen as the basis for repairing an erroneous value or filling a missing one; the raw data containing dirty data are traversed, erroneous values are replaced and missing values are filled, thereby cleaning the dirty data.
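The repair decision of Step 7 (take the most probable candidate, but only when its probability clears the confidence threshold) can be sketched as follows; `candidates` and `tau` are hypothetical names:

```python
def repair_value(candidates, tau):
    """Step 7 sketch: `candidates` maps candidate attribute values to
    inferred probabilities. Repair with the most probable value only if
    its probability exceeds the threshold tau; otherwise leave the cell
    untouched (return None)."""
    value, prob = max(candidates.items(), key=lambda kv: kv[1])
    return value if prob > tau else None
```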

Claims (7)

1. An automatic data cleaning method based on DeepDive, characterized in that the steps are as follows:
Step 1: data preprocessing: judge the scale of the raw data containing dirty data; if the scale exceeds a threshold, sample the raw data, otherwise keep the raw data as the working data;
Step 2: data model learning: learn the dependencies between attributes from the data obtained in Step 1, capturing the implicit, non-absolute or relatively weak dependencies, and represent them as a Bayesian network;
Step 3: data model translation: define first-order predicates, including the "equivalence" and "matching" predicates, and automatically convert the Bayesian network over attributes obtained in Step 2 into first-order predicate logic rules, obtaining a first-order predicate logic rule set based on the Bayesian network between data attributes;
Step 4: use mutual information theory to compute the weight of each first-order predicate logic rule in the rule set obtained in Step 3, and convert the weighted first-order predicate logic rules into a Markov logic network;
Step 5: generate DeepDive rules from the Markov logic network, the Boolean queries in the rules being obtained from the first-order predicate logic in the Markov logic network, and the rule weights being the weights of the Markov logic network;
Step 6: perform probabilistic inference of erroneous/missing data based on DeepDive, constructing a factor graph as the inference model and estimating by Gibbs sampling the probability of each tuple attribute taking each candidate value;
Step 7: according to the inference results, detect the dirty data contained in the data and compare the maximum probability among the candidate values with a set threshold; if the maximum probability exceeds the threshold, repair automatically, otherwise leave the data unchanged.
2. The data model learning method according to claim 1, characterized in that a fully connected graph over the attributes is built from the data table, the mutual information I(X;Y) between any pair of attributes X and Y is computed, and, given a threshold λ, the edge between X and Y is kept if I(X;Y) > λ and deleted otherwise; a simplified network is thus obtained, on which a complete search is performed to find the Bayesian network structure representing the dependencies between the table's attributes.
3. The data model translation method according to claim 1, characterized in that first-order predicate logic is built from the learned Bayesian network: supposing there is a directed edge between table attributes A1 and A2 pointing from A1 to A2, the dependency between A1 and A2 can be formalized as the first-order predicate logic

A1(key1, v) ∧ A1(key2, v) ⇒ equal-A2(key1, key2)

where v is the value of tuples key1 and key2 on attribute A1;
if multiple attributes point to one attribute, e.g. attributes A1, A2, …, Ai all point to Aj, the dependency between them can be formalized as the first-order predicate logic

A1(key1, v1) ∧ A1(key2, v1) ∧ … ∧ Ai(key1, vi) ∧ Ai(key2, vi) ⇒ equal-Aj(key1, key2)

where v1, v2, …, vi are the values of tuples key1 and key2 on attributes A1, A2, …, Ai;
the learned Bayesian network is thereby converted into a corresponding first-order predicate logic rule set containing two types of rules: absolute rules, and approximately satisfied, non-absolute rules.
4. The first-order predicate logic weight calculation method according to claim 1, characterized in that, supposing a first-order logic rule involves two attributes X and Y, the rule weight W_F is computed by the formula

W_F = exp{I(X;Y)} − 1

and, for a first-order predicate logic rule involving multiple attributes, supposing the rule involves n+1 attributes with mutual information I(X;Y1,Y2,…,Yn), the rule weight W_F is computed by

W_F = exp{I(X;Y1,Y2,…,Yn)} − 1

where exp is the exponential function, the mutual information between attributes is its exponent, and W_F represents the confidence of the rule.
5. The DeepDive rule generation method according to claim 1, characterized in that the Markov logic network is converted into DeepDive rules; in the conversion, the Boolean query of a DeepDive rule is obtained from the first-order predicate logic rules of the Markov logic network and the relation constants equal and match defined during data model translation, and the weight of a DeepDive rule is obtained from the corresponding Markov-logic-network rule weight.
6. The method for probabilistic inference of repair values for erroneous/missing data based on DeepDive according to claim 1, characterized in that, by traversing the raw data or the sample drawn from it, the possible worlds of the attributes involved in the Markov logic network L are obtained; from the possible worlds a finite constant set C is constructed; and combining the Markov logic network L with the finite constant set C yields the corresponding Markov network M_{L,C};
the Markov logic network L is converted into a factor graph consisting of variable nodes and factor nodes, the variable nodes being obtained from the attributes involved in the rules of M_{L,C} and the factor nodes from the first-order predicates involved in the rules of M_{L,C};
after the converted factor graph is obtained, statistical inference is performed on it by Gibbs sampling, a Markov chain Monte Carlo method, yielding the possible values of each tuple attribute and their corresponding probabilities.
7. The original dirty data cleaning method according to claim 1, characterized in that, according to the results of DeepDive's probabilistic inference, the dirty data contained in the dataset are detected; given the possible values of each tuple attribute and their corresponding probabilities, the value with the highest probability is chosen as the basis for repairing an erroneous value or filling a missing one; the raw data containing dirty data are traversed, erroneous values are replaced and missing values are filled, thereby cleaning the dirty data.
CN201910077102.6A 2019-01-28 2019-01-28 An automatic data cleaning method based on DeepDive Pending CN109947752A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910077102.6A CN109947752A (en) 2019-01-28 2019-01-28 An automatic data cleaning method based on DeepDive


Publications (1)

Publication Number Publication Date
CN109947752A 2019-06-28

Family

ID=67006553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910077102.6A Pending CN109947752A (en) An automatic data cleaning method based on DeepDive

Country Status (1)

Country Link
CN (1) CN109947752A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111352928A (en) * 2020-02-27 2020-06-30 哈尔滨工业大学 Data cleaning method using CFDs, computer device and readable storage medium
WO2021098214A1 (en) * 2019-11-19 2021-05-27 平安科技(深圳)有限公司 Data sample obtaining method and apparatus, and electronic device and storage medium
CN114968827A (en) * 2022-08-01 2022-08-30 江铃汽车股份有限公司 Vehicle bus signal information checking method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105447079A (en) * 2015-11-04 2016-03-30 华中科技大学 Data cleaning method based on functional dependency
CN107103000A (en) * 2016-02-23 2017-08-29 广州启法信息科技有限公司 It is a kind of based on correlation rule and the integrated recommended technology of Bayesian network
CN108399226A (en) * 2018-02-12 2018-08-14 安徽千云度信息技术有限公司 A kind of big data cleaning method for digital library
CN109213755A (en) * 2018-09-30 2019-01-15 长安大学 A kind of traffic flow data cleaning and restorative procedure based on Time-space serial


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李卫榜: "Research on Key Technologies of Consistency Management for Distributed Big Data", China Doctoral Dissertations Full-text Database, Science and Technology Information series *


Similar Documents

Publication Publication Date Title
CN109947752A (en) An automatic data cleaning method based on DeepDive
CN112966954B (en) Flood control scheduling scheme optimization method based on time convolution network
CN111310438A (en) Chinese sentence semantic intelligent matching method and device based on multi-granularity fusion model
CN110335168B (en) Method and system for optimizing power utilization information acquisition terminal fault prediction model based on GRU
CN111914550B (en) Knowledge graph updating method and system oriented to limited field
CN113112164A (en) Transformer fault diagnosis method and device based on knowledge graph and electronic equipment
TWI590095B (en) Verification system for software function and verification mathod therefor
Nie et al. 2-tuple linguistic intuitionistic preference relation and its application in sustainable location planning voting system
CN104331523A (en) Conceptual object model-based question searching method
CN107067033A (en) The local route repair method of machine learning model
CN101901251A (en) Method for analyzing and recognizing complex network cluster structure based on markov process metastability
CN113312494A (en) Vertical domain knowledge graph construction method, system, equipment and storage medium
Crook et al. Lossless value directed compression of complex user goal states for statistical spoken dialogue systems
CN110796169A (en) Attribute reduction method for neighborhood decision error rate integration
CN109491991B (en) Unsupervised automatic data cleaning method
CN108536796A (en) A kind of isomery Ontology Matching method and system based on figure
CN113239272B (en) Intention prediction method and intention prediction device of network management and control system
CN113034033B (en) Method for determining variety of newly-researched equipment spare parts
CN110570093B (en) Method and device for automatically managing business expansion channel
CN110990426B (en) RDF query method based on tree search
CN102654865A (en) Method and system for digital object classification
CN117057522B (en) Intelligent construction method and system of cost database
CN115795131B (en) Electronic file classification method and device based on artificial intelligence and electronic equipment
CN108615056A (en) A kind of tree enhancing Naive Bayes Classification method based on decomposable asymmetric choice net score function
Wright et al. Bayesian networks, total variation and robustness

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20190628)