CN109491991B

CN109491991B - Unsupervised automatic data cleaning method

Info

Publication number: CN109491991B
Application number: CN201811325335.5A
Authority: CN
Inventors: 李玲; 唐军; 吴纯彬; 于跃; 陈秋宇
Original assignee: Sichuan Changhong Electric Co Ltd
Current assignee: Sichuan Changhong Electric Co Ltd
Priority date: 2018-11-08
Filing date: 2018-11-08
Publication date: 2022-03-01
Anticipated expiration: 2038-11-08
Also published as: CN109491991A

Abstract

The invention discloses an unsupervised automatic data cleaning method, which comprises the following steps: A. learning a data model, namely learning the dependency relationship among attributes from original data possibly containing invalid data, and finding out the implicit non-absolute or relatively weak dependency relationship to obtain the data model represented in the form of a Bayesian network; B. generating a data cleaning rule; generating a data cleaning rule after obtaining the original data or a complete data model of original data sampling, and specifically generating a predicate and a first-order predicate rule; C. b, generating a Markov logic network based on the predicate generated in the step B and the first-order predicate rule; D. and C, generating a reasoning rule based on the Markov logic network generated in the step C and cleaning data based on a reasoning result. The method can effectively improve the data quality of each business system of the company under the condition of not consuming a large amount of manpower and material resources, and is beneficial to a management layer to make a correct decision.

Description

Unsupervised automatic data cleaning method

Technical Field

The invention relates to the technical field of data management, in particular to an unsupervised automatic data cleaning method.

Background

Real world data is typically cleaned (hereinafter, cleaning is defined as dirty data), as it may contain values such as inconsistent, noisy, incomplete, or repetitive. In the commercial world, erroneous data can cause significant economic losses. For example, incorrect customer information may lead to incorrect delivery of goods purchased by the customer by the company, which not only increases the delivery cost of the enterprise, but also has a significant negative impact on the image of the enterprise over a relatively long period of time.

Among the existing data cleaning methods, some methods need heavy manual participation in the data cleaning process, such as providing suggestions for cleaning or confirming repair; some methods do not need manual participation in the cleaning process, but need to make relevant cleaning rules in advance. The existing data cleaning method is not suitable for the situation that the data rule is unknown or the labor cost is hard to bear. In consideration of the current situation of the current data cleaning method, the method solves the problem that data cleaning is carried out without defining cleaning rules in advance and without manual participation, and improves data quality.

Disclosure of Invention

The invention aims to overcome the defects in the background art, and provides an unsupervised automatic data cleaning method, which is used for learning rules from data based on statistical relationship learning and cleaning the data based on probabilistic reasoning, so that the data cleaning efficiency and effect can be effectively improved, the data quality of each business system of a company can be effectively improved under the condition of not consuming a large amount of manpower and material resources, the satisfaction degree of a user is improved, and meanwhile, a management layer can make a correct decision based on the improved data quality.

In order to achieve the technical effects, the invention adopts the following technical scheme:

an unsupervised automatic data cleaning method comprises the following steps:

A. learning a data model, namely learning the dependency relationship among attributes from original data possibly containing invalid data, and finding out the implicit non-absolute or relatively weak dependency relationship to obtain the data model represented in the form of a Bayesian network;

B. generating a data cleaning rule; generating a data cleaning rule after obtaining the original data or a complete data model of original data sampling, and specifically generating a predicate and a first-order predicate rule, namely a first-order predicate logic expression;

C. b, generating a Markov logic network based on the predicate generated in the step B and the first-order predicate rule;

D. and C, generating a reasoning rule based on the Markov logic network generated in the step C and cleaning data based on a reasoning result.

Further, the step a specifically includes:

A1. evaluating and sampling data to be repaired, namely original data possibly containing invalid data;

A2. learning the original data set or the sampled data set to obtain a structure of a data model expressed in a Bayesian network form; the structure of the bayesian network reflects the dependencies and degrees of dependencies between data attributes,

A3. learning an original data set or a sampled data set to obtain parameters of a data model, wherein the specific form of the parameters is a conditional probability table of a dependency relationship;

A4. and combining the structure of the data model and the parameters of the data model to obtain the complete data model. Further, the step B specifically includes:

B1. defining a relationship constant for representing a relationship between the bodies;

B2. generating a corresponding first-order predicate logic expression according to the complete data model obtained in the step A4: specifically, predicates and first-order predicate rules are generated according to the Bayesian network obtained through learning, and conversion rules for converting the dependency relationship into the first-order predicate logic expression are respectively formulated according to different conditions that a single attribute points to one attribute and a plurality of attributes point to one attribute.

Further, in the step B2;

when a single attribute is pointed to an attribute, attribute A₁And A₂There is a directed edge between and from A₁Point of direction A₂Then A will be₁And A₂The dependency relationship between them is formalized as the first order predicate logic as follows:

where v is tuple id₁And id₂The A attribute value of (1);

when one attribute is pointed to for a plurality of attributes, attribute A₁、A₂、…、A_iPointing simultaneously to A_jThen its dependency is formalized as a first order predicate logic as follows:

wherein v is₁、v₂、…、v_iIs a tuple id₁And id₂At attribute A₁、A₂、…、A_iIs determined. Further, the step C specifically includes:

C1. distinguishing the generated first-order predicate rules into absolute rules and non-absolute rules according to whether the first-order predicate rules are logic effective expressions or not, namely, under any explanation, the probability is 1;

C2. calculating the weight of the first-order predicate logic, which comprises respectively making different weight calculation strategies aiming at absolute rules and non-absolute rules, wherein the weight of the absolute rules is assigned as positive infinity, and the weights of the rules are calculated by using mutual information for the non-absolute rules;

C3. according to the first-order predicate rule generated in the step B2, calculating the weight of the rule based on mutual information between the attributes related to the rule;

C4. from the weight calculation result in step C3, a markov logic network of the original data set or the sampled data set is obtained.

Further, the step C3 specifically includes:

c3.1, aiming at different conditions that a first-order predicate logic rule relates to two attributes and a plurality of attributes, respectively formulating different rule weight calculation methods; wherein the content of the first and second substances,

aiming at the condition that a first-order predicate logic rule relates to two attributes, the mutual information of the two attributes on an original data set or a sampled data set is used for calculating rule weight;

the mutual information is a real number with a value range between 0 and 1, if the attributes are completely correlated, the mutual information is 1, if the attributes are completely uncorrelated, the mutual information is 0, if the rule relates to the two attributes, the mutual information is a statistical average value of joint probability densities of the two attribute variables, and the statistical average value is used as a first-order predicate logic rule to relate to the weights of the two attributes, if the weight is higher, the correlation is strong, and the interpretability is strong; since the first-order predicate logic rule relates to the discrete characteristics of the attributes, the mutual information is defined as:

where P (x, y) is a joint probability distribution function, and P (x) and P (y) are edge probability density functions

C3.2 when the regular weight is calculated, introducing an exponential function for calculation, and ensuring that the weight result is a number not less than 0, wherein the introduced exponential function is a potential function of several attributes characterized by a non-negative real function, is equivalent to the weighted characteristic quantity of several attribute characteristics, and plays a role in normalization, and the formula is as follows:

further, the step D specifically includes:

D1. reasoning is carried out based on the Markov logic network generated in the step C4, rule reasoning is carried out by adopting a Gibbs sampling method in the Markov chain Monte Carlo, and the weight of the Gibbs sampling reasoning rule is determined according to the rule of Gibbs sampling reasoning generated by the Markov logic network;

D2. constructing a Gibbs sampling reasoning model, using a factor graph as the Gibbs sampling reasoning model, and determining variables and factors of the factor graph in the reasoning model, wherein the factors are used for evaluating the relation between the variables;

D3. constructing possible worlds of the variables according to the predicates generated in step B2;

D4. reasoning about the possible world of the predicate of step D3 according to the reasoning model constructed in step D2;

D5. and D4, cleaning and repairing the original data set based on the reasoning result of the step D4.

Further, in the step D5, the maximum value is selected as the value after the repair.

Compared with the prior art, the invention has the following beneficial effects:

the unsupervised automatic data cleaning method is an unsupervised automatic data cleaning method based on statistical relationship learning, manual intervention is not needed when data cleaning is carried out, so that the labor cost of data cleaning can be greatly saved, and meanwhile, because rule discovery is automatically carried out from original data containing dirty data, a data quality rule does not need to be established in advance. The unsupervised automatic data cleaning method can effectively improve the data cleaning effect, improve the data accuracy and improve the data cleaning efficiency.

Drawings

FIG. 1 is a block diagram of the unsupervised data auto-cleaning method of the present invention.

Detailed Description

The invention will be further elucidated and described with reference to the embodiments of the invention described hereinafter.

Example (b):

as shown in fig. 1, an unsupervised automatic data cleaning method can implement data cleaning under the condition of lacking data quality mode/rule and without human intervention, and simultaneously ensure the effect and efficiency of data cleaning.

The method specifically comprises the following steps:

s10, learning of a data model:

to find the implicit patterns/rules, the dependencies between attributes need to be learned from the raw data that may contain invalid data. Since invalid data may exist, absolute or strong dependency relationships between the attributes of the data table do not necessarily exist, and the data model is obtained by finding out implicit non-absolute or relatively weak dependency relationships and representing the dependency relationships in the form of a Bayesian network.

The key flow extracted in the step is as follows:

s101, evaluating and sampling data to be repaired;

s102, learning an original data set or a sampled data set to obtain a structure of a data model expressed in a Bayesian network form, wherein the specific form is the Bayesian network;

s103, learning the original data set or the sampled data set to obtain parameters of a data model, wherein the specific form of the parameters is a conditional probability table of a dependency relationship;

and S104, combining the structures and parameters of the data models in the step S102 and the step S103 to obtain a complete data model.

S20, generating a data cleaning rule:

after a complete data model of the raw data or raw data samples is obtained, generation of data cleansing rules is performed.

The data cleaning rule generation comprises the following main steps:

s201, defining a relation constant. The relation constants contain relations among a plurality of elements and are mainly used for representing relations among the main bodies, and the relation constants such as equivalence, matching and the like need to be defined in the step.

And S202, generating a corresponding first-order predicate logic expression according to the data model.

The Bayesian network is a reflection of the dependency between attributes in the relational table if node N₁Point to N₂Then, it represents N₂To some extent dependent on N₁. Based on the consideration, a first-order predicate logic is constructed according to the Bayesian network obtained by learning.

Assume attribute A₁And A₂There is a directed edge between and from A₁Point of direction A₂Then A can be substituted₁And A₂The dependency relationship between the two is formalized as the following first-order predicate logic expression:

where v is tuple id₁And id₂The a attribute value of (1).

If there are multiple attributes pointing to an attribute, e.g., attribute A₁、A₂、…、A_iPointing simultaneously to A_jThen the dependencies between them can also be formalized as first order predicate logic as follows:

wherein v is₁、v₂、…、v_iIs a tuple id₁And id₂At attribute A₁、A₂、…、A_iIs determined.

In other words, in this step, conversion rules for converting the dependency relationship into the first-order predicate logic expression are respectively formulated for different situations where a single attribute points to one attribute and multiple attributes point to one attribute, and predicates and the first-order predicate rules are automatically generated according to the complete data model obtained in S104.

And S30, generating the Markov logic network based on the predicate generated in the step S202 and the first-order predicate rule.

The markov logic network defines a probability distribution over a possible world, which in the context of data cleansing refers to the possible repair of erroneous data. The markov logic network includes first order predicate logic rules and corresponding weights. The weight is the reflection of the satisfaction degree of the first-order predicate logic, and the larger the weight is, the higher the satisfaction degree of the first-order predicate logic is.

And S301, distinguishing the generated first-order predicate rules into absolute rules and non-absolute rules, specifically, distinguishing the generated first-order predicate rules into the absolute rules and the non-absolute rules according to whether the first-order predicate rules are logic effective expressions or not, namely, the probability is 1 under any explanation.

S302, calculating the weight of the first-order predicate logic.

And respectively making different weight calculation strategies according to the absolute rule and the non-absolute rule. For absolute rules, the weight assignment is positive infinity. Non-absolute rules belong to approximate fulfilment, for which mutual information is used to calculate the weights of these rules. Each approximately satisfied first-order predicate logic rule is a reflection of the dependency relationship between the attributes in the relationship table, and the degree of dependency relationship is expressed by calculating mutual information between the attributes.

And S303, calculating the weight of the rule based on mutual information between the attributes related to the rule according to the first-order predicate rule generated in the step S302.

Aiming at different conditions that a first-order logic rule relates to two attributes and a plurality of attributes, different rule weight calculation methods are respectively formulated.

For the case that a first-order logic rule relates to two attributes, the rule weight is calculated by utilizing mutual information of the two attributes on the original data set or the original data set sample.

Mutual information is a real number with a value range between 0 and 1, and if the attributes are completely correlated, the mutual information is 1, and if the attributes are completely uncorrelated, the mutual information is 0. If the rule relates to two attributes, the mutual information is a statistical average value of the joint probability density of the two attribute variables, and the statistical average value is used as the weight of the first-order predicate logic rule related to the two attributes, and if the weight is higher, the correlation is strong, and the interpretability is strong; since the first-order predicate logic rule relates to the discrete characteristics of the attributes, the mutual information is defined as:

where P (x, y) is a joint probability distribution function, and P (x) and P (y) are edge probability density functions.

When the rule weight is calculated, an exponential function is introduced for calculation, the weight result is ensured to be a number which is more than or equal to 0, and the obtained weight can better reflect the dependency relationship among the attributes. Because the introduced exponential function is a potential function of several attributes characterized by non-negative real functions, which is equivalent to the weighted characteristic quantity of several attribute characteristics, the normalization is performed, and the formula is as follows:

meanwhile, with the increase of mutual information, the weight is increased exponentially, so that the effect of a high-weight rule in the data cleaning process can be increased, and the data cleaning effect is improved.

S304, according to the weight calculation result in the step S303, the original data or the Markov logic network of the original data sampling is automatically obtained.

And S40, generating a reasoning rule based on the Markov logic network generated in the step S304 and cleaning data based on a reasoning result.

The method specifically comprises the following steps:

s401, reasoning is carried out based on the Markov logic network generated in the step S304, and regular reasoning is carried out by adopting a Gibbs sampling method in the Markov chain Monte Carlo. And determining the weight of the Gibbs sampling inference rule according to the Gibbs sampling inference rule generated by the Markov logic network.

S402, constructing a Gibbs sampling reasoning model.

The factor graph is used as a gibbs sampling inference model. Variables and factors of a factor graph in the inference model are determined, and the factors are used for evaluating the relation between the variables.

S403, constructing possible worlds of variables based on the predicates generated in the step S202, wherein the possible worlds are the basis of reasoning.

S404, reasoning is carried out on the possible world of the predicate of the step S403 based on the reasoning model constructed in the step S402.

S405, cleaning and repairing the original data set based on the inference result of the step S404. For each data to be repaired, the expected maximum value is selected as the repaired value.

In summary, the unsupervised automatic data cleaning method is an unsupervised automatic data cleaning method based on statistical relationship learning, manual intervention is not needed during data cleaning, so that the labor cost of data cleaning can be greatly saved, and meanwhile, the rule discovery is automatically carried out from original data containing dirty data, so that the data quality rule does not need to be established in advance. The unsupervised automatic data cleaning method can effectively improve the data cleaning effect, improve the data accuracy and improve the data cleaning efficiency.

It will be understood that the above embodiments are merely exemplary embodiments taken to illustrate the principles of the present invention, which is not limited thereto. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit and substance of the invention, and these modifications and improvements are also considered to be within the scope of the invention.

Claims

1. An unsupervised automatic data cleaning method is characterized by comprising the following steps:

B. generating a data cleaning rule; generating a data cleaning rule after obtaining the original data or a complete data model of original data sampling, and specifically generating a predicate and a first-order predicate rule;

2. The unsupervised automatic data cleaning method according to claim 1, wherein the step a specifically comprises:

A2. learning the original data set or the sampled data set to obtain a structure of a data model expressed in a Bayesian network form;

A4. and combining the structure of the data model and the parameters of the data model to obtain the complete data model.

3. The unsupervised automatic data cleaning method according to claim 2, wherein the step B specifically comprises:

B2. generating a corresponding first-order predicate logic expression according to the complete data model obtained in the step A4: specifically, the method comprises the steps of generating predicates and first-order predicate rules, namely first-order predicate logic expressions, according to the Bayesian network obtained through learning, and respectively formulating conversion rules for converting the dependency relationship into the first-order predicate logic expressions according to different conditions that a single attribute points to one attribute and a plurality of attributes point to one attribute.

4. An unsupervised automatic data washing method as claimed in claim 3, wherein in step B2;

where v is tuple id₁And id₂The A attribute value of (1);

5. The unsupervised automatic data cleaning method according to claim 3, wherein the step C specifically comprises:

C1. distinguishing the generated first-order predicate rules into absolute rules and non-absolute rules;

6. The unsupervised automatic data washing method according to claim 5, wherein the step C3 specifically includes:

c3.1, aiming at the condition that a first-order predicate logic rule relates to two attributes, utilizing mutual information of the two attributes on an original data set or a sampled data set to calculate rule weight;

the mutual information is a real number with a value range between 0 and 1, if the attributes are completely correlated, the mutual information is 1, and if the attributes are completely uncorrelated, the mutual information is 0;

and C3.2, when the rule weight is calculated, introducing an exponential function for calculation, and ensuring that the weight result is a number not less than 0.

7. The unsupervised automatic data cleaning method according to claim 6, wherein the step D specifically comprises:

8. The method according to claim 7, wherein the repairing in step D5 is performed by selecting the expected maximum value as the repaired value.