CN109634949B - Mixed data cleaning method based on multiple data versions - Google Patents

Mixed data cleaning method based on multiple data versions

Info

Publication number
CN109634949B
Authority
CN
China
Prior art keywords
data
cleaning
versions
stage
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811628044.3A
Other languages
Chinese (zh)
Other versions
CN109634949A (en)
Inventor
高云君
陈刚
陈纯
葛丛丛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201811628044.3A priority Critical patent/CN109634949B/en
Publication of CN109634949A publication Critical patent/CN109634949A/en
Application granted granted Critical
Publication of CN109634949B publication Critical patent/CN109634949B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a mixed data cleaning method based on multiple data versions. The method uses a Markov logic network probabilistic graphical model together with the minimum-repair principle, combining qualitative and quantitative techniques into an efficient data cleaning method that detects and corrects erroneous structured data. The cleaning result removes dirty data that violates the rule constraints, satisfies the requirement of minimum change cost to the data set, and conforms to the statistical characteristics of the data. The invention first divides the entire data set into blocks and groups according to the Markov logic indexing technique, and then performs a two-stage data cleaning. In the first stage, the data in each group are cleaned under a credibility-score evaluation criterion, yielding multi-version data cleaning results; in the second stage, the multi-version results generated in the previous stage are fused under a fusion-score evaluation criterion, producing the final unified cleaning result.

Description

Mixed data cleaning method based on multiple data versions
Technical Field
The invention relates to a technology for cleaning erroneous data in the field of computer databases, in particular to a mixed data cleaning method based on multiple data versions.
Background
The purpose of data cleaning is to find the content in a data set that is most likely to be erroneous and to provide a reliable method of correcting it. Dirty data refers to the erroneous data present in a data set.
Nowadays, with the continuous emergence of new information publishing modes represented by social networks and electronic commerce, and with the rise of cloud computing and Internet-of-Things technologies, data is growing and accumulating at an unprecedented speed. In data analysis, the presence of dirty data not only leads to wrong decisions and unreliable analysis, but can also inflict economic losses on a company. Data cleaning has therefore attracted great interest in both industry and academia. Data cleaning is the process of detecting and repairing erroneous data; it aims to delete redundant information, correct existing errors, and maintain data consistency.
At present, scholars at home and abroad have produced a body of work on data cleaning methods. Mainstream methods can be roughly divided into two types, qualitative and quantitative: (1) qualitative methods mainly clean erroneous data that violates integrity constraint rules; their evaluation standard is the minimum-cost principle, i.e., minimizing the change that cleaning makes to the data set. Their drawback is that they cannot clean erroneous data that fails the minimum-cost principle, even though that data still violates the integrity constraints. (2) Quantitative methods construct a suitable model based on the probability distribution of the data to decide a cleaning strategy. Their drawbacks are a strong dependence on the training set: enough clean, known data must be provided to build a reliable model, which is ill-suited to the current big-data environment. Moreover, most current quantitative methods clean data less accurately than qualitative methods, and their running time is longer.
Disclosure of Invention
In order to overcome these defects, the invention provides a mixed data cleaning method based on multiple data versions. The method is based on a Markov logic network. First, the whole data set is divided into blocks and groups according to the Markov logic index technique; then two-stage data cleaning is carried out. In the first stage, each block is cleaned independently, yielding multi-version data cleaning results; in the second stage, based on the multi-version data results, the conflicts are eliminated, yielding the final overall unified cleaning result. The Markov logic index technique reduces the detection range of dirty data, so data cleaning can be performed efficiently.
In order to achieve this purpose, the technical scheme adopted by the invention is as follows: a mixed data cleaning method based on multiple data versions, comprising the following steps:
(1) obtaining a dirty data set and its associated integrity constraint rules (ICs);
(2) converting different types of integrity constraint rules into Markov logic network standardization rules, and instantiating the converted standardization rules by using constants contained in tuples in a dirty data set, wherein each instantiation rule is called a data slice;
(3) establishing a Markov logic index structure for a dirty data set, dividing the dirty data set into different data blocks according to rules, wherein each rule corresponds to one data block, the minimum unit in each data block is a data slice, and then dividing each data block into different data groups again;
(4) on the basis of step (3), performing the first-stage cleaning: introducing the credibility-score evaluation criterion and cleaning each data block independently to obtain the data versions of a plurality of preliminary cleaning results;
(5) executing the second-stage cleaning: introducing the fusion-score evaluation criterion, fusing the data versions of the plurality of preliminary cleaning results generated in the first stage, eliminating the conflicts among the multiple versions, and generating the final unified cleaning result;
(6) marking the repeated items existing in the dirty data set, and deleting the repeated data still present after the two-stage cleaning;
(7) outputting the data set after data cleaning.
Further, the step (2) is specifically as follows:
(2.1) normalizing the input different types of integrity constraints into Markov logic network rules through conjunctive normal form conversion rules;
(2.2) replace all variables in the normalized rule with the corresponding constants of the data set.
Further, the step (3) is specifically as follows:
(3.1) dividing the whole dirty data set into a plurality of data blocks according to integrity constraint rules contained in the dirty data set, wherein each rule corresponds to one data block, and each data block contains a plurality of data slices;
(3.2) in each data block, dividing the entries whose reason attributes share the same key into the same group; the key is the reason item of the rule, and the data slices with the same reason are divided into one group.
Further, the step (4) is specifically as follows:
(4.1) processing abnormal data: the phenomenon in which a data slice is divided into an incorrect group because a data error occurs in its reason item is called an "anomaly"; the erroneous data slices are then re-divided into their correct groups;
(4.2) calculating a reliability score (reliability score) of abnormal data in each group according to a similarity distance measurement method and a Markov logic network weight learning method;
(4.3) cleaning each data block independently: the cleaning unit is each group in the data block; the data slice γ with the largest credibility score is selected as the replacement reference, and the other suspicious data belonging to the same data group are replaced with it, until every data group in the data block has been cleaned, which completes the independent cleaning of that data block;
similarly, the cleaning is performed on the other data blocks; the plurality of preliminary cleaning results obtained in this stage are regarded as a plurality of data versions, where each data block yields one data version.
Further, the step (5) is specifically as follows:
(5.1) first, recording the data slices of all the different data versions at each position where a conflict occurs, each serving as a reference; then, starting from each reference, finding in every data block other than the reference's own block the data slice that does not conflict with the reference and has the largest Markov weight, and merging it with the reference;
(5.2) repeatedly executing the merging operation until all the data blocks have been traversed; then calculating the fusion score f-score(t) = w1 × … × wm of the merged result under this reference, where wi denotes the Markov weight of the merged data slice in the i-th data block;
(5.3) selecting another reference as the start, executing the merging operation again, and calculating and recording the corresponding fusion score, until the fusion scores of the merged results under all the different references have been obtained; then selecting the merged result with the largest fusion score as the final globally unified cleaning result of the tuple.
Further, the step (6) is specifically: after the two-stage cleaning is completed, the whole data set is scanned, a hash table is built over its tuples, and whenever a duplicate entry is scanned it is removed.
The invention has the following beneficial effects. The invention is a mixed data cleaning method based on qualitative and quantitative techniques: it combines various types of integrity constraints through Markov logic network rules, and introduces a Markov logic network weight learning method together with a similarity distance measurement method as the joint basis for data cleaning, so that the cleaning results satisfy the minimum-cost principle required by qualitative techniques while also conforming to the statistical characteristics exploited by quantitative techniques. In addition, the optimization designed by the invention, the Markov logic index, reduces the detection range of dirty data and shortens the running time of data cleaning. Experiments on real and synthetic data sets show higher cleaning efficiency and cleaning precision than currently popular systems.
Drawings
FIG. 1 is a flow chart of the steps of carrying out the present invention;
FIG. 2(a) is the Markov logic network index structure formed from the hospital data set according to rule (r1) FD: t1[CT] = t2[CT] => t1[ST] = t2[ST];
FIG. 2(b) is the Markov logic network index structure formed from the hospital data set according to rule (r2) DC: ¬(t1[ST] ≠ t2[ST] ∧ t1[PN] = t2[PN]);
FIG. 2(c) is the Markov logic network index structure formed from the hospital data set according to rule (r3) CFD: HN["ELIZA"], CT["BOAZ"] => PN["2567688400"];
FIG. 3(a) is a schematic diagram of the Markov logic network index structure corresponding to rule r1 after the first-stage cleaning;
FIG. 3(b) is a schematic diagram of the Markov logic network index structure corresponding to rule r2 after the first-stage cleaning;
FIG. 3(c) is a schematic diagram of the Markov logic network index structure corresponding to rule r3 after the first-stage cleaning;
FIG. 4 is a schematic diagram of the second-stage cleaning process.
Detailed Description
The technical solution of the present invention will be further explained with reference to the accompanying drawings and a specific implementation.
As shown in FIG. 1, the specific implementation process and working principle of the invention are as follows:
Step (1): input the integrity constraints (ICs) and the data set containing dirty data into the framework; the dirty data set and integrity constraints are illustrated below with Table 1:
Table 1 is a hospital information data set containing 4 attributes, namely hospital name (HN), city (CT), state (ST), and phone number (PN); the grey shading in Table 1 indicates erroneous data. Three integrity constraints are given:
(r1) FD: ∀t1, t2 ∈ D: t1[CT] = t2[CT] => t1[ST] = t2[ST]
(r2) DC: ∀t1, t2 ∈ D: ¬(t1[ST] ≠ t2[ST] ∧ t1[PN] = t2[PN])
(r3) CFD: HN["ELIZA"], CT["BOAZ"] => PN["2567688400"]
where D represents the data set and t1, t2 represent two different tuples. The functional dependency (FD) rule r1 states that a city can belong to only one state; the denial constraint (DC) rule r2 states that hospitals in different states have different phone numbers; the conditional functional dependency (CFD) rule r3 states that the name of a hospital, together with its city and state, determines the hospital's phone number.
Table 1:
[Table 1 is rendered as an image in the original; it lists hospital tuples with the attributes HN, CT, ST and PN, with erroneous cells shaded grey.]
Step (2): convert the different types of integrity constraint rules into Markov logic network standardized rules, and instantiate the converted standardized rules with the constants contained in the tuples of the dirty data set, where each instantiated rule is called a data slice.
The method comprises the following specific steps:
1) standardizing the input different types of integrity constraints into Markov logic network rules through conjunctive normal form conversion rules;
2) variables in the normalized rule are replaced with constants of the data set.
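For illustration, the following Python sketch grounds one FD-style normalized rule over a toy data set, producing one data slice per instantiation. The tuple layout and helper names are assumptions of this sketch, not part of the patent text; the attribute values echo the hospital example.

```python
from itertools import combinations

# Hypothetical tuple layout: each tuple is a dict of attribute -> value.
tuples = {
    "t1": {"HN": "ELIZA", "CT": "BOAZ",   "ST": "AL", "PN": "2567688400"},
    "t2": {"HN": "HOSP2", "CT": "DOTHAN", "ST": "AL", "PN": "3347938701"},
    "t3": {"HN": "ELIZA", "CT": "DOTHAN", "ST": "AL", "PN": "2567688400"},
}

def ground_fd(reason_attr, result_attr, data):
    """Instantiate the normalized rule reason(x)=reason(y) => result(x)=result(y)
    with the constants of the data set; each grounding is one data slice."""
    slices = []
    for (i, u), (j, v) in combinations(data.items(), 2):
        slices.append({
            "rule": f"{reason_attr} -> {result_attr}",
            "tuples": (i, j),
            "reason": (u[reason_attr], v[reason_attr]),
            "result": (u[result_attr], v[result_attr]),
        })
    return slices

print(ground_fd("CT", "ST", tuples))  # data slices for rule r1: CT -> ST
```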
Step (3): establish a Markov logic index structure for the dirty data set: divide the dirty data set into different data blocks according to the rules, where each rule corresponds to one data block and the minimum unit in each data block is a data slice; then divide each data block into different groups again. The specific steps are:
1) the integrity constraint rules contained in the dirty data set divide the whole dirty data set into a plurality of data blocks; each rule corresponds to one data block, and each data block contains a plurality of data slices γ;
2) in each data block, the entries having the same key in the reason attribute are divided into the same group, where the key is the reason item of a rule; the slices γ having the same reason are divided into one group.
The following describes the Markov logic network index construction with reference to FIG. 2(a), FIG. 2(b), and FIG. 2(c) as an example:
Taking the data set of Table 1 as an example, given the three constraint rules over HN, CT, ST and PN, the data set is correspondingly divided into three blocks B1, B2 and B3 according to the three rules, noting that the reason attributes and the result attributes of the constraint rules are distinguished. Next, the three blocks are grouped, and the slices whose reason attributes share the same key within a block are divided into one group; for example, the three slices of group G13 in B1 have identical reason keys and are therefore grouped together. The Markov logic network index structure corresponding to B1 is shown in FIG. 2(a), that corresponding to B2 in FIG. 2(b), and that corresponding to B3 in FIG. 2(c);
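The blocking-and-grouping step can be mirrored in a few lines. The sketch below assumes the slice dictionaries produced by the grounding sketch above; one block is kept per rule, and within a block the slices sharing a reason key form one group.

```python
from collections import defaultdict

def build_ml_index(slices_by_rule):
    """Markov logic index sketch: one data block per rule; inside each
    block, data slices sharing the same reason key form one group."""
    index = {}
    for rule, slices in slices_by_rule.items():      # one block per rule
        groups = defaultdict(list)
        for s in slices:
            groups[s["reason"]].append(s)            # group by reason key
        index[rule] = dict(groups)
    return index

# e.g. index = build_ml_index({"CT -> ST": ground_fd("CT", "ST", tuples)})
```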
and (4): on the basis of the step (3), performing a first-stage cleaning, introducing evaluation criteria of credibility scores, and cleaning a plurality of data versions (each data version is from different blocks) independently for each data group, specifically as follows:
1) and processing the abnormal data. The phenomenon that the corresponding data slices are divided into incorrect groups due to the occurrence of data errors in the reason items is called as 'abnormal', and then the data slices with the errors are divided into the corresponding groups again;
2) calculating the reliability score (r-score) of the abnormal data in each group according to the similarity distance measurement method and the Markov logic network weight learning method. The formula [rendered as an image in the original] combines d(γi, γ*), the distance between a data slice γi and its candidate substitute data γ*, with w(γi), the Markov weight of the slice γi.
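Since the r-score formula itself appears only as an image in the source, the helper below is an assumed stand-in that merely realizes the monotonicity the text implies: a higher Markov weight and a smaller distance to the candidate substitute yield a higher score.

```python
def r_score(w_gamma: float, d_gamma: float) -> float:
    """Credibility score of a data slice (ASSUMED form; the patent's exact
    formula is not reproduced here). Grows with the Markov weight w(gamma_i)
    and shrinks with the distance d(gamma_i, gamma*)."""
    return w_gamma / (1.0 + d_gamma)
```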
3) Each data block is cleaned independently. Specifically, the cleaning unit is each group in the data block: the data slice γ with the highest credibility score is selected as the replacement reference, and the other suspicious data in the same group are replaced with it. Once every group in the data block has been cleaned, the independent cleaning of that data block is complete. Similarly, the cleaning is performed on the other data blocks; the plurality of preliminary cleaning results obtained in this stage are regarded as a plurality of data versions, where each data block yields one data version. The Markov logic index structure after this stage of cleaning is shown in FIG. 3(a), FIG. 3(b), and FIG. 3(c).
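A compact sketch of this first-stage pass over one block follows; `r_score(slice)` stands for the credibility score of a slice (for instance built from the stand-in above), and the block layout matches the grouping sketch.

```python
def clean_block(block, r_score):
    """First-stage cleaning of one data block, yielding one data version.
    block maps group_key -> [slice, ...]; r_score(slice) returns the
    credibility score of a slice."""
    cleaned = {}
    for key, group in block.items():
        ref = max(group, key=r_score)        # replacement reference gamma
        cleaned[key] = [ref] * len(group)    # replace the suspicious slices
    return cleaned
```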
Step (5): since the first-stage cleaning produces multiple data versions, conflicts may arise between different versions, i.e., the same position in the data set receives different cleaning results in different versions. Therefore, the fusion-score evaluation criterion is introduced to eliminate the multi-version conflicts and obtain the final overall unified cleaning result.
Take the tuple t3 in Table 1 as an example. After the first-stage cleaning, the slice related to t3 in B1 is {CT: DOTHAN, ST: AL} (the first data version), whereas the slice related to t3 in B3 is {HN: ELIZA, CT: BOAZ, PN: 2567688400} (the third data version). Obviously, t3[CT] is associated with two different values after the first-stage cleaning (namely "DOTHAN" and "BOAZ"), which come from two different data versions. In other words, for t3 there is a conflict in the attribute CT, and the conflict must be resolved in order to obtain a final consistent cleaning result.
The method comprises the following steps:
1) All tuples containing conflicts are detected, and the data slice where each conflict is located is recorded. As shown in FIG. 4, t3 corresponds to two conflicting data slices, α1 ∈ B1 and α2 ∈ B3, which serve respectively as the references for generating different candidate schemes.
2) For each reference, the corresponding data slices of the other data blocks are merged. Two situations must be considered: if a data slice to be merged does not conflict with the reference, it is merged directly; if it conflicts, another data slice (one that does not conflict with the reference and has the largest Markov weight) must be found in the block of the slice to be merged, and the merging is then performed. The merged new data slice serves as the new reference, and the above steps are repeated until all the data blocks have been merged. Note that if no qualifying data slice can be found during merging, the merging is considered impossible under that reference.
3) After step 2) has been executed, a number of candidate schemes have been generated for each tuple containing a conflict. The fusion score (f-score) is introduced to score each candidate scheme, and the one with the highest score is selected as the final result; the fusion score formula is f-score(t) = w1 × … × wm. As shown in FIG. 4, for the merging scheme with α1 ∈ B1 as the reference, no qualifying data slice can be found when merging the corresponding slice of B3, so the merging cannot be completed under this reference, and f-score(t3) = 0 is recorded. With α2 ∈ B3 as the reference, the merged result is t3 = {HN: ELIZA, CT: BOAZ, ST: AL, PN: 2567688400}, with corresponding f-score(t3) = 0.0678. Therefore, the second merging scheme is taken as the final cleaning result for t3.
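The fusion pass can be sketched as follows, under stated assumptions: `references` pairs each conflicting slice with the index of its home block, `conflicts(a, b)` is an assumed predicate testing whether two slices disagree on a shared position, and `weight` returns a slice's Markov weight. Each reference is tried in turn; per other block the heaviest compatible slice is merged, and the candidate whose f-score (the product of the merged slices' weights) is largest wins.

```python
def fuse(references, blocks, weight, conflicts):
    """Second-stage fusion sketch: f-score(t) = w1 x ... x wm, taken over
    the merged slice of each block; an infeasible reference scores 0."""
    best, best_score = None, 0.0
    for ref, home in references:               # (slice, index of its block)
        merged, score, feasible = [ref], weight(ref), True
        for b, block in enumerate(blocks):
            if b == home:                       # skip the reference's block
                continue
            options = [s for s in block
                       if not any(conflicts(s, m) for m in merged)]
            if not options:                     # merging fails: f-score = 0
                feasible = False
                break
            pick = max(options, key=weight)     # largest Markov weight
            merged.append(pick)
            score *= weight(pick)
        if feasible and score > best_score:
            best, best_score = merged, score
    return best, best_score
```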
Step (6): after the two-stage cleaning is completed, the whole data set is scanned, a hash table is built over its tuples, and repeated items are removed as they are scanned.
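Step (6) reduces to a single scan with a hash of each tuple's attribute values; a minimal sketch under the tuple-as-dict layout used above:

```python
def dedupe(tuples):
    """Scan once, hashing each tuple's attribute values; drop repeats."""
    seen, unique = set(), []
    for t in tuples:
        key = tuple(sorted(t.items()))   # hashable signature of the tuple
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique
```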
Step (7): output the cleaned data set.

Claims (6)

1. A mixed data cleaning method based on multiple data versions is characterized by comprising the following steps:
(1) obtaining a dirty data set and its associated integrity constraint rules (ICs);
(2) converting different types of integrity constraint rules into Markov logic network standardization rules, and instantiating the converted standardization rules by using constants contained in tuples in a dirty data set, wherein each instantiation rule is called a data slice;
(3) establishing a Markov logic index structure for a dirty data set, dividing the dirty data set into different data blocks according to rules, wherein each rule corresponds to one data block, the minimum unit in each data block is a data slice, and then dividing each data block into different data groups again;
(4) on the basis of step (3), performing the first-stage cleaning: introducing the credibility-score evaluation criterion and cleaning each data block independently to obtain the data versions of a plurality of preliminary cleaning results;
(5) executing the second-stage cleaning: introducing the fusion-score evaluation criterion, fusing the data versions of the plurality of preliminary cleaning results generated in the first stage, eliminating the conflicts among the multiple versions, and generating the final unified cleaning result;
(6) marking the repeated items existing in the dirty data set, and deleting the repeated data still present after the two-stage cleaning;
(7) outputting the data set after data cleaning.
2. The hybrid data cleansing method based on multiple data versions according to claim 1, characterized in that: the step (2) is specifically as follows:
(2.1) normalizing the input different types of integrity constraints into Markov logic network rules through conjunctive normal form conversion rules;
(2.2) replace all variables in the normalized rule with the corresponding constants of the data set.
3. The hybrid data cleansing method based on multiple data versions according to claim 1, characterized in that: the step (3) is specifically as follows:
(3.1) dividing the whole dirty data set into a plurality of data blocks according to integrity constraint rules contained in the dirty data set, wherein each rule corresponds to one data block, and each data block contains a plurality of data slices;
(3.2) in each data block, dividing the entries whose reason attributes share the same key into the same group; the key is the reason item of the rule, and the data slices with the same reason are divided into one group.
4. The hybrid data cleansing method based on multiple data versions according to claim 1, characterized in that: the step (4) is specifically as follows:
(4.1) processing abnormal data: the phenomenon in which a data slice is divided into an incorrect group because a data error occurs in its reason item is called an "anomaly"; the erroneous data slices are then re-divided into their correct groups;
(4.2) calculating a reliability score (reliability score) of abnormal data in each group according to a similarity distance measurement method and a Markov logic network weight learning method;
(4.3) cleaning each data block independently: the cleaning unit is each group in the data block; the data slice γ with the largest credibility score is selected as the replacement reference, and the other suspicious data belonging to the same data group are replaced with it, until every data group in the data block has been cleaned, which completes the independent cleaning of that data block;
similarly, the cleaning is performed on the other data blocks; the plurality of preliminary cleaning results obtained in this stage are regarded as a plurality of data versions, where each data block yields one data version.
5. The hybrid data cleansing method based on multiple data versions according to claim 1, characterized in that: the step (5) is specifically as follows:
(5.1) first, recording the data slices of all the different data versions at each position where a conflict occurs, each serving as a reference; then, starting from each reference, finding in every data block other than the reference's own block the data slice that does not conflict with the reference and has the largest Markov weight, and merging it with the reference;
(5.2) repeatedly executing the merging operation until all the data blocks have been traversed; then calculating the fusion score f-score(t) = w1 × … × wm of the merged result under this reference, where wi denotes the Markov weight of the merged data slice in the i-th data block;
(5.3) selecting another reference as the start, executing the merging operation again, and calculating and recording the corresponding fusion score, until the fusion scores of the merged results under all the different references have been obtained; then selecting the merged result with the largest fusion score as the final globally unified cleaning result of the tuple.
6. The hybrid data cleansing method based on multiple data versions according to claim 1, characterized in that: the step (6) is specifically: after the two-stage cleaning is completed, the whole data set is scanned, a hash table is built over its tuples, and whenever a duplicate entry is scanned it is removed.
CN201811628044.3A 2018-12-28 2018-12-28 Mixed data cleaning method based on multiple data versions Active CN109634949B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811628044.3A CN109634949B (en) 2018-12-28 2018-12-28 Mixed data cleaning method based on multiple data versions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811628044.3A CN109634949B (en) 2018-12-28 2018-12-28 Mixed data cleaning method based on multiple data versions

Publications (2)

Publication Number Publication Date
CN109634949A CN109634949A (en) 2019-04-16
CN109634949B (en) 2022-04-12

Family

ID=66079015

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811628044.3A Active CN109634949B (en) 2018-12-28 2018-12-28 Mixed data cleaning method based on multiple data versions

Country Status (1)

Country Link
CN (1) CN109634949B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287191B (en) * 2019-06-25 2021-07-27 北京明略软件系统有限公司 Data alignment method and device, storage medium and electronic device
CN110968576A (en) * 2019-11-28 2020-04-07 哈尔滨工程大学 Content correlation-based numerical data consistency cleaning method
CN111291029B (en) * 2020-01-17 2024-03-08 深圳市华傲数据技术有限公司 Data cleaning method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2919533A1 (en) * 2012-08-01 2014-02-06 Sherpa Technologies Inc. System and method for managing versions of program assets
CN105339940A (en) * 2013-06-28 2016-02-17 甲骨文国际公司 Naive, client-side sharding with online addition of shards
CN106649644A (en) * 2016-12-08 2017-05-10 腾讯音乐娱乐(深圳)有限公司 Lyric file generation method and device
CN108921399A (en) * 2018-06-14 2018-11-30 北京新广视通科技有限公司 A kind of intelligence direct management system and method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180150543A1 (en) * 2016-11-30 2018-05-31 Linkedin Corporation Unified multiversioned processing of derived data
US10205735B2 (en) * 2017-01-30 2019-02-12 Splunk Inc. Graph-based network security threat detection across time and entities

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2919533A1 (en) * 2012-08-01 2014-02-06 Sherpa Technologies Inc. System and method for managing versions of program assets
CN105339940A (en) * 2013-06-28 2016-02-17 甲骨文国际公司 Naive, client-side sharding with online addition of shards
CN106649644A (en) * 2016-12-08 2017-05-10 腾讯音乐娱乐(深圳)有限公司 Lyric file generation method and device
CN108921399A (en) * 2018-06-14 2018-11-30 北京新广视通科技有限公司 A kind of intelligence direct management system and method

Also Published As

Publication number Publication date
CN109634949A (en) 2019-04-16

Similar Documents

Publication Publication Date Title
CN110008288B (en) Construction method and application of knowledge map library for network fault analysis
CN109634949B (en) Mixed data cleaning method based on multiple data versions
US8321383B2 (en) System and method for automatic weight generation for probabilistic matching
Shivaji et al. Reducing features to improve code change-based bug prediction
CN110263230B (en) Data cleaning method and device based on density clustering
CN101986296B (en) Noise data cleaning method based on semantic ontology
US20100235296A1 (en) Flow comparison processing method and apparatus
Ge et al. A hybrid data cleaning framework using markov logic networks
Kumar et al. Attribute correction-data cleaning using association rule and clustering methods
CN110389950B (en) Rapid running big data cleaning method
CN104268216A (en) Data cleaning system based on internet information
Hao et al. Cleaning relations using knowledge bases
CN104699796A (en) Data cleaning method based on data warehouse
CN113487211A (en) Nuclear power equipment quality tracing method and system, computer equipment and medium
US11321359B2 (en) Review and curation of record clustering changes at large scale
Berko et al. Knowledge-based Big Data cleanup method
Ciszak Application of clustering and association methods in data cleaning
Monge An adaptive and efficient algorithm for detecting approximately duplicate database records
US12052134B2 (en) Identification of clusters of elements causing network performance degradation or outage
CN115185933A (en) Multi-source manufacturing data preprocessing method for aerospace products
Zada et al. Large-scale data integration using graph probabilistic dependencies (gpds)
Du et al. Research on data cleaning technology based on RD-CFD method
CN110968576A (en) Content correlation-based numerical data consistency cleaning method
CN108776697B (en) Multi-source data set cleaning method based on predicates
Ali et al. Duplicates detection within incomplete data sets using blocking and dynamic sorting key methods

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant