CN109634949B - Mixed data cleaning method based on multiple data versions - Google Patents

Mixed data cleaning method based on multiple data versions

Info

Publication number
CN109634949B
Authority
CN
China
Prior art keywords
data
cleaning
versions
stage
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811628044.3A
Other languages
Chinese (zh)
Other versions
CN109634949A (en)
Inventor
高云君
陈刚
陈纯
葛丛丛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201811628044.3A priority Critical patent/CN109634949B/en
Publication of CN109634949A publication Critical patent/CN109634949A/en
Application granted granted Critical
Publication of CN109634949B publication Critical patent/CN109634949B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a mixed data cleaning method based on multiple data versions. The method uses a Markov logic network probabilistic graphical model together with the minimum-repair principle, combining qualitative and quantitative techniques into an efficient data cleaning method that detects and corrects erroneous structured data. The cleaning result removes dirty data that violates the rule constraints, satisfies the requirement of minimum change cost to the data set, and conforms to the statistical characteristics of the data. The invention first divides the entire data set into blocks and groups according to the Markov logic indexing technique, and then performs a two-stage data cleaning. In the first stage, the data in each group are cleaned under a credibility-score evaluation criterion, yielding multi-version data cleaning results; in the second stage, the multi-version results generated in the previous stage are fused under a fusion-score evaluation criterion, producing the final unified cleaning result.

Description

Mixed data cleaning method based on multiple data versions
Technical Field
The invention relates to a technology for cleaning erroneous data in the field of computer databases, in particular to a mixed data cleaning method based on multiple data versions.
Background
The purpose of data cleaning is to find the content in a data set that is most likely to be erroneous and to provide a reliable method of correcting it. Dirty data refers to the erroneous data present in a data set.
Nowadays, with the continuous emergence of new information publishing modes represented by social networks and electronic commerce, and with the rise of cloud computing and Internet-of-Things technologies, data is growing and accumulating at an unprecedented speed. In data analysis, the presence of dirty data not only leads to wrong decisions and unreliable analysis, but can also inflict economic losses on a company. Data cleaning has therefore attracted great interest in both industry and academia. Data cleaning is the process of detecting and repairing erroneous data; it aims to delete redundant information, correct existing errors, and maintain data consistency.
At present, scholars at home and abroad have produced a body of work on data cleaning methods. Mainstream methods can be roughly divided into two types, qualitative and quantitative: (1) qualitative methods mainly clean erroneous data that violates integrity constraint rules; their evaluation standard is the minimum-cost principle, i.e., minimizing the change that cleaning makes to the data set. Their drawback is that they cannot clean erroneous data that fails the minimum-cost principle, even though that data still violates the integrity constraints. (2) Quantitative methods construct a suitable model based on the probability distribution of the data to decide a cleaning strategy. Their drawbacks are a strong dependence on the training set: enough clean, known data must be provided to build a reliable model, which is ill-suited to the current big-data environment. Moreover, most current quantitative methods clean data less accurately than qualitative methods, and their running time is longer.
Disclosure of Invention
In order to overcome these defects, the invention provides a mixed data cleaning method based on multiple data versions. The method is based on a Markov logic network. First, the whole data set is divided into blocks and groups according to the Markov logic index technique; then two-stage data cleaning is carried out. In the first stage, each block is cleaned independently, yielding multi-version data cleaning results; in the second stage, based on the multi-version data results, the conflicts are eliminated, yielding the final overall unified cleaning result. The Markov logic index technique reduces the detection range of dirty data, so data cleaning can be performed efficiently.
In order to achieve this purpose, the technical scheme adopted by the invention is as follows: a mixed data cleaning method based on multiple data versions, comprising the following steps:
(1) obtaining a dirty data set and its associated integrity constraint rules (ICs);
(2) converting different types of integrity constraint rules into Markov logic network standardization rules, and instantiating the converted standardization rules by using constants contained in tuples in a dirty data set, wherein each instantiation rule is called a data slice;
(3) establishing a Markov logic index structure for a dirty data set, dividing the dirty data set into different data blocks according to rules, wherein each rule corresponds to one data block, the minimum unit in each data block is a data slice, and then dividing each data block into different data groups again;
(4) on the basis of step (3), performing the first-stage cleaning: introducing the credibility-score evaluation criterion and cleaning each data block independently to obtain the data versions of a plurality of preliminary cleaning results;
(5) executing the second-stage cleaning: introducing the fusion-score evaluation criterion, fusing the data versions of the plurality of preliminary cleaning results generated in the first stage, eliminating the conflicts among the multiple versions, and generating the final unified cleaning result;
(6) marking the repeated items existing in the dirty data set, and deleting the repeated data still present after the two-stage cleaning;
(7) outputting the data set after data cleaning.
Further, the step (2) is specifically as follows:
(2.1) normalizing the input different types of integrity constraints into Markov logic network rules through conjunctive normal form conversion rules;
(2.2) replace all variables in the normalized rule with the corresponding constants of the data set.
Further, the step (3) is specifically as follows:
(3.1) dividing the whole dirty data set into a plurality of data blocks according to integrity constraint rules contained in the dirty data set, wherein each rule corresponds to one data block, and each data block contains a plurality of data slices;
(3.2) in each data block, dividing the entries whose reason attributes share the same key into the same group; the key is the reason item of the rule, and the data slices with the same reason are divided into one group.
Further, the step (4) is specifically as follows:
(4.1) processing abnormal data: the phenomenon in which a data slice is divided into an incorrect group because a data error occurs in its reason item is called an "anomaly"; the erroneous data slices are then re-divided into their correct groups;
(4.2) calculating a reliability score (reliability score) of abnormal data in each group according to a similarity distance measurement method and a Markov logic network weight learning method;
(4.3) cleaning each data block independently: the cleaning unit is each group in the data block; the data slice γ with the largest credibility score is selected as the replacement reference, and the other suspicious data belonging to the same data group are replaced with it, until every data group in the data block has been cleaned, which completes the independent cleaning of that data block;
similarly, the cleaning is performed on the other data blocks; the plurality of preliminary cleaning results obtained in this stage are regarded as a plurality of data versions, where each data block yields one data version.
Further, the step (5) is specifically as follows:
(5.1) first, recording the data slices of all the different data versions at each position where a conflict occurs, each serving as a reference; then, starting from each reference, finding in every data block other than the reference's own block the data slice that does not conflict with the reference and has the largest Markov weight, and merging it with the reference;
(5.2) repeatedly executing the merging operation until all the data blocks have been traversed; then calculating the fusion score f-score(t) = w1 × … × wm of the merged result under this reference, where wi denotes the Markov weight of the merged data slice in the i-th data block;
(5.3) selecting another reference as the start, executing the merging operation again, and calculating and recording the corresponding fusion score, until the fusion scores of the merged results under all the different references have been obtained; then selecting the merged result with the largest fusion score as the final globally unified cleaning result of the tuple.
Further, the step (6) is specifically: after the two-stage cleaning is completed, the whole data set is scanned, a hash table is built over its tuples, and whenever a duplicate entry is scanned it is removed.
The invention has the following beneficial effects. The invention is a mixed data cleaning method based on qualitative and quantitative techniques: it combines various types of integrity constraints through Markov logic network rules, and introduces a Markov logic network weight learning method together with a similarity distance measurement method as the joint basis for data cleaning, so that the cleaning results satisfy the minimum-cost principle required by qualitative techniques while also conforming to the statistical characteristics exploited by quantitative techniques. In addition, the optimization designed by the invention, the Markov logic index, reduces the detection range of dirty data and shortens the running time of data cleaning. Experiments on real and synthetic data sets show higher cleaning efficiency and cleaning precision than currently popular systems.
Drawings
FIG. 1 is a flow chart of the steps of carrying out the present invention;
FIG. 2(a) is the Markov logic network index structure formed from the hospital data set according to rule (r1) FD: t1[CT] = t2[CT] => t1[ST] = t2[ST];
FIG. 2(b) is the Markov logic network index structure formed from the hospital data set according to rule (r2) DC: ¬(t1[ST] ≠ t2[ST] ∧ t1[PN] = t2[PN]);
FIG. 2(c) is the Markov logic network index structure formed from the hospital data set according to rule (r3) CFD: HN["ELIZA"], CT["BOAZ"] => PN["2567688400"];
FIG. 3(a) is a schematic diagram of the Markov logic network index structure corresponding to rule r1 after the first-stage cleaning;
FIG. 3(b) is a schematic diagram of the Markov logic network index structure corresponding to rule r2 after the first-stage cleaning;
FIG. 3(c) is a schematic diagram of the Markov logic network index structure corresponding to rule r3 after the first-stage cleaning;
FIG. 4 is a schematic diagram of the second-stage cleaning process.
Detailed Description
The technical solution of the present invention will be further explained with reference to the accompanying drawings and a specific implementation.
As shown in FIG. 1, the specific implementation process and working principle of the invention are as follows:
Step (1): input the integrity constraints (ICs) and the data set containing dirty data into the framework; the dirty data set and integrity constraints are illustrated below with Table 1:
Table 1 is a hospital information data set containing 4 attributes, namely hospital name (HN), city (CT), state (ST), and phone number (PN); the grey shading in Table 1 indicates erroneous data. Three integrity constraints are given:
(r1) FD: ∀t1, t2 ∈ D: t1[CT] = t2[CT] => t1[ST] = t2[ST]
(r2) DC: ∀t1, t2 ∈ D: ¬(t1[ST] ≠ t2[ST] ∧ t1[PN] = t2[PN])
(r3) CFD: HN["ELIZA"], CT["BOAZ"] => PN["2567688400"]
where D represents the data set and t1, t2 represent two different tuples. The functional dependency (FD) rule r1 states that a city can belong to only one state; the denial constraint (DC) rule r2 states that hospitals in different states have different phone numbers; the conditional functional dependency (CFD) rule r3 states that the name of a hospital, together with its city and state, determines the hospital's phone number.
Table 1:
[Table 1 is rendered as an image in the original; it lists hospital tuples with the attributes HN, CT, ST and PN, with erroneous cells shaded grey.]
Step (2): convert the different types of integrity constraint rules into Markov logic network standardized rules, and instantiate the converted standardized rules with the constants contained in the tuples of the dirty data set, where each instantiated rule is called a data slice.
The method comprises the following specific steps:
1) standardizing the input different types of integrity constraints into Markov logic network rules through conjunctive normal form conversion rules;
2) variables in the normalized rule are replaced with constants of the data set.
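For illustration, the following Python sketch grounds one FD-style normalized rule over a toy data set, producing one data slice per instantiation. The tuple layout and helper names are assumptions of this sketch, not part of the patent text; the attribute values echo the hospital example.

```python
from itertools import combinations

# Hypothetical tuple layout: each tuple is a dict of attribute -> value.
tuples = {
    "t1": {"HN": "ELIZA", "CT": "BOAZ",   "ST": "AL", "PN": "2567688400"},
    "t2": {"HN": "HOSP2", "CT": "DOTHAN", "ST": "AL", "PN": "3347938701"},
    "t3": {"HN": "ELIZA", "CT": "DOTHAN", "ST": "AL", "PN": "2567688400"},
}

def ground_fd(reason_attr, result_attr, data):
    """Instantiate the normalized rule reason(x)=reason(y) => result(x)=result(y)
    with the constants of the data set; each grounding is one data slice."""
    slices = []
    for (i, u), (j, v) in combinations(data.items(), 2):
        slices.append({
            "rule": f"{reason_attr} -> {result_attr}",
            "tuples": (i, j),
            "reason": (u[reason_attr], v[reason_attr]),
            "result": (u[result_attr], v[result_attr]),
        })
    return slices

print(ground_fd("CT", "ST", tuples))  # data slices for rule r1: CT -> ST
```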
Step (3): establish a Markov logic index structure for the dirty data set: divide the dirty data set into different data blocks according to the rules, where each rule corresponds to one data block and the minimum unit in each data block is a data slice; then divide each data block into different groups again. The specific steps are:
1) the integrity constraint rules contained in the dirty data set divide the whole dirty data set into a plurality of data blocks; each rule corresponds to one data block, and each data block contains a plurality of data slices γ;
2) in each data block, the entries having the same key in the reason attribute are divided into the same group, where the key is the reason item of a rule; the slices γ having the same reason are divided into one group.
The following describes the Markov logic network index construction with reference to FIG. 2(a), FIG. 2(b), and FIG. 2(c) as an example:
Taking the data set of Table 1 as an example, given the three constraint rules over HN, CT, ST and PN, the data set is correspondingly divided into three blocks B1, B2 and B3 according to the three rules, noting that the reason attributes and the result attributes of the constraint rules are distinguished. Next, the three blocks are grouped, and the slices whose reason attributes share the same key within a block are divided into one group; for example, the three slices of group G13 in B1 have identical reason keys and are therefore grouped together. The Markov logic network index structure corresponding to B1 is shown in FIG. 2(a), that corresponding to B2 in FIG. 2(b), and that corresponding to B3 in FIG. 2(c);
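The blocking-and-grouping step can be mirrored in a few lines. The sketch below assumes the slice dictionaries produced by the grounding sketch above; one block is kept per rule, and within a block the slices sharing a reason key form one group.

```python
from collections import defaultdict

def build_ml_index(slices_by_rule):
    """Markov logic index sketch: one data block per rule; inside each
    block, data slices sharing the same reason key form one group."""
    index = {}
    for rule, slices in slices_by_rule.items():      # one block per rule
        groups = defaultdict(list)
        for s in slices:
            groups[s["reason"]].append(s)            # group by reason key
        index[rule] = dict(groups)
    return index

# e.g. index = build_ml_index({"CT -> ST": ground_fd("CT", "ST", tuples)})
```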
and (4): on the basis of the step (3), performing a first-stage cleaning, introducing evaluation criteria of credibility scores, and cleaning a plurality of data versions (each data version is from different blocks) independently for each data group, specifically as follows:
1) and processing the abnormal data. The phenomenon that the corresponding data slices are divided into incorrect groups due to the occurrence of data errors in the reason items is called as 'abnormal', and then the data slices with the errors are divided into the corresponding groups again;
2) calculating the reliability score (r-score) of the abnormal data in each group according to the similarity distance measurement method and the Markov logic network weight learning method. The formula [rendered as an image in the original] combines d(γi, γ*), the distance between a data slice γi and its candidate substitute data γ*, with w(γi), the Markov weight of the slice γi.
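Since the r-score formula itself appears only as an image in the source, the helper below is an assumed stand-in that merely realizes the monotonicity the text implies: a higher Markov weight and a smaller distance to the candidate substitute yield a higher score.

```python
def r_score(w_gamma: float, d_gamma: float) -> float:
    """Credibility score of a data slice (ASSUMED form; the patent's exact
    formula is not reproduced here). Grows with the Markov weight w(gamma_i)
    and shrinks with the distance d(gamma_i, gamma*)."""
    return w_gamma / (1.0 + d_gamma)
```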
3) Each data block is cleaned independently. Specifically, the cleaning unit is each group in the data block: the data slice γ with the highest credibility score is selected as the replacement reference, and the other suspicious data in the same group are replaced with it. Once every group in the data block has been cleaned, the independent cleaning of that data block is complete. Similarly, the cleaning is performed on the other data blocks; the plurality of preliminary cleaning results obtained in this stage are regarded as a plurality of data versions, where each data block yields one data version. The Markov logic index structure after this stage of cleaning is shown in FIG. 3(a), FIG. 3(b), and FIG. 3(c).
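A compact sketch of this first-stage pass over one block follows; `r_score(slice)` stands for the credibility score of a slice (for instance built from the stand-in above), and the block layout matches the grouping sketch.

```python
def clean_block(block, r_score):
    """First-stage cleaning of one data block, yielding one data version.
    block maps group_key -> [slice, ...]; r_score(slice) returns the
    credibility score of a slice."""
    cleaned = {}
    for key, group in block.items():
        ref = max(group, key=r_score)        # replacement reference gamma
        cleaned[key] = [ref] * len(group)    # replace the suspicious slices
    return cleaned
```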
Step (5): since the first-stage cleaning produces multiple data versions, conflicts may arise between different versions, i.e., the same position in the data set receives different cleaning results in different versions. Therefore, the fusion-score evaluation criterion is introduced to eliminate the multi-version conflicts and obtain the final overall unified cleaning result.
Take the tuple t3 in Table 1 as an example. After the first-stage cleaning, the slice related to t3 in B1 is {CT: DOTHAN, ST: AL} (the first data version), whereas the slice related to t3 in B3 is {HN: ELIZA, CT: BOAZ, PN: 2567688400} (the third data version). Obviously, t3[CT] is associated with two different values after the first-stage cleaning (namely "DOTHAN" and "BOAZ"), which come from two different data versions. In other words, for t3 there is a conflict in the attribute CT, and the conflict must be resolved in order to obtain a final consistent cleaning result.
The method comprises the following steps:
1) All tuples containing conflicts are detected, and the data slice where each conflict is located is recorded. As shown in FIG. 4, t3 corresponds to two conflicting data slices, α1 ∈ B1 and α2 ∈ B3, which serve respectively as the references for generating different candidate schemes.
2) For each reference, the corresponding data slices of the other data blocks are merged. Two situations must be considered: if a data slice to be merged does not conflict with the reference, it is merged directly; if it conflicts, another data slice (one that does not conflict with the reference and has the largest Markov weight) must be found in the block of the slice to be merged, and the merging is then performed. The merged new data slice serves as the new reference, and the above steps are repeated until all the data blocks have been merged. Note that if no qualifying data slice can be found during merging, the merging is considered impossible under that reference.
3) After step 2) has been executed, a number of candidate schemes have been generated for each tuple containing a conflict. The fusion score (f-score) is introduced to score each candidate scheme, and the one with the highest score is selected as the final result; the fusion score formula is f-score(t) = w1 × … × wm. As shown in FIG. 4, for the merging scheme with α1 ∈ B1 as the reference, no qualifying data slice can be found when merging the corresponding slice of B3, so the merging cannot be completed under this reference, and f-score(t3) = 0 is recorded. With α2 ∈ B3 as the reference, the merged result is t3 = {HN: ELIZA, CT: BOAZ, ST: AL, PN: 2567688400}, with corresponding f-score(t3) = 0.0678. Therefore, the second merging scheme is taken as the final cleaning result for t3.
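The fusion pass can be sketched as follows, under stated assumptions: `references` pairs each conflicting slice with the index of its home block, `conflicts(a, b)` is an assumed predicate testing whether two slices disagree on a shared position, and `weight` returns a slice's Markov weight. Each reference is tried in turn; per other block the heaviest compatible slice is merged, and the candidate whose f-score (the product of the merged slices' weights) is largest wins.

```python
def fuse(references, blocks, weight, conflicts):
    """Second-stage fusion sketch: f-score(t) = w1 x ... x wm, taken over
    the merged slice of each block; an infeasible reference scores 0."""
    best, best_score = None, 0.0
    for ref, home in references:               # (slice, index of its block)
        merged, score, feasible = [ref], weight(ref), True
        for b, block in enumerate(blocks):
            if b == home:                       # skip the reference's block
                continue
            options = [s for s in block
                       if not any(conflicts(s, m) for m in merged)]
            if not options:                     # merging fails: f-score = 0
                feasible = False
                break
            pick = max(options, key=weight)     # largest Markov weight
            merged.append(pick)
            score *= weight(pick)
        if feasible and score > best_score:
            best, best_score = merged, score
    return best, best_score
```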
Step (6): after the two-stage cleaning is completed, the whole data set is scanned, a hash table is built over its tuples, and repeated items are removed as they are scanned.
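Step (6) reduces to a single scan with a hash of each tuple's attribute values; a minimal sketch under the tuple-as-dict layout used above:

```python
def dedupe(tuples):
    """Scan once, hashing each tuple's attribute values; drop repeats."""
    seen, unique = set(), []
    for t in tuples:
        key = tuple(sorted(t.items()))   # hashable signature of the tuple
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique
```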
Step (7): output the cleaned data set.

Claims (6)

1. A mixed data cleaning method based on multiple data versions is characterized by comprising the following steps:
(1) obtaining a dirty data set and its associated integrity constraint rules (ICs);
(2) converting different types of integrity constraint rules into Markov logic network standardization rules, and instantiating the converted standardization rules by using constants contained in tuples in a dirty data set, wherein each instantiation rule is called a data slice;
(3) establishing a Markov logic index structure for a dirty data set, dividing the dirty data set into different data blocks according to rules, wherein each rule corresponds to one data block, the minimum unit in each data block is a data slice, and then dividing each data block into different data groups again;
(4) on the basis of step (3), performing the first-stage cleaning: introducing the credibility-score evaluation criterion and cleaning each data block independently to obtain the data versions of a plurality of preliminary cleaning results;
(5) executing the second-stage cleaning: introducing the fusion-score evaluation criterion, fusing the data versions of the plurality of preliminary cleaning results generated in the first stage, eliminating the conflicts among the multiple versions, and generating the final unified cleaning result;
(6) marking the repeated items existing in the dirty data set, and deleting the repeated data still present after the two-stage cleaning;
(7) outputting the data set after data cleaning.
2. The hybrid data cleansing method based on multiple data versions according to claim 1, characterized in that: the step (2) is specifically as follows:
(2.1) normalizing the input different types of integrity constraints into Markov logic network rules through conjunctive normal form conversion rules;
(2.2) replace all variables in the normalized rule with the corresponding constants of the data set.
3. The hybrid data cleansing method based on multiple data versions according to claim 1, characterized in that: the step (3) is specifically as follows:
(3.1) dividing the whole dirty data set into a plurality of data blocks according to integrity constraint rules contained in the dirty data set, wherein each rule corresponds to one data block, and each data block contains a plurality of data slices;
(3.2) in each data block, dividing the entries whose reason attributes share the same key into the same group; the key is the reason item of the rule, and the data slices with the same reason are divided into one group.
4. The hybrid data cleansing method based on multiple data versions according to claim 1, characterized in that: the step (4) is specifically as follows:
(4.1) processing abnormal data: the phenomenon in which a data slice is divided into an incorrect group because a data error occurs in its reason item is called an "anomaly"; the erroneous data slices are then re-divided into their correct groups;
(4.2) calculating a reliability score (reliability score) of abnormal data in each group according to a similarity distance measurement method and a Markov logic network weight learning method;
(4.3) cleaning each data block independently: the cleaning unit is each group in the data block; the data slice γ with the largest credibility score is selected as the replacement reference, and the other suspicious data belonging to the same data group are replaced with it, until every data group in the data block has been cleaned, which completes the independent cleaning of that data block;
similarly, the cleaning is performed on the other data blocks; the plurality of preliminary cleaning results obtained in this stage are regarded as a plurality of data versions, where each data block yields one data version.
5. The hybrid data cleansing method based on multiple data versions according to claim 1, characterized in that: the step (5) is specifically as follows:
(5.1) first, recording the data slices of all the different data versions at each position where a conflict occurs, each serving as a reference; then, starting from each reference, finding in every data block other than the reference's own block the data slice that does not conflict with the reference and has the largest Markov weight, and merging it with the reference;
(5.2) repeatedly executing the merging operation until all the data blocks have been traversed; then calculating the fusion score f-score(t) = w1 × … × wm of the merged result under this reference, where wi denotes the Markov weight of the merged data slice in the i-th data block;
(5.3) selecting another reference as the start, executing the merging operation again, and calculating and recording the corresponding fusion score, until the fusion scores of the merged results under all the different references have been obtained; then selecting the merged result with the largest fusion score as the final globally unified cleaning result of the tuple.
6. The hybrid data cleansing method based on multiple data versions according to claim 1, characterized in that: the step (6) is specifically: after the two-stage cleaning is completed, the whole data set is scanned, a hash table is built over its tuples, and whenever a duplicate entry is scanned it is removed.
CN201811628044.3A 2018-12-28 2018-12-28 Mixed data cleaning method based on multiple data versions Active CN109634949B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811628044.3A CN109634949B (en) 2018-12-28 2018-12-28 Mixed data cleaning method based on multiple data versions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811628044.3A CN109634949B (en) 2018-12-28 2018-12-28 Mixed data cleaning method based on multiple data versions

Publications (2)

Publication Number Publication Date
CN109634949A CN109634949A (en) 2019-04-16
CN109634949B (en) 2022-04-12

Family

ID=66079015

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811628044.3A Active CN109634949B (en) 2018-12-28 2018-12-28 Mixed data cleaning method based on multiple data versions

Country Status (1)

Country Link
CN (1) CN109634949B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287191B (en) * 2019-06-25 2021-07-27 北京明略软件系统有限公司 Data alignment method and device, storage medium and electronic device
CN110968576A (en) * 2019-11-28 2020-04-07 哈尔滨工程大学 Content correlation-based numerical data consistency cleaning method
CN111291029B (en) * 2020-01-17 2024-03-08 深圳市华傲数据技术有限公司 Data cleaning method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2919533A1 (en) * 2012-08-01 2014-02-06 Sherpa Technologies Inc. System and method for managing versions of program assets
CN105339940A (en) * 2013-06-28 2016-02-17 甲骨文国际公司 Naive, client-side sharding with online addition of shards
CN106649644A (en) * 2016-12-08 2017-05-10 腾讯音乐娱乐(深圳)有限公司 Lyric file generation method and device
CN108921399A (en) * 2018-06-14 2018-11-30 北京新广视通科技有限公司 A kind of intelligence direct management system and method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180150543A1 (en) * 2016-11-30 2018-05-31 Linkedin Corporation Unified multiversioned processing of derived data
US10205735B2 (en) * 2017-01-30 2019-02-12 Splunk Inc. Graph-based network security threat detection across time and entities

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2919533A1 (en) * 2012-08-01 2014-02-06 Sherpa Technologies Inc. System and method for managing versions of program assets
CN105339940A (en) * 2013-06-28 2016-02-17 甲骨文国际公司 Naive, client-side sharding with online addition of shards
CN106649644A (en) * 2016-12-08 2017-05-10 腾讯音乐娱乐(深圳)有限公司 Lyric file generation method and device
CN108921399A (en) * 2018-06-14 2018-11-30 北京新广视通科技有限公司 A kind of intelligence direct management system and method

Also Published As

Publication number Publication date
CN109634949A (en) 2019-04-16

Similar Documents

Publication Publication Date Title
CN110008288B (en) Construction method and application of knowledge map library for network fault analysis
CN109634949B (en) Mixed data cleaning method based on multiple data versions
US8321383B2 (en) System and method for automatic weight generation for probabilistic matching
Shivaji et al. Reducing features to improve code change-based bug prediction
CN110263230B (en) Data cleaning method and device based on density clustering
CN101986296B (en) Noise data cleaning method based on semantic ontology
US20100235296A1 (en) Flow comparison processing method and apparatus
Ge et al. A hybrid data cleaning framework using markov logic networks
Kumar et al. Attribute correction-data cleaning using association rule and clustering methods
CN110389950B (en) Rapid running big data cleaning method
CN104268216A (en) Data cleaning system based on internet information
Hao et al. Cleaning relations using knowledge bases
CN104699796A (en) Data cleaning method based on data warehouse
CN113487211A (en) Nuclear power equipment quality tracing method and system, computer equipment and medium
US11321359B2 (en) Review and curation of record clustering changes at large scale
Berko et al. Knowledge-based Big Data cleanup method
Ciszak Application of clustering and association methods in data cleaning
Monge An adaptive and efficient algorithm for detecting approximately duplicate database records
US12052134B2 (en) Identification of clusters of elements causing network performance degradation or outage
CN115185933A (en) Multi-source manufacturing data preprocessing method for aerospace products
Zada et al. Large-scale data integration using graph probabilistic dependencies (gpds)
Du et al. Research on data cleaning technology based on RD-CFD method
CN110968576A (en) Content correlation-based numerical data consistency cleaning method
CN108776697B (en) Multi-source data set cleaning method based on predicates
Ali et al. Duplicates detection within incomplete data sets using blocking and dynamic sorting key methods

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant