CN109634949B - Mixed data cleaning method based on multiple data versions - Google Patents
- Publication number
- CN109634949B CN109634949B CN201811628044.3A CN201811628044A CN109634949B CN 109634949 B CN109634949 B CN 109634949B CN 201811628044 A CN201811628044 A CN 201811628044A CN 109634949 B CN109634949 B CN 109634949B
- Authority
- CN
- China
- Prior art keywords
- data
- cleaning
- versions
- stage
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention discloses a mixed data cleaning method based on multiple data versions. The method uses a Markov logic network probabilistic graphical model together with the minimum-repair principle, combining qualitative and quantitative techniques into an efficient data cleaning method that detects and corrects erroneous structured data. The cleaning result removes dirty data that violates the rule constraints, satisfies the requirement of minimal change cost to the data set, and conforms to the statistical characteristics of the data. The present invention first divides the entire data set into blocks and groups according to the Markov logic indexing technique, and then performs a two-stage data cleaning. In the first stage, the data in each group are cleaned by introducing the credibility-score evaluation criterion, yielding multi-version data cleaning results; in the second stage, the multi-version results generated in the previous stage are fused by introducing the fusion-score evaluation criterion, producing the final unified cleaning result.
Description
Technical Field
The invention relates to a technology for cleaning error data in the field of computer databases, in particular to a mixed data cleaning method based on multiple data versions.
Background
The purpose of data cleaning is to find the content of a data set that is most likely to be erroneous and to provide a reliable way to correct it. Dirty data refers to the data in a data set that contains errors.
Nowadays, with the continuous emergence of new information publishing modes represented by social networks and electronic commerce, and with the rise of cloud computing and Internet-of-Things technologies, data is growing and accumulating at an unprecedented speed. In data analysis, the presence of dirty data not only leads to wrong decisions and unreliable analysis, but can also cause direct economic losses to a company. Data cleaning has therefore attracted great interest in both industry and academia. Data cleaning is the process of detecting and repairing erroneous data; its aims are to delete redundant information, correct existing errors, and maintain data consistency.
At present, scholars at home and abroad have produced a body of work on data cleaning methods. The mainstream methods can be roughly divided into two types, qualitative and quantitative: (1) qualitative methods mainly clean error data that violates integrity constraint rules; their evaluation criterion is the minimum-cost principle, i.e., minimizing the change that cleaning makes to the data set. Their drawback is that they cannot clean error data that violates the integrity constraints yet does not satisfy the minimum-cost principle. (2) quantitative methods build a model of the probability distribution of the data to decide the cleaning strategy. Their drawbacks are a strong dependence on a training set (enough clean, known data must be provided to build a reliable model, which is unsuitable for the current big-data environment), cleaned results that usually perform worse than those of qualitative methods, and long running times.
Disclosure of Invention
In order to overcome these defects, the invention provides a mixed data cleaning method based on multiple data versions. The method is based on a Markov logic network: the whole data set is first divided into blocks and groups according to the Markov logic index technique, and a two-stage data cleaning is then carried out. In the first stage, each block is cleaned independently, yielding multi-version data cleaning results; in the second stage, conflicts among the multi-version results are eliminated to obtain the final overall unified cleaning result. The Markov logic index technique narrows the detection range of dirty data, so the data cleaning can be performed efficiently.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows: a mixed data cleaning method based on multiple data versions comprises the following steps:
(1) obtaining a dirty data set and its associated integrity constraint rules (ICs);
(2) converting the different types of integrity constraint rules into Markov logic network standardized rules, and instantiating the converted standardized rules with the constants contained in the tuples of the dirty data set, wherein each instantiated rule is called a data slice;
(3) establishing a Markov logic index structure for a dirty data set, dividing the dirty data set into different data blocks according to rules, wherein each rule corresponds to one data block, the minimum unit in each data block is a data slice, and then dividing each data block into different data groups again;
(4) on the basis of step (3), carrying out the first-stage cleaning: the credibility-score evaluation criterion is introduced, and each data group is cleaned independently to obtain the data versions of a plurality of preliminary cleaning results;
(5) executing the second-stage cleaning: the fusion-score evaluation criterion is introduced, the data versions of the plurality of preliminary cleaning results generated in the first stage are fused, conflicts among the multiple versions are eliminated, and the final unified cleaning result is generated;
(6) marking the repeated items existing in the dirty data set, and deleting the repeated data still present after the two-stage cleaning;
(7) outputting the data set after data cleaning.
Further, the step (2) is specifically as follows:
(2.1) normalizing the input different types of integrity constraints into Markov logic network rules through conjunctive normal form conversion rules;
(2.2) replacing all variables in the normalized rules with the corresponding constants of the data set, as sketched below.
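A minimal sketch of step (2) in code, assuming a simple in-memory representation of rules and tuples; the names Rule, normalize_fd and ground are illustrative, not from the patent:

```python
# Illustrative sketch of step (2): normalizing an FD-style constraint into a
# weighted Markov logic rule and grounding it over the data set's constants.
class Rule:
    def __init__(self, reason_attrs, result_attrs, weight=1.0):
        self.reason_attrs = reason_attrs   # antecedent ("reason") attributes
        self.result_attrs = result_attrs   # consequent ("result") attributes
        self.weight = weight               # learned Markov logic weight

def normalize_fd(lhs, rhs):
    """An FD X -> Y normalizes to a clausal Markov logic rule; here we keep
    only the reason/result attribute lists that the later index needs."""
    return Rule(reason_attrs=list(lhs), result_attrs=list(rhs))

def ground(rule, tuples):
    """Instantiate the rule with each tuple's constants; every grounding
    is one 'data slice'."""
    return [{"tid": t["tid"],
             "reason": {a: t[a] for a in rule.reason_attrs},
             "result": {a: t[a] for a in rule.result_attrs},
             "rule": rule}
            for t in tuples]
```

For the city-determines-state rule described later, `normalize_fd(["CT"], ["ST"])` grounded over Table 1 would yield one slice per tuple, e.g. {CT: DOTHAN, ST: AL} for t3.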
Further, the step (3) is specifically as follows:
(3.1) dividing the whole dirty data set into a plurality of data blocks according to integrity constraint rules contained in the dirty data set, wherein each rule corresponds to one data block, and each data block contains a plurality of data slices;
(3.2) in each data block, dividing the entries whose attributes share the same key into the same group; the key is the reason item (antecedent) of the rule, and data slices with the same reason are placed in one group, as sketched below.
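A sketch of the Markov logic index of step (3), building on the hypothetical `ground()` above: one block per rule, and inside each block one group per reason-attribute key. The function name build_index is illustrative:

```python
# Step (3) sketch: the index maps each rule to a data block; within a block,
# slices that share the same reason-attribute values fall into one group.
from collections import defaultdict

def build_index(rules, tuples):
    index = {}
    for rule in rules:
        block = defaultdict(list)          # reason key -> group of data slices
        for s in ground(rule, tuples):
            key = tuple(sorted(s["reason"].items()))
            block[key].append(s)           # same reason -> same group
        index[rule] = block                # one data block per rule
    return index
```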
Further, the step (4) is specifically as follows:
(4.1) processing anomalous data: when a data error in a reason item causes the corresponding data slice to be assigned to the wrong group, the phenomenon is called an "anomaly"; the erroneous data slices are then re-assigned to their correct groups;
(4.2) calculating a credibility score (r-score) for the anomalous data in each group according to a similarity distance measure and the Markov logic network weight learning method;
(4.3) cleaning each data block independently: the unit of cleaning is each group in the data block; the data slice γ with the largest credibility score is selected as the replacement reference, and the other suspicious data in the same data group are replaced with it, until every data group in the block has been cleaned and the independent cleaning of the data block is complete;
the other data blocks are cleaned in the same way, and the multiple preliminary cleaning results obtained in this stage are regarded as multiple data versions, one data version per data block, as sketched below.
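A sketch of the first-stage cleaning of step (4): inside every group, the slice with the highest credibility score becomes the reference and the result values of the other, suspicious slices are overwritten with its values. Here `r_score` is passed in as a stand-in for the patent's credibility formula:

```python
# Step (4.3) sketch: group-by-group repair within one data block.
def clean_block(block, r_score):
    for group in block.values():
        ref = max(group, key=r_score)              # slice with the largest r-score
        for s in group:
            if s is not ref:
                s["result"] = dict(ref["result"])  # repair toward the reference
    return block                                   # one cleaned block = one data version
```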
Further, the step (5) is specifically as follows:
(5.1) first, recording the data slices of all the different data versions at each conflicting position as references; then, starting from each reference, finding in every data block other than the reference's own block the data slice that does not conflict with the reference and has the largest Markov weight, and merging it with the reference;
(5.2) repeatedly executing the merging operation until all the data blocks have been traversed; then calculating the fusion score of the merged result under this reference as f-score(t) = w1 × w2 × … × wm, where wi represents the Markov weight of the merged data slice in the i-th data block;
(5.3) selecting another reference as the start and executing the merging operation again, calculating and recording the corresponding fusion score, until the fusion scores of the merged results under all the different references are obtained; then selecting the merged result with the largest fusion score as the final globally unified cleaning result of the tuple, as sketched below.
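A sketch of steps (5.1)-(5.3): each conflicting slice serves once as the reference; from every other block the heaviest slice compatible with the partial result is merged in; the candidate's f-score is the product of the merged Markov weights. Slices are represented here as dicts with "values", "weight" and "block" keys, and `conflicts()` is a hypothetical helper:

```python
# Step (5) sketch: conflict resolution across data versions by f-score.
from math import prod

def conflicts(slice_, merged):
    """True if the slice assigns a different value to an already-merged attribute."""
    return any(a in merged and merged[a] != v for a, v in slice_["values"].items())

def fuse(references, blocks):
    """references: conflicting slices, each tagged with its block id.
    blocks: dict mapping block id -> list of candidate slices."""
    best, best_score = None, 0.0
    for ref in references:                           # step (5.1): each version as reference
        merged = dict(ref["values"])
        weights = [ref["weight"]]
        feasible = True
        for bid, slices in blocks.items():
            if bid == ref["block"]:                  # skip the reference's own block
                continue
            candidates = [s for s in slices if not conflicts(s, merged)]
            if not candidates:                       # no qualifying slice: merge fails
                feasible = False
                break
            top = max(candidates, key=lambda s: s["weight"])
            merged.update(top["values"])
            weights.append(top["weight"])
        score = prod(weights) if feasible else 0.0   # f-score(t) = w1 x ... x wm
        if score > best_score:                       # step (5.3): keep the max f-score
            best, best_score = merged, score
    return best
```

This mirrors the t3 example worked through in the detailed description below, where the merge based on α1 fails against B3 and scores 0 while the merge based on α2 scores 0.0678 and is kept.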
Further, the step (6) is specifically as follows: after the two-stage cleaning is completed, the whole data set is scanned and a hash table is built over its tuples; when a duplicate entry is scanned, it is removed.
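A sketch of the duplicate removal of step (6): one scan over the cleaned data set with a hash set keyed on each tuple's attribute values. The function name remove_duplicates is illustrative:

```python
# Step (6) sketch: single-pass hash-based deduplication.
def remove_duplicates(tuples, attrs):
    seen, unique = set(), []
    for t in tuples:
        key = tuple(t[a] for a in attrs)   # hash key over all attributes
        if key not in seen:                # keep only the first occurrence
            seen.add(key)
            unique.append(t)
    return unique
```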
The invention has the following beneficial effects: it is a mixed data cleaning method combining qualitative and quantitative techniques. It unifies multiple types of integrity constraints through Markov logic network rules, and introduces Markov logic network weight learning together with a similarity distance measure as the joint basis for cleaning, so that the cleaning results satisfy both the minimum-cost principle required by qualitative techniques and the statistical characteristics required by quantitative techniques. In addition, the optimization designed by the invention, the Markov logic index, narrows the detection range of dirty data and shortens the running time of data cleaning. Experiments on real and synthetic data sets show higher cleaning efficiency and cleaning precision than currently popular systems.
Drawings
FIG. 1 is a flow chart of the steps of carrying out the present invention;
FIG. 2(a) is a diagram of the Markov logic network index structure formed for the hospital data set according to rule r1 (FD);
FIG. 2(b) is a diagram of the Markov logic network index structure formed for the hospital data set according to rule r2 (DC);
FIG. 2(c) is a diagram of the Markov logic network index structure formed for the hospital data set according to rule r3 (CFD): HN["ELIZA"], CT["BOAZ"] => PN["2567688400"];
FIG. 3(a) is a schematic diagram of the Markov logic network index structure corresponding to rule r1 after the first-stage cleaning;
FIG. 3(b) is a schematic diagram of the Markov logic network index structure corresponding to rule r2 after the first-stage cleaning;
FIG. 3(c) is a schematic diagram of the Markov logic network index structure corresponding to rule r3 after the first-stage cleaning;
FIG. 4 is a schematic diagram of the second-stage cleaning process.
Detailed Description
The technical solution of the present invention will be further explained with reference to the accompanying drawings and specific implementation:
as shown in fig. 1, the specific implementation process and the working principle of the present invention are as follows:
step (1): inputting an Integrity Constraint (IC) in a framework and a data set with dirty data into the framework; the dirty data set and integrity constraints are illustrated below in table 1:
table 1 shows a hospital information data set record containing 4 attributes, namely Hospital Name (HN), City (CT), state of ownership (ST), and contact address (PN), and the grey shading in table 1 indicates error data. Given three integrity constraints:
where D represents the data set and t1, t2 represent two different tuples. The functional dependency (FD) rule r1 states that a city can belong to only one state; the denial constraint (DC) rule r2 states that hospitals in different states have different telephone numbers; the conditional functional dependency (CFD) rule r3 states that the hospital name together with the corresponding city and state determines the telephone number of the hospital.
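The three constraint formulas themselves do not survive in this text; based on the descriptions above and the r3 instance quoted for FIG. 2(c), a plausible reconstruction (not the original notation) is:

```latex
% Plausible reconstruction of the three constraints described above.
\begin{align*}
(r_1)\ \mathrm{FD}:\quad  & \forall t_1, t_2 \in D:\ t_1[\mathrm{CT}] = t_2[\mathrm{CT}]
                            \Rightarrow t_1[\mathrm{ST}] = t_2[\mathrm{ST}] \\
(r_2)\ \mathrm{DC}:\quad  & \forall t_1, t_2 \in D:\ \neg\big(t_1[\mathrm{ST}] \neq t_2[\mathrm{ST}]
                            \wedge t_1[\mathrm{PN}] = t_2[\mathrm{PN}]\big) \\
(r_3)\ \mathrm{CFD}:\quad & \mathrm{HN}[\text{``ELIZA''}] \wedge \mathrm{CT}[\text{``BOAZ''}]
                            \Rightarrow \mathrm{PN}[\text{``2567688400''}]
\end{align*}
```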
Table 1:
step (2): and converting the integrity constraint rules of different types into Markov logic network standardization rules, and instantiating the converted standardization rules by using constants contained in tuples in the dirty data set, wherein each instantiation rule is called a data slice.
The method comprises the following specific steps:
1) standardizing the input different types of integrity constraints into Markov logic network rules through conjunctive normal form conversion rules;
2) variables in the normalized rule are replaced with constants of the data set.
And (3): establish a Markov logic index structure for the dirty data set: the dirty data set is divided into different data blocks according to the rules, each rule corresponding to one data block whose minimum unit is the data slice, and each data block is then further divided into different groups. The specific steps are:
1) the integrity constraint rules contained in the dirty data set divide the whole dirty data set into a plurality of data blocks; each rule corresponds to one data block, and each data block contains a plurality of data slices γ;
2) in each data block, the entries having the same key in their attributes are divided into the same group, where the key is the reason item of the rule; data slices γ with the same reason are placed in one group.
The following describes the Markov logic network index construction with reference to fig. 2(a), fig. 2(b), and fig. 2(c) as an example:
Taking the data set of Table 1 as an example, given the constraint rules relating HN, CT, ST and PN, the data set is correspondingly divided into three blocks B1, B2, B3 according to the three rules, with the reason attributes and result attributes of each constraint rule kept distinct. Next, the three blocks are grouped: data slices whose reason-attribute keys are identical are placed in the same group; for example, the three data slices in group G13 of B1 share the same reason keyword and are therefore grouped together. The Markov logic network index structure corresponding to B1 is shown in FIG. 2(a), that of B2 in FIG. 2(b), and that of B3 in FIG. 2(c);
and (4): on the basis of the step (3), performing a first-stage cleaning, introducing evaluation criteria of credibility scores, and cleaning a plurality of data versions (each data version is from different blocks) independently for each data group, specifically as follows:
1) processing anomalous data. When a data error in a reason item causes the corresponding data slice to be assigned to the wrong group, the phenomenon is called an "anomaly"; the erroneous data slices are then re-assigned to their correct groups;
2) calculating the credibility score (r-score) of the anomalous data in each group according to the similarity distance measure and the Markov logic network weight learning method, where d(γi, γ*) denotes the distance between a data slice γi and its candidate replacement data γ*, and w(γi) is the Markov weight of the slice γi; one assumed form of this score is sketched after step 3) below.
3) Each data block is cleaned independently. Specifically, the unit of cleaning is each group in the data block: the data slice γ with the highest credibility score is selected as the reference for replacement, and the other suspicious data in the same group are replaced with it. When every group in the data block has been cleaned, the independent cleaning of that block is complete. The other data blocks are cleaned in the same way, and the multiple preliminary cleaning results obtained in this stage are regarded as multiple data versions, one data version per data block. The Markov logic index structures after this stage of cleaning are shown in FIGS. 3(a), 3(b) and 3(c).
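The r-score formula referenced in step 2) is not reproduced in this text. Read purely from the quantities named there, a natural shape for it, offered only as an assumption and not as the patent's exact formula, is a score that rises with the slice's Markov weight and falls with its distance to the candidate replacement:

```python
# Assumed shape of the credibility score, NOT the patent's exact formula:
# higher Markov weight w(gamma_i) raises the score, larger distance
# d(gamma_i, gamma*) to the candidate replacement lowers it.
def r_score(w_gamma, d_gamma_star):
    return w_gamma / (1.0 + d_gamma_star)
```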
And (5): since the first-stage cleaning produces multiple data versions, conflicts may arise between different versions, i.e., the same position in the data set receives different cleaning results in different versions. Therefore, the fusion-score evaluation criterion is introduced to eliminate the multi-version data conflicts and obtain the final overall unified cleaning result.
Take the tuple t3 in Table 1 as an example. After the first-stage cleaning, the data slice related to t3 in B1 is {CT: DOTHAN, ST: AL} (the first data version), whereas the data slice related to t3 in B3 is {HN: ELIZA, CT: BOAZ, PN: 2567688400} (the third data version). Clearly, after the first-stage cleaning, t3.[CT] is associated with two different values ("DOTHAN" and "BOAZ") coming from two different data versions. In other words, t3 carries a conflict on the attribute CT, and the conflict must be resolved to obtain a final consistent cleaning result.
The method comprises the following steps:
1) All tuples containing conflicts are detected, and the data slice where each conflict is located is recorded. As shown in FIG. 4, t3 corresponds to two conflicting data slices, α1 ∈ B1 and α2 ∈ B2, each of which serves as the reference for generating a different candidate scheme.
2) The merging of corresponding data slices across different data blocks is performed for each reference. Two cases must be considered: if a data slice to be merged does not conflict with the reference, it is merged directly; if there is a conflict, another data slice, one that does not conflict with the reference and has the largest Markov weight, must be found in the block of the data slice to be merged, after which the merge is performed, the merged new data slice becomes the reference, and the above steps are repeated until all data blocks have been merged. Note that if no qualifying data slice can be found during merging, the merge is considered impossible under that reference.
3) After step 2), several possible candidate schemes have been generated for each tuple containing a conflict. The fusion score (f-score) is introduced to score each candidate scheme, and the highest-scoring candidate is selected as the final result, where the fusion score is computed as f-score(t) = w1 × w2 × … × wm. As shown in FIG. 4, for the merging scheme based on α1 ∈ B1, no qualifying data slice can be found when merging the corresponding slice of B3, so the merge cannot be completed under this reference and f-score(t3) = 0. With α2 ∈ B2 as the basis, the merged result is t3 = {HN: ELIZA, CT: BOAZ, ST: AL, PN: 2567688400}, with a corresponding f-score(t3) = 0.0678. The second merging scheme is therefore taken as the final cleaning result of t3.
And (6): after the two-stage cleaning is completed, the whole data set is scanned and a hash table is built over its tuples; when a duplicate entry is scanned, it is removed.
And (7): the data set after data cleaning is output.
Claims (6)
1. A mixed data cleaning method based on multiple data versions is characterized by comprising the following steps:
(1) obtaining a dirty data set and its associated integrity constraint rules (ICs);
(2) converting the different types of integrity constraint rules into Markov logic network standardized rules, and instantiating the converted standardized rules with the constants contained in the tuples of the dirty data set, wherein each instantiated rule is called a data slice;
(3) establishing a Markov logic index structure for a dirty data set, dividing the dirty data set into different data blocks according to rules, wherein each rule corresponds to one data block, the minimum unit in each data block is a data slice, and then dividing each data block into different data groups again;
(4) on the basis of step (3), carrying out the first-stage cleaning: the credibility-score evaluation criterion is introduced, and each data group is cleaned independently to obtain the data versions of a plurality of preliminary cleaning results;
(5) executing the second-stage cleaning: the fusion-score evaluation criterion is introduced, the data versions of the plurality of preliminary cleaning results generated in the first stage are fused, conflicts among the multiple versions are eliminated, and the final unified cleaning result is generated;
(6) marking the repeated items existing in the dirty data set, and deleting the repeated data still present after the two-stage cleaning;
(7) outputting the data set after data cleaning.
2. The hybrid data cleansing method based on multiple data versions according to claim 1, characterized in that: the step (2) is specifically as follows:
(2.1) normalizing the input different types of integrity constraints into Markov logic network rules through conjunctive normal form conversion rules;
(2.2) replacing all variables in the normalized rules with the corresponding constants of the data set.
3. The hybrid data cleansing method based on multiple data versions according to claim 1, characterized in that: the step (3) is specifically as follows:
(3.1) dividing the whole dirty data set into a plurality of data blocks according to integrity constraint rules contained in the dirty data set, wherein each rule corresponds to one data block, and each data block contains a plurality of data slices;
(3.2) in each data block, dividing the entries whose attributes share the same key into the same group; the key is the reason item (antecedent) of the rule, and data slices with the same reason are placed in one group.
4. The hybrid data cleansing method based on multiple data versions according to claim 1, characterized in that: the step (4) is specifically as follows:
(4.1) processing anomalous data: when a data error in a reason item causes the corresponding data slice to be assigned to the wrong group, the phenomenon is called an "anomaly"; the erroneous data slices are then re-assigned to their correct groups;
(4.2) calculating a credibility score (r-score) for the anomalous data in each group according to a similarity distance measure and the Markov logic network weight learning method;
(4.3) cleaning each data block independently: the unit of cleaning is each group in the data block; the data slice γ with the largest credibility score is selected as the replacement reference, and the other suspicious data in the same data group are replaced with it, until every data group in the block has been cleaned and the independent cleaning of the data block is complete;
the other data blocks are cleaned in the same way, and the multiple preliminary cleaning results obtained in this stage are regarded as multiple data versions, one data version per data block.
5. The hybrid data cleansing method based on multiple data versions according to claim 1, characterized in that: the step (5) is specifically as follows:
(5.1) first, recording the data slices of all the different data versions at each conflicting position as references; then, starting from each reference, finding in every data block other than the reference's own block the data slice that does not conflict with the reference and has the largest Markov weight, and merging it with the reference;
(5.2) repeatedly executing the merging operation until all the data blocks have been traversed; then calculating the fusion score of the merged result under this reference as f-score(t) = w1 × w2 × … × wm, where wi represents the Markov weight of the merged data slice in the i-th data block;
(5.3) selecting another reference as the start and executing the merging operation again, calculating and recording the corresponding fusion score, until the fusion scores of the merged results under all the different references are obtained; then selecting the merged result with the largest fusion score as the final globally unified cleaning result of the tuple.
6. The hybrid data cleansing method based on multiple data versions according to claim 1, characterized in that: the step (6) is specifically as follows: after the two-stage cleaning is completed, the whole data set is scanned and a hash table is built over its tuples; when duplicate entries are scanned, they are removed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811628044.3A CN109634949B (en) | 2018-12-28 | 2018-12-28 | Mixed data cleaning method based on multiple data versions |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811628044.3A CN109634949B (en) | 2018-12-28 | 2018-12-28 | Mixed data cleaning method based on multiple data versions |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109634949A CN109634949A (en) | 2019-04-16 |
CN109634949B (en) | 2022-04-12
Family
ID=66079015
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811628044.3A Active CN109634949B (en) | 2018-12-28 | 2018-12-28 | Mixed data cleaning method based on multiple data versions |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109634949B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110287191B (en) * | 2019-06-25 | 2021-07-27 | 北京明略软件系统有限公司 | Data alignment method and device, storage medium and electronic device |
CN110968576A (en) * | 2019-11-28 | 2020-04-07 | 哈尔滨工程大学 | Content correlation-based numerical data consistency cleaning method |
CN111291029B (en) * | 2020-01-17 | 2024-03-08 | 深圳市华傲数据技术有限公司 | Data cleaning method and device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA2919533A1 (en) * | 2012-08-01 | 2014-02-06 | Sherpa Technologies Inc. | System and method for managing versions of program assets |
CN105339940A (en) * | 2013-06-28 | 2016-02-17 | 甲骨文国际公司 | Naive, client-side sharding with online addition of shards |
CN106649644A (en) * | 2016-12-08 | 2017-05-10 | 腾讯音乐娱乐(深圳)有限公司 | Lyric file generation method and device |
CN108921399A (en) * | 2018-06-14 | 2018-11-30 | 北京新广视通科技有限公司 | A kind of intelligence direct management system and method |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180150543A1 (en) * | 2016-11-30 | 2018-05-31 | Linkedin Corporation | Unified multiversioned processing of derived data |
US10205735B2 (en) * | 2017-01-30 | 2019-02-12 | Splunk Inc. | Graph-based network security threat detection across time and entities |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA2919533A1 (en) * | 2012-08-01 | 2014-02-06 | Sherpa Technologies Inc. | System and method for managing versions of program assets |
CN105339940A (en) * | 2013-06-28 | 2016-02-17 | 甲骨文国际公司 | Naive, client-side sharding with online addition of shards |
CN106649644A (en) * | 2016-12-08 | 2017-05-10 | 腾讯音乐娱乐(深圳)有限公司 | Lyric file generation method and device |
CN108921399A (en) * | 2018-06-14 | 2018-11-30 | 北京新广视通科技有限公司 | A kind of intelligence direct management system and method |
Also Published As
Publication number | Publication date |
---|---|
CN109634949A (en) | 2019-04-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110008288B (en) | Construction method and application of knowledge map library for network fault analysis | |
CN109634949B (en) | Mixed data cleaning method based on multiple data versions | |
US8321383B2 (en) | System and method for automatic weight generation for probabilistic matching | |
Shivaji et al. | Reducing features to improve code change-based bug prediction | |
CN110263230B (en) | Data cleaning method and device based on density clustering | |
CN101986296B (en) | Noise data cleaning method based on semantic ontology | |
US20100235296A1 (en) | Flow comparison processing method and apparatus | |
Ge et al. | A hybrid data cleaning framework using markov logic networks | |
Kumar et al. | Attribute correction-data cleaning using association rule and clustering methods | |
CN110389950B (en) | Rapid running big data cleaning method | |
CN104268216A (en) | Data cleaning system based on internet information | |
Hao et al. | Cleaning relations using knowledge bases | |
CN104699796A (en) | Data cleaning method based on data warehouse | |
CN113487211A (en) | Nuclear power equipment quality tracing method and system, computer equipment and medium | |
US11321359B2 (en) | Review and curation of record clustering changes at large scale | |
Berko et al. | Knowledge-based Big Data cleanup method | |
Ciszak | Application of clustering and association methods in data cleaning | |
Monge | An adaptive and efficient algorithm for detecting approximately duplicate database records | |
US12052134B2 (en) | Identification of clusters of elements causing network performance degradation or outage | |
CN115185933A (en) | Multi-source manufacturing data preprocessing method for aerospace products | |
Zada et al. | Large-scale data integration using graph probabilistic dependencies (gpds) | |
Du et al. | Research on data cleaning technology based on RD-CFD method | |
CN110968576A (en) | Content correlation-based numerical data consistency cleaning method | |
CN108776697B (en) | Multi-source data set cleaning method based on predicates | |
Ali et al. | Duplicates detection within incomplete data sets using blocking and dynamic sorting key methods |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||