CN110069480A - A parallel data cleaning method - Google Patents

A parallel data cleaning method

Info

Publication number
CN110069480A
Authority
CN
China
Prior art keywords
data
constraint relationship
abnormal data
abnormal
constraint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910161073.1A
Other languages
Chinese (zh)
Other versions
CN110069480B (en)
Inventor
姚箐晨
陈德健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Heng Rui Technology Co Ltd
Original Assignee
Guangdong Heng Rui Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Heng Rui Technology Co Ltd
Priority to CN201910161073.1A
Publication of CN110069480A
Application granted
Publication of CN110069480B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24532 Query optimisation of parallel queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Abstract

The invention discloses a parallel data cleaning method. By constructing the overall architecture of a distributed parallel cleaning system, all data cells that violate constraint relationships, together with the corresponding constraints, form a conflict hypergraph that is used for data cleaning. Based on the positions of the data cells and the corresponding constraints in the conflict hypergraph, a fast data cleaning method suited to massive data is obtained. With the invention, data cleaning and repair are faster, algorithm complexity is lower, and the method is suitable for repairing large volumes of data.

Description

A parallel data cleaning method
Technical field
The present invention relates to the technical field of data processing, and in particular to a parallel data cleaning method.
Background art
Business and scientific data have become the most valuable assets of the current era. At the data source, however, errors caused by noise greatly reduce the value of the data: inaccurate extraction leads to missing data; merging data extracted from multiple sources leads to redundant data; and erroneous input by data providers causes data integrity constraints to no longer hold. These errors cause substantial economic losses every year. Providing data cleaning operations and improving data quality is the key to efficient data management.
Data cleaning includes the detection and correction of erroneous data. Many data cleaning algorithms already exist. One simple approach for global data chains together the cleaning algorithms for the different error types (missing data, redundant data, erroneous data), which minimizes the processing complexity for the global data. However, this approach ignores the interactions between the different error types in the global data, so the cleaning results are ultimately unsatisfactory.
The MapReduce programming framework can use parallelism to achieve highly scalable big data cleaning. MapReduce is a programming model for parallel computation over large data sets. Its central concepts, "Map" and "Reduce", are borrowed from functional programming languages, together with features borrowed from vector programming languages. It makes it much easier for programmers with no experience in distributed parallel programming to run their programs on a distributed system. In current implementations, the user specifies a Map function that maps a set of key-value pairs to a new set of key-value pairs, and a concurrent Reduce function that ensures all mapped key-value pairs sharing the same key are grouped together.
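As an illustration of the Map and Reduce roles described above, the following is a minimal single-process sketch in Python. The record layout, the functions `map_record`, `shuffle` and `reduce_group`, and the rule that one zip code should map to one city are hypothetical examples and are not taken from the method described here; a real deployment would run on a MapReduce framework rather than in one process.

```python
from collections import defaultdict

# Hypothetical records: (row id, zip code, city).
RECORDS = [
    (1, "510000", "Guangzhou"),
    (2, "510000", "Guangzhou"),
    (3, "510000", "Shenzhen"),   # conflicts with rows 1 and 2
    (4, "518000", "Shenzhen"),
]

def map_record(record):
    """Map: emit one (key, value) pair per record, keyed by zip code."""
    row_id, zip_code, city = record
    yield zip_code, (row_id, city)

def shuffle(pairs):
    """Group mapped pairs so that all values sharing a key end up together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_group(key, values):
    """Reduce: report the rows of a key group whose cities disagree."""
    cities = {city for _, city in values}
    if len(cities) > 1:
        return key, sorted(row_id for row_id, _ in values)
    return key, []

mapped = [pair for record in RECORDS for pair in map_record(record)]
conflicts = dict(reduce_group(k, v) for k, v in shuffle(mapped).items())
print(conflicts)
```

Because each key group is reduced independently, the groups could be processed on different workers, which is the property a parallel cleaning system relies on.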
Research on data cleaning first appeared in the United States, and countless algorithms have emerged since. As times change, erroneous data takes more diverse forms, and the growth of data volume places new requirements on the design of data cleaning algorithms. Many traditional data cleaning algorithms can no longer meet the needs of the big data era, so cleaning dirty data accurately and efficiently remains a topic worth studying. Data cleaning aims to identify and correct noise in data and to minimize the influence of noise on data analysis results. Noise in data mainly includes incomplete data, redundant data, conflicting data and erroneous data. Noise can be detected and eliminated by automatic algorithms, by data cleaning rules, or with user participation.
Today, advances in machine learning and crowdsourcing have brought new vitality to data cleaning research. Machine learning can learn rules for making cleaning decisions from user records, reducing the burden of labeling data on users. At the same time, the shift from cleaning rules to machine learning models means that users no longer need to write large numbers of data cleaning rules. Crowdsourcing publishes data cleaning tasks on the Internet to pool the knowledge and decisions of many users; it can make full use of external resources to improve the accuracy and efficiency of data cleaning while reducing its cost.
Existing data cleaning and repair methods usually use denial constraints: all data cells that do not satisfy the constraints are found, and these cells together with the corresponding constraints form a conflict hypergraph. A repair context is defined according to the relative positions of the edges and vertices in the conflict hypergraph, and the data is cleaned and repaired according to the context and the corresponding constraints. However, existing cleaning and repair methods handle massive data poorly, because high algorithm complexity makes data cleaning very inefficient; alternatively, they are algorithms designed for specific data and do not offer a unified overall approach to data cleaning.
Summary of the invention
The purpose of the present invention is to provide a parallel data cleaning method that overcomes the above deficiencies of the prior art.
The invention is realized through the following technical solution: a parallel data cleaning method, comprising:
Constructing an abnormal data retrieval model from the association relationships and constraint relationships among all data in a database; wherein the input of the abnormal data retrieval model is each data item in the database, which is compared against the data association relationships and constraint relationships in the model; if the input data fails to satisfy at least one of the association relationships and constraint relationships, the input data is treated as abnormal data, and the abnormal data together with all association relationships and constraint relationships it fails to satisfy is the output of the abnormal data retrieval model;
Constructing a hypergraph of the abnormal data from the abnormal data output by the abnormal data retrieval model and the constraint relationships that each abnormal data item fails to satisfy; wherein each unsatisfied constraint relationship is a hyperedge of the hypergraph, and the at least one abnormal data item that fails to satisfy that constraint relationship forms the violation units covered by the hyperedge;
Selecting, as the minimum vertex cover point, the violation unit that fails to satisfy the largest number of constraint relationships serving as hyperedges; among the constraint relationships that the violation unit of the minimum vertex cover point fails to satisfy, finding the constraint relationship with the most violation units that fail to satisfy only that constraint relationship, and taking it as the first hyperedge; performing, on the violation units covered by the first hyperedge, a negation operation with respect to the constraint relationship of the first hyperedge, so that after negation the violation units that failed to satisfy the constraint relationship of the first hyperedge become normal data and the first hyperedge is eliminated;
Iterating in a loop until all hyperedges are eliminated and only the violation unit of the minimum vertex cover point remains, then performing on the minimum vertex cover point a negation operation with respect to all of its current constraint relationships, completing the repair of all abnormal data.
Wherein the abnormal data retrieval model satisfies formula (1),
in which $x$ is any data item in the database that is input to the abnormal data retrieval model, $R_i$ is an association relationship involving the data $x$, and $P_i$ is a constraint relationship that the data $x$ satisfies. If the input data $x$ does not satisfy formula (1), $x$ is determined to be abnormal data, and the unsatisfied constraint relationships are determined at the same time.
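A hedged sketch of the shape formula (1) may take, assuming the model follows the standard denial-constraint notation that matches the symbols $x$, $R_i$ and $P_i$ defined above. The quantifier, the index bounds $n$ and $m$, and the per-atom variable lists $x_i$ are assumptions, not taken from this text:

```latex
\forall x \; \neg\bigl( R_1(x_1) \wedge \cdots \wedge R_n(x_n) \wedge P_1 \wedge \cdots \wedge P_m \bigr),
\qquad x = \bigcup_{i} x_i
\tag{1}
```

Under this reading, a data item is abnormal when some assignment makes every relation atom and every predicate inside the parentheses true at the same time.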
Wherein, in the step of building the conflict hypergraph, a constraint over the constraint relationships $P_i$ is set, the violation units $V = \{v_1, \ldots, v_n\}$ that fail to satisfy any constraint relationship $P_i$ are found, and matching the two yields the conflict hypergraph.
Wherein the step of selecting the first hyperedge comprises:
For each hyperedge, all violation units that fail to satisfy the constraint relationship corresponding to the hyperedge constitute the subdomain of that hyperedge;
Within the subdomain formed by the violation units enclosed by all hyperedges that contain the minimum vertex cover point, counting, for each hyperedge, the number of violation units that fail to satisfy only the constraint relationship corresponding to that one hyperedge;
Within the subdomain formed by the violation units enclosed by all hyperedges that contain the minimum vertex cover point, choosing as the first hyperedge the hyperedge with the largest number of violation units that fail to satisfy only the constraint relationship corresponding to that one hyperedge.
Wherein, after the step of eliminating the first hyperedge is completed, all violation units in the initial conflict hypergraph are input into the abnormal data retrieval model again for retrieval, so as to exclude the abnormal data that was corrected to normal data while the first hyperedge was eliminated, along with the association relationships and constraint relationships that this abnormal data had failed to satisfy.
Wherein the conflict hypergraph is rebuilt from the newly obtained abnormal data and the corresponding unsatisfied association relationships and constraint relationships, and the loop iterates: the minimum vertex cover point and the first hyperedge are found among the violation units formed by the new abnormal data, and the first hyperedge is eliminated, until the only abnormal data remaining is the minimum vertex cover point.
Wherein, when the only abnormal data remaining is the minimum vertex cover point, a negation operation is performed for all constraint relationships that the minimum vertex cover point fails to satisfy, and the data obtained after repairing the minimum vertex cover point is input into the abnormal data retrieval model to judge whether it is abnormal data.
Wherein, if the abnormal data retrieval model judges the data obtained after repairing the minimum vertex cover point to be normal data, the repair of all data in the database is complete; if the model judges that data to be abnormal data, the data is deleted, and the repair of all data in the database is complete.
Unlike the prior art, the parallel data cleaning method of the invention constructs the overall architecture of a distributed parallel cleaning system, forms a conflict hypergraph from all data cells that violate constraint relationships together with the corresponding constraints, and uses it for data cleaning; based on the positions of the data cells and the corresponding constraints in the conflict hypergraph, it yields a fast data cleaning method suited to massive data. With the invention, data cleaning and repair are faster, algorithm complexity is lower, and the method is suitable for repairing large volumes of data.
Brief description of the drawings
Fig. 1 is a flow diagram of a parallel data cleaning method provided by the invention.
Detailed description of the embodiments
In the following description, numerous specific details are set forth to facilitate a full understanding of the invention. However, the invention can be implemented in many ways other than those described here, and those skilled in the art can make similar extensions without departing from the spirit of the invention; the invention is therefore not limited to the specific embodiments disclosed below.
Furthermore, the invention is described in detail with reference to a schematic diagram. In describing the embodiments, the diagram is an example given for ease of explanation and should not limit the scope of protection of the invention.
Referring to Fig. 1, Fig. 1 is a flow diagram of a parallel data cleaning method provided by the invention. The steps of the method include:
S110: Construct an abnormal data retrieval model from the association relationships and constraint relationships among all data in a database; wherein the input of the abnormal data retrieval model is each data item in the database, which is compared against the data association relationships and constraint relationships in the model; if the input data fails to satisfy at least one of the association relationships and constraint relationships, the input data is treated as abnormal data, and the abnormal data together with all association relationships and constraint relationships it fails to satisfy is the output of the abnormal data retrieval model.
S120: Construct a hypergraph of the abnormal data from the abnormal data output by the abnormal data retrieval model and the constraint relationships that each abnormal data item fails to satisfy; wherein each unsatisfied constraint relationship is a hyperedge of the hypergraph, and the at least one abnormal data item that fails to satisfy that constraint relationship forms the violation units covered by the hyperedge.
S130: Select, as the minimum vertex cover point, the violation unit that fails to satisfy the largest number of constraint relationships serving as hyperedges; among the constraint relationships that the violation unit of the minimum vertex cover point fails to satisfy, find the constraint relationship with the most violation units that fail to satisfy only that constraint relationship, and take it as the first hyperedge; perform, on the violation units covered by the first hyperedge, a negation operation with respect to the constraint relationship of the first hyperedge, so that after negation the violation units that failed to satisfy the constraint relationship of the first hyperedge become normal data and the first hyperedge is eliminated.
S140: Iterate in a loop until all hyperedges are eliminated and only the violation unit of the minimum vertex cover point remains, then perform on the minimum vertex cover point a negation operation with respect to all of its current constraint relationships, completing the repair of all abnormal data.
Wherein the abnormal data retrieval model satisfies formula (1),
in which $x$ is any data item in the database that is input to the abnormal data retrieval model, $R_i$ is an association relationship involving the data $x$, and $P_i$ is a constraint relationship that the data $x$ satisfies. If the input data $x$ does not satisfy formula (1), $x$ is determined to be abnormal data, and the unsatisfied constraint relationships are determined at the same time.
The abnormal data retrieval model is based on denial constraints. A denial constraint is a first-order formula in which $x$ denotes the data cells, $R_i \in R$ is a relation atom, and each $P_i$ has the form $v_1\,\theta\,v_2$ with $\theta \in B$. For the similarity predicate $\approx$, the predicate takes effect when the edit distance between two strings is greater than a user-given threshold $\sigma$. Single-tuple constraints, functional dependencies, matching dependencies and conditional functional dependencies can all be expressed as denial constraints over tuples. Given a database instance $I$ of database schema $S$ and a denial constraint $\varphi$, if $I$ satisfies $\varphi$ we write $I \models \varphi$. Given a set of DCs $\Sigma$, $I \models \Sigma$ if and only if $I \models \varphi$ for every $\varphi \in \Sigma$. For a database instance $I$ that contains erroneous data, a repair $I'$ must satisfy the denial constraints $\Sigma$ on $I$ and have the same set of tuple identifiers as $I$. Attribute values in $I$ and $I'$ may differ, and over the attribute domains in $R$ there can be different repairs $I'$. A repair consists of new attribute values. In a repair, each new value FV of an attribute $A$ can be replaced by a value from $Dom(A) \setminus Dom_a(A)$, where $Dom_a(A)$ is a domain of values that satisfy at least one predicate in each denial constraint involving FV. In other words, the new value must satisfy all constraints on the actual attribute.
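To make the retrieval step concrete, here is a minimal Python sketch of a detector that checks rows against denial constraints and returns the violating cells together with the constraints they violate. The table layout, the `Constraint` type, the constraint names and the two example rules are assumptions for illustration only; the source does not specify a data model or an implementation.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Dict, List, Set, Tuple

Cell = Tuple[int, str]          # (row id, attribute name)
Row = Dict[str, object]

@dataclass(frozen=True)
class Constraint:
    """A denial constraint: `forbidden` returns True when the rows violate it."""
    name: str
    arity: int                   # 1 = single tuple, 2 = pair of tuples
    attrs: Tuple[str, ...]       # attributes the predicates refer to
    forbidden: Callable[..., bool]

def detect(table: Dict[int, Row], constraints: List[Constraint]) -> Dict[str, Set[Cell]]:
    """Return, per violated constraint, the cells involved in some violation."""
    violations: Dict[str, Set[Cell]] = {}
    for c in constraints:
        # Unordered row combinations are enough for symmetric predicates.
        for ids in combinations(sorted(table), c.arity):
            if c.forbidden(*(table[i] for i in ids)):
                cells = {(i, a) for i in ids for a in c.attrs}
                violations.setdefault(c.name, set()).update(cells)
    return violations

TABLE = {
    1: {"zip": "510000", "city": "Guangzhou", "age": 30},
    2: {"zip": "510000", "city": "Shenzhen", "age": -4},
    3: {"zip": "518000", "city": "Shenzhen", "age": 41},
}
CONSTRAINTS = [
    # not (t1.zip = t2.zip and t1.city != t2.city), a functional dependency
    Constraint("zip_city", 2, ("zip", "city"),
               lambda t1, t2: t1["zip"] == t2["zip"] and t1["city"] != t2["city"]),
    # not (t.age < 0), a single-tuple constraint
    Constraint("age_nonneg", 1, ("age",), lambda t: t["age"] < 0),
]

found = detect(TABLE, CONSTRAINTS)
print(found)
```

The pairwise check is quadratic in the number of rows; in a distributed setting the rows would typically be partitioned by the attributes on the left-hand side of each constraint before checking.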
Wherein, in the step of building the conflict hypergraph, a constraint over the constraint relationships $P_i$ is set, the violation units $V = \{v_1, \ldots, v_n\}$ that fail to satisfy any constraint relationship $P_i$ are found, and matching the two yields the conflict hypergraph.
The conflict hypergraph is used to detect violations. Its nodes are the violation units, and each hyperedge connecting nodes corresponds to a violation involving those nodes. It represents the current state of the data under all constraint relationships. From this state, the interactions between different violations, and between different constraints, can be analyzed.
Given a denial constraint (DC) $d$ and the violation units $V = \{v_1, \ldots, v_n\}$ connected by its hyperedge, each $v_i \in V$ has at least one candidate repair of the form $v_i\,\theta\,t$, where $t$ is determined by the connections within $V$, that is, by all constraints that the violation unit $v_i$ violates.
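One simple way to hold a conflict hypergraph in memory is a pair of dictionaries: one from each hyperedge to the units it covers, and one from each unit to the hyperedges it lies on. The sketch below assumes the detector output is already available as a mapping from constraint name to violation units; the function name `build_hypergraph` and the labels `c1` to `c3` and `v1` to `v4` are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Set

def build_hypergraph(violations: Dict[str, Set[str]]):
    """Return (edges, incidence) for a conflict hypergraph.

    edges:     constraint name -> violation units covered by that hyperedge
    incidence: violation unit  -> hyperedges (constraints) the unit violates
    """
    # A constraint with no violating unit contributes no hyperedge.
    edges = {name: set(units) for name, units in violations.items() if units}
    incidence: Dict[str, Set[str]] = defaultdict(set)
    for name, units in edges.items():
        for unit in units:
            incidence[unit].add(name)
    return edges, dict(incidence)

# Hypothetical detector output.
VIOLATIONS = {
    "c1": {"v1", "v2"},
    "c2": {"v1", "v3", "v4"},
    "c3": {"v1", "v2"},
}
edges, incidence = build_hypergraph(VIOLATIONS)
print(incidence["v1"])   # the constraints that v1 violates
```

The incidence map is what makes the interaction between constraints visible: a unit with several incident hyperedges is a unit whose repair affects several constraints at once.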
Wherein the step of selecting the first hyperedge comprises:
For each hyperedge, all violation units that fail to satisfy the constraint relationship corresponding to the hyperedge constitute the subdomain of that hyperedge;
Within the subdomain formed by the violation units enclosed by all hyperedges that contain the minimum vertex cover point, counting, for each hyperedge, the number of violation units that fail to satisfy only the constraint relationship corresponding to that one hyperedge;
Within the subdomain formed by the violation units enclosed by all hyperedges that contain the minimum vertex cover point, choosing as the first hyperedge the hyperedge with the largest number of violation units that fail to satisfy only the constraint relationship corresponding to that one hyperedge.
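The selection steps above can be sketched as follows. This is a minimal illustration, assuming the hypergraph is given as a mapping from hyperedge name to covered violation units; the function name `select_first_hyperedge` and the tie-breaking by name are assumptions, since the text does not say how ties are resolved.

```python
from typing import Dict, Set, Tuple

def select_first_hyperedge(edges: Dict[str, Set[str]]) -> Tuple[str, str]:
    """Pick the cover point and the first hyperedge from a conflict hypergraph.

    edges maps each hyperedge (constraint name) to the violation units it covers.
    """
    # Incidence: which hyperedges each violation unit lies on.
    incidence: Dict[str, Set[str]] = {}
    for name, units in edges.items():
        for unit in units:
            incidence.setdefault(unit, set()).add(name)

    # Cover point: the unit that violates the most constraints.
    # Ties are broken by name so the result is deterministic (an assumption).
    cover = min(incidence, key=lambda u: (-len(incidence[u]), u))

    # For a hyperedge through the cover point, count the units that
    # violate only that one constraint.
    def exclusive_count(name: str) -> int:
        return sum(1 for u in edges[name] if incidence[u] == {name})

    first = min(incidence[cover], key=lambda e: (-exclusive_count(e), e))
    return cover, first

EDGES = {
    "c1": {"v1", "v2"},
    "c2": {"v1", "v3", "v4"},
    "c3": {"v1", "v2"},
}
cover, first = select_first_hyperedge(EDGES)
print(cover, first)
```

In this example `v1` lies on all three hyperedges, and `c2` is chosen first because `v3` and `v4` violate nothing else, so repairing along `c2` turns the most units into normal data in one step.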
The constraint violation to repair first is selected according to the number of hyperedges that each violation unit in the conflict hypergraph belongs to; the violation units that violate the same constraint are then repaired together according to the corresponding constraint and repair rule.
When different violation units are enclosed by one hyperedge of the hypergraph, they all violate the constraint relationship corresponding to that hyperedge, so the units within the hyperedge are handled in a similar way. During the cleaning of violating data, violation units belonging to the same hyperedge are processed together.
Wherein, after the step of eliminating the first hyperedge is completed, all violation units in the initial conflict hypergraph are input into the abnormal data retrieval model again for retrieval, so as to exclude the abnormal data that was corrected to normal data while the first hyperedge was eliminated, along with the association relationships and constraint relationships that this abnormal data had failed to satisfy.
Wherein the conflict hypergraph is rebuilt from the newly obtained abnormal data and the corresponding unsatisfied association relationships and constraint relationships, and the loop iterates: the minimum vertex cover point and the first hyperedge are found among the violation units formed by the new abnormal data, and the first hyperedge is eliminated, until the only abnormal data remaining is the minimum vertex cover point.
Wherein, when the only abnormal data remaining is the minimum vertex cover point, a negation operation is performed for all constraint relationships that the minimum vertex cover point fails to satisfy, and the data obtained after repairing the minimum vertex cover point is input into the abnormal data retrieval model to judge whether it is abnormal data.
Wherein, if the abnormal data retrieval model judges the data obtained after repairing the minimum vertex cover point to be normal data, the repair of all data in the database is complete; if the model judges that data to be abnormal data, the data is deleted, and the repair of all data in the database is complete.
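The iteration described above (eliminate the first hyperedge, re-run detection, rebuild the hypergraph, and finally repair or delete the last cover point) can be sketched end to end. This is a simplified illustration under several assumptions: every constraint applies to a single integer value, the "negation operation" is modeled by a user-supplied `repair` function because the text does not define it concretely, and the `Rule` type, the rule names and the `max_rounds` guard are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set

@dataclass(frozen=True)
class Rule:
    """One constraint relationship over a single value (a simplifying assumption)."""
    violated: Callable[[int], bool]   # True when the value breaks the constraint
    repair: Callable[[int], int]      # stand-in for the "negation operation"

def detect(data: Dict[str, int], rules: Dict[str, Rule]) -> Dict[str, Set[str]]:
    """Retrieval model: hyperedge (rule name) -> violation units."""
    edges = {n: {u for u, v in data.items() if r.violated(v)} for n, r in rules.items()}
    return {n: units for n, units in edges.items() if units}

def clean(data: Dict[str, int], rules: Dict[str, Rule], max_rounds: int = 100) -> Dict[str, int]:
    data = dict(data)
    for _ in range(max_rounds):               # guard against repairs that never converge
        edges = detect(data, rules)           # re-run detection, rebuild the hypergraph
        if not edges:
            break
        incidence: Dict[str, Set[str]] = {}
        for name, units in edges.items():
            for unit in units:
                incidence.setdefault(unit, set()).add(name)
        cover = min(incidence, key=lambda u: (-len(incidence[u]), u))
        if set(incidence) == {cover}:
            # Only the cover point is left: repair it against all of its
            # constraints, re-check it, and delete it if it is still abnormal.
            for name in sorted(incidence[cover]):
                data[cover] = rules[name].repair(data[cover])
            if any(r.violated(data[cover]) for r in rules.values()):
                del data[cover]
            break
        # First hyperedge: the one through the cover point with the most
        # units that violate only that constraint.
        first = min(incidence[cover],
                    key=lambda e: (-sum(incidence[u] == {e} for u in edges[e]), e))
        for unit in edges[first]:
            data[unit] = rules[first].repair(data[unit])
    return data

RULES = {
    "nonneg": Rule(lambda v: v < 0, abs),
    "max100": Rule(lambda v: v > 100, lambda v: 100),
    "even": Rule(lambda v: v % 2 == 1, lambda v: v + 1),
}
DATA = {"a": -7, "b": -2, "c": 250, "d": 4, "e": -6}
cleaned = clean(DATA, RULES)
print(cleaned)
```

In this run the `nonneg` hyperedge is eliminated first because two units violate only that rule, then `even`, and the last remaining unit `c` is repaired and kept because it passes re-detection. A unit whose repair does not help, for example under a rule whose `repair` returns its input unchanged, is deleted by the final branch.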
Unlike the prior art, the parallel data cleaning method of the invention constructs the overall architecture of a distributed parallel cleaning system, forms a conflict hypergraph from all data cells that violate constraint relationships together with the corresponding constraints, and uses it for data cleaning; based on the positions of the data cells and the corresponding constraints in the conflict hypergraph, it yields a fast data cleaning method suited to massive data. With the invention, data cleaning and repair are faster, algorithm complexity is lower, and the method is suitable for repairing large volumes of data.
Although the invention has been disclosed above by way of preferred embodiments, they are not intended to limit the invention. Any person skilled in the art may, without departing from the spirit and scope of the invention, use the methods and technical content disclosed above to make possible changes and modifications to the technical solution of the invention. Therefore, any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the invention, without departing from the content of its technical solution, falls within the scope of protection of the technical solution of the invention.

Claims (8)

1. A parallel data cleaning method, characterized by comprising:
Constructing an abnormal data retrieval model from the association relationships and constraint relationships among all data in a database; wherein the input of the abnormal data retrieval model is each data item in the database, which is compared against the data association relationships and constraint relationships in the model; if the input data fails to satisfy at least one of the association relationships and constraint relationships, the input data is treated as abnormal data, and the abnormal data together with all association relationships and constraint relationships it fails to satisfy is the output of the abnormal data retrieval model;
Constructing a hypergraph of the abnormal data from the abnormal data output by the abnormal data retrieval model and the constraint relationships that each abnormal data item fails to satisfy; wherein each unsatisfied constraint relationship is a hyperedge of the hypergraph, and the at least one abnormal data item that fails to satisfy that constraint relationship forms the violation units covered by the hyperedge;
Selecting, as the minimum vertex cover point, the violation unit that fails to satisfy the largest number of constraint relationships serving as hyperedges; among the constraint relationships that the violation unit of the minimum vertex cover point fails to satisfy, finding the constraint relationship with the most violation units that fail to satisfy only that constraint relationship, and taking it as the first hyperedge; performing, on the violation units covered by the first hyperedge, a negation operation with respect to the constraint relationship of the first hyperedge, so that after negation the violation units that failed to satisfy the constraint relationship of the first hyperedge become normal data and the first hyperedge is eliminated;
Iterating in a loop until all hyperedges are eliminated and only the violation unit of the minimum vertex cover point remains, then performing on the minimum vertex cover point a negation operation with respect to all of its current constraint relationships, completing the repair of all abnormal data.
2. The parallel data cleaning method according to claim 1, characterized in that the abnormal data retrieval model satisfies formula (1),
in which $x$ is any data item in the database that is input to the abnormal data retrieval model, $R_i$ is an association relationship involving the data $x$, and $P_i$ is a constraint relationship that the data $x$ satisfies; if the input data $x$ does not satisfy formula (1), $x$ is determined to be abnormal data, and the unsatisfied constraint relationships are determined at the same time.
3. The parallel data cleaning method according to claim 1, characterized in that, in the step of building the conflict hypergraph, a constraint over the constraint relationships $P_i$ is set, the violation units $V = \{v_1, \ldots, v_n\}$ that fail to satisfy any constraint relationship $P_i$ are found, and matching the two yields the conflict hypergraph.
4. The parallel data cleaning method according to claim 1, characterized in that the step of selecting the first hyperedge comprises:
For each hyperedge, all violation units that fail to satisfy the constraint relationship corresponding to the hyperedge constitute the subdomain of that hyperedge;
Within the subdomain formed by the violation units enclosed by all hyperedges that contain the minimum vertex cover point, counting, for each hyperedge, the number of violation units that fail to satisfy only the constraint relationship corresponding to that one hyperedge;
Within the subdomain formed by the violation units enclosed by all hyperedges that contain the minimum vertex cover point, choosing as the first hyperedge the hyperedge with the largest number of violation units that fail to satisfy only the constraint relationship corresponding to that one hyperedge.
5. The parallel data cleaning method according to claim 4, characterized in that, after the step of eliminating the first hyperedge is completed, all violation units in the initial conflict hypergraph are input into the abnormal data retrieval model again for retrieval, so as to exclude the abnormal data that was corrected to normal data while the first hyperedge was eliminated, along with the association relationships and constraint relationships that this abnormal data had failed to satisfy.
6. The parallel data cleaning method according to claim 5, characterized in that the conflict hypergraph is rebuilt from the newly obtained abnormal data and the corresponding unsatisfied association relationships and constraint relationships, and the loop iterates: the minimum vertex cover point and the first hyperedge are found among the violation units formed by the new abnormal data, and the first hyperedge is eliminated, until the only abnormal data remaining is the minimum vertex cover point.
7. The parallel data cleaning method according to claim 6, characterized in that, when the only abnormal data remaining is the minimum vertex cover point, a negation operation is performed for all constraint relationships that the minimum vertex cover point fails to satisfy, and the data obtained after repairing the minimum vertex cover point is input into the abnormal data retrieval model to judge whether it is abnormal data.
8. The parallel data cleaning method according to claim 7, characterized in that, if the abnormal data retrieval model judges the data obtained after repairing the minimum vertex cover point to be normal data, the repair of all data in the database is complete; if the model judges that data to be abnormal data, the data is deleted, and the repair of all data in the database is complete.
CN201910161073.1A 2019-03-04 2019-03-04 Parallel data cleaning method Expired - Fee Related CN110069480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910161073.1A CN110069480B (en) 2019-03-04 2019-03-04 Parallel data cleaning method

Publications (2)

Publication Number Publication Date
CN110069480A true CN110069480A (en) 2019-07-30
CN110069480B CN110069480B (en) 2022-06-24

Family

ID=67366031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910161073.1A Expired - Fee Related CN110069480B (en) 2019-03-04 2019-03-04 Parallel data cleaning method

Country Status (1)

Country Link
CN (1) CN110069480B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050081194A1 (en) * 2000-05-02 2005-04-14 Microsoft Corporation Methods for enhancing type reconstruction
US20120137367A1 (en) * 2009-11-06 2012-05-31 Cataphora, Inc. Continuous anomaly detection based on behavior modeling and heterogeneous information analysis
WO2014000788A1 (en) * 2012-06-27 2014-01-03 Qatar Foundation A method for cleaning data records in a database
US20160140190A1 (en) * 2014-11-04 2016-05-19 Spatial Information Systems Research Limited Data representation
US20170193078A1 (en) * 2016-01-06 2017-07-06 International Business Machines Corporation Hybrid method for anomaly Classification
US20170364831A1 (en) * 2016-06-21 2017-12-21 Sri International Systems and methods for machine learning using a trusted model
CN107633099A (en) * 2017-10-20 2018-01-26 Northwestern Polytechnical University Importance determination method for database consistency errors
US20180276261A1 (en) * 2014-05-30 2018-09-27 Georgetown University Process and Framework For Facilitating Information Sharing Using a Distributed Hypergraph

Also Published As

Publication number Publication date
CN110069480B (en) 2022-06-24

Similar Documents

Publication Publication Date Title
Zhang et al. On complexity and optimization of expensive queries in complex event processing
Dijkman et al. Aligning business process models
Chen et al. A choice relation framework for supporting category-partition test case generation
US9519862B2 (en) Domains for knowledge-based data quality solution
US20130117202A1 (en) Knowledge-based data quality solution
WO2019185039A1 (en) A data processing method and electronic apparatus
CN111177322A (en) Ontology model construction method of domain knowledge graph
Lee et al. A survey on data cleaning methods for improved machine learning model performance
CN111241079A (en) Data cleaning method and device and computer readable storage medium
Mahdavi et al. Semi-Supervised Data Cleaning with Raha and Baran.
CN109634949A (en) A hybrid data cleaning method based on multiple data versions
CN102799960A (en) Parallel operation flow anomaly detection method oriented to data model
CN110363662A (en) A personal credit scoring system
WO2020259391A1 (en) Database script performance testing method and device
Diamantopoulos et al. Semantically-enriched Jira issue tracking data
Zhao et al. Safe semi-supervised classification algorithm combined with active learning sampling strategy
CN110069480A (en) A parallel data cleaning method
CN116467437A (en) Automatic flow modeling method for complex scene description
CN116523284A (en) Automatic evaluation method and system for business operation flow based on machine learning
Karami et al. Maintaining accurate web usage models using updates from activity diagrams
CN112395343B (en) DSG-based field change data acquisition and extraction method
EP3306540A1 (en) System and method for content affinity analytics
CN114035783A (en) Software code knowledge graph construction method and tool
Palepu et al. Meta data quality control architecture in data warehousing
JP2010267229A (en) Association processing method and flow comparative processing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A parallel data cleaning method

Effective date of registration: 20221008

Granted publication date: 20220624

Pledgee: China Construction Bank Corp., Jiangmen Branch

Pledgor: Guangdong Heng Rui Technology Co Ltd

Registration number: Y2022980017520

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220624
