CN110069480A - A parallel data cleaning method

Info

Publication number: CN110069480A
Application number: CN201910161073.1A
Authority: CN
Inventors: 姚箐晨; 陈德健
Original assignee: Guangdong Heng Rui Technology Co Ltd
Current assignee: Guangdong Heng Rui Technology Co Ltd
Priority date: 2019-03-04
Filing date: 2019-03-04
Publication date: 2019-07-30
Anticipated expiration: 2039-03-04
Also published as: CN110069480B

Abstract

The invention discloses a parallel data cleaning method. An overall architecture for a distributed parallel cleaning system is constructed; all data cells that violate constraint relationships, together with the corresponding constraints, form a conflict hypergraph on which data cleaning is carried out; and a fast cleaning method suited to massive data is formed according to the positions of the data cells and their corresponding constraints in the conflict hypergraph. The invention achieves faster data cleaning and repair with lower algorithmic complexity, and is suitable for repairing large volumes of data.

Description

A parallel data cleaning method

Technical field

The present invention relates to the technical field of data processing, and in particular to a parallel data cleaning method.

Background art

Business and scientific data have become the most valuable assets of the current era. However, errors introduced at the data source by noise greatly reduce the value of the data: inaccurate data extraction leads to missing data; merging data extracted from multiple sources leads to redundant data; and erroneous input by data providers causes data integrity constraints to no longer hold. These errors cause substantial economic losses every year. Providing data cleaning operations and improving data quality is key to managing data efficiently.

Data cleaning comprises the detection and correction of erroneous data. Many data cleaning algorithms already exist. One simple approach for global data chains together cleaning algorithms for the different error types (missing data, redundant data, erroneous data), which minimizes the complexity of processing the global data. However, this approach ignores the interactions between the different error types across the global data, so the cleaning result is ultimately unsatisfactory.

The MapReduce programming framework can use parallelism to achieve highly scalable big data cleaning. MapReduce is a programming model for parallel computation over large-scale datasets. Its central concepts, "Map" and "Reduce", are borrowed from functional programming languages and from vector programming languages. It makes it much easier for programmers with no experience in distributed parallel programming to run their programs on a distributed system. In current software implementations, the user specifies a Map function, which maps a set of key-value pairs to a new set of key-value pairs, and a concurrent Reduce function, which ensures that all mapped key-value pairs sharing the same key are grouped together.
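The Map and Reduce roles described above can be illustrated with a minimal in-process sketch. It does not use any real MapReduce framework; the helper names (`map_phase`, `shuffle`, `reduce_phase`) and the zip/city example data are assumptions made purely for illustration.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to each record; the mapper yields (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as a framework would between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to all values that share that key."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical cleaning task: find zip codes associated with several cities.
def zip_city_mapper(row):
    yield row["zip"], row["city"]


def distinct_cities(_key, cities):
    return sorted(set(cities))


rows = [
    {"zip": "510000", "city": "Guangzhou"},
    {"zip": "510000", "city": "Guangzhuo"},
    {"zip": "518000", "city": "Shenzhen"},
]
result = reduce_phase(shuffle(map_phase(rows, zip_city_mapper)), distinct_cities)
conflicting = {z: c for z, c in result.items() if len(c) > 1}
```

In a distributed setting the map calls and the per-key reduce calls would run on different workers; here they run sequentially, which keeps the data flow visible.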

Research on data cleaning first appeared in the United States, and countless algorithms have emerged since. As times change, erroneous data takes increasingly diverse forms, and the growth in data volume places new demands on the design of data cleaning algorithms. Many traditional data cleaning algorithms can no longer meet the needs of the big data era, so how to clean dirty data accurately and efficiently remains a topic worth studying. Data cleaning aims to identify and correct noise in data so as to minimize its influence on analysis results. Noise in data mainly comprises incomplete data, redundant data, conflicting data and erroneous data. Detecting and eliminating noise can be done by automatic algorithms, by data cleaning rules, or with user participation.

At present, advances in machine learning and crowdsourcing have injected new vitality into data cleaning research. Machine learning can learn the rules behind cleaning decisions from user records, reducing the burden of manual data annotation. Meanwhile, the shift from cleaning rules to machine learning models means users no longer need to formulate large numbers of data cleaning rules. Crowdsourcing publishes data cleaning tasks to the Internet to pool the knowledge and decisions of many users; it makes full use of external resources, improving the accuracy and efficiency of data cleaning while reducing its cost.

Existing data cleaning and repair methods usually use denial constraints: all data cells that do not satisfy the constraints are found according to the denial constraints; these cells and the corresponding constraints form a conflict hypergraph; a repair context is defined according to the relative positions of edges and nodes in the conflict hypergraph; and the data is cleaned and repaired according to the context and the corresponding constraints. However, existing methods handle massive data poorly, because their high algorithmic complexity makes cleaning very inefficient. Alternatively, they are algorithms designed for specific data and do not offer a unified approach to data cleaning as a whole.

Summary of the invention

The purpose of the present invention is to provide a parallel data cleaning method that addresses the above deficiencies of the prior art.

The present invention is realized by the following technical solution: a parallel data cleaning method, comprising:

constructing an abnormal data retrieval model using the association relationships and constraint relationships among all data in a database; wherein the input of the abnormal data retrieval model is each data item in the database, which is compared against the data association relationships and constraint relationships in the model; if the input fails to satisfy at least one of them, it is treated as abnormal data, and the abnormal data together with all the association relationships and constraint relationships it fails to satisfy forms the output of the abnormal data retrieval model;

constructing a hypergraph of the abnormal data according to the abnormal data output by the abnormal data retrieval model and the constraint relationships that each abnormal data item fails to satisfy; wherein each unsatisfied constraint relationship serves as a hyperedge of the hypergraph, and the at least one abnormal data item that fails to satisfy it serves as the violation units covered by that hyperedge;

selecting, as the minimum cover point, the violation unit that fails to satisfy the most constraint relationships serving as hyperedges; among the constraint relationships that the minimum cover point fails to satisfy, finding the one with the most violation units that fail to satisfy only that constraint relationship, and taking it as the first hyperedge; performing, on the violation units covered by the first hyperedge, a negation operation with respect to the constraint relationship of the first hyperedge, so that after negation the violation units that failed to satisfy that constraint relationship become normal data and the first hyperedge is eliminated;

iterating until all hyperedges are eliminated and only the violation unit of the minimum cover point remains, then performing on the minimum cover point a negation operation with respect to all of its current constraint relationships, which completes the repair of all abnormal data.

Wherein the abnormal data retrieval model satisfies formula (1):

where x is any data item in the database that is input to the abnormal data retrieval model, R_i is an association relationship related to the data x, and P_i is a constraint relationship that the data x satisfies; if the input data x does not satisfy formula (1), the data x is judged to be abnormal data, and the unsatisfied constraint relationships are determined at the same time.

Wherein, in the step of building the conflict hypergraph, a constraint d is given, the violation units V = {v₁, ..., v_n} that fail to satisfy any of its constraint relationships P_i are searched for, and pairing the two yields the conflict hypergraph.

Wherein the step of selecting the first hyperedge comprises:

taking all violation units that fail to satisfy the constraint relationship corresponding to a hyperedge as the subdomain of that hyperedge;

counting, within the subdomains formed by the violation units enclosed by all hyperedges that contain the minimum cover point, the number of violation units that fail to satisfy only the constraint relationship of a single hyperedge;

choosing, among all hyperedges that contain the minimum cover point, the hyperedge whose subdomain has the largest number of violation units failing to satisfy only its constraint relationship, as the first hyperedge.

Wherein, after the step of eliminating the first hyperedge is completed, all violation units in the initial conflict hypergraph are input into the abnormal data retrieval model again for retrieval, which excludes the abnormal data corrected to normal data by eliminating the first hyperedge, together with the association relationships and constraint relationships that this data failed to satisfy.

Wherein the conflict hypergraph is rebuilt from the newly obtained abnormal data and the corresponding unsatisfied association relationships and constraint relationships, and the iteration continues: the minimum cover point and the first hyperedge are found among the violation units formed by the new abnormal data, and the first hyperedge is eliminated, until only the minimum cover point remains as abnormal data.

Wherein, when only the minimum cover point remains as abnormal data, a negation operation is performed with respect to all the constraint relationships that the minimum cover point fails to satisfy, and the data obtained after repairing the minimum cover point is input into the abnormal data retrieval model to judge whether it is abnormal data.

Wherein, if the abnormal data retrieval model judges the data obtained after repairing the minimum cover point to be normal data, the repair of all data in the database is complete; if it judges that data to be abnormal data, the data is deleted, which completes the repair of all data in the database.
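The final check on the minimum cover point can be sketched as follows. The function name `finalize_cover_point`, the `repair` and `is_abnormal` callbacks, and the age-based example are hypothetical stand-ins for the negation operation and the abnormal data retrieval model; the source does not specify how either is implemented.

```python
def finalize_cover_point(database, key, repair, is_abnormal):
    """Repair the last remaining cover point; delete it if still abnormal.

    `repair` stands in for the negation operation and `is_abnormal` for the
    abnormal data retrieval model (both are assumptions of this sketch).
    """
    candidate = repair(database[key])
    result = dict(database)  # leave the input database untouched
    if is_abnormal(candidate):
        del result[key]  # the repair did not help: drop the record
    else:
        result[key] = candidate  # the repair succeeded: keep the new value
    return result


def fix_age(row):
    return {"age": abs(row["age"])}


def age_out_of_range(row):
    return row["age"] < 0 or row["age"] > 150


kept = finalize_cover_point({1: {"age": -3}, 2: {"age": 40}}, 1, fix_age, age_out_of_range)
dropped = finalize_cover_point({1: {"age": -999}}, 1, fix_age, age_out_of_range)
```

In the first call the repaired value passes the check and is kept; in the second it still fails, so the record is removed.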

In contrast to the prior art, the parallel data cleaning method of the present invention constructs an overall architecture for a distributed parallel cleaning system, forms a conflict hypergraph from all data cells that violate constraint relationships together with the corresponding constraints, carries out data cleaning on it, and, according to the positions of the data cells and their corresponding constraints in the conflict hypergraph, forms a fast cleaning method suited to massive data. The invention achieves faster data cleaning and repair with lower algorithmic complexity, and is suitable for repairing large volumes of data.

Brief description of the drawings

Fig. 1 is a schematic flowchart of a parallel data cleaning method provided by the present invention.

Specific embodiments

In the following description, numerous specific details are set forth to provide a full understanding of the present invention. However, the present invention can be implemented in many ways other than those described here, and those skilled in the art can make similar extensions without departing from its spirit; the present invention is therefore not limited to the specific embodiments disclosed below.

Furthermore, the present invention is described in detail with reference to schematic diagrams. When describing the embodiments, the diagrams are examples given for ease of illustration and should not limit the scope of protection of the invention.

Referring to Fig. 1, Fig. 1 is a schematic flowchart of a parallel data cleaning method provided by the present invention. The steps of the method include:

S110: constructing an abnormal data retrieval model using the association relationships and constraint relationships among all data in a database; wherein the input of the abnormal data retrieval model is each data item in the database, which is compared against the data association relationships and constraint relationships in the model; if the input fails to satisfy at least one of them, it is treated as abnormal data, and the abnormal data together with all the association relationships and constraint relationships it fails to satisfy forms the output of the abnormal data retrieval model.
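Step S110 can be sketched as a function that checks every data item against a set of named checks and returns only the failing items with the checks they fail. This is a minimal sketch, assuming each association or constraint relationship can be expressed as a Boolean predicate over a single row; the function name `detect_abnormal` and the example constraints are hypothetical.

```python
def detect_abnormal(rows, constraints):
    """Return {row_id: [names of violated constraints]} for abnormal rows only.

    `constraints` maps a name to a predicate that is True when the row
    satisfies the relationship (an assumption of this sketch).
    """
    abnormal = {}
    for row_id, row in rows.items():
        violated = [name for name, holds in constraints.items() if not holds(row)]
        if violated:
            abnormal[row_id] = violated
    return abnormal


# Hypothetical single-row constraints.
constraints = {
    "age_non_negative": lambda r: r["age"] >= 0,
    "minor_has_no_salary": lambda r: r["age"] >= 18 or r["salary"] == 0,
}
rows = {
    1: {"age": 30, "salary": 5000},
    2: {"age": -4, "salary": 100},
    3: {"age": 12, "salary": 800},
}
abnormal = detect_abnormal(rows, constraints)
```

Row 1 satisfies everything and is therefore absent from the output, which matches the description that only abnormal data and its unsatisfied relationships are emitted.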

S120: constructing a hypergraph of the abnormal data according to the abnormal data output by the abnormal data retrieval model and the constraint relationships that each abnormal data item fails to satisfy; wherein each unsatisfied constraint relationship serves as a hyperedge of the hypergraph, and the at least one abnormal data item that fails to satisfy it serves as the violation units covered by that hyperedge.
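Step S120 amounts to inverting the detection output: instead of listing constraints per violation unit, each constraint becomes a hyperedge that lists the units it covers. The sketch below assumes the detection output is a dictionary from unit to violated constraint names; the function name and the unit labels such as `t2.age` are illustrative only.

```python
from collections import defaultdict


def build_conflict_hypergraph(abnormal):
    """Invert {unit: [violated constraints]} into {constraint: {units}}."""
    hyperedges = defaultdict(set)
    for unit, violated in abnormal.items():
        for constraint in violated:
            hyperedges[constraint].add(unit)
    return dict(hyperedges)


# Hypothetical detection output: unit -> constraints it violates.
abnormal = {
    "t2.age": ["c1", "c2"],
    "t3.salary": ["c2"],
    "t5.zip": ["c3", "c2"],
}
hypergraph = build_conflict_hypergraph(abnormal)
```

Storing each hyperedge as a set makes it cheap to ask which units share a constraint, which the selection step relies on.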

S130: selecting, as the minimum cover point, the violation unit that fails to satisfy the most constraint relationships serving as hyperedges; among the constraint relationships that the minimum cover point fails to satisfy, finding the one with the most violation units that fail to satisfy only that constraint relationship, and taking it as the first hyperedge; performing, on the violation units covered by the first hyperedge, a negation operation with respect to the constraint relationship of the first hyperedge, so that after negation the violation units that failed to satisfy that constraint relationship become normal data and the first hyperedge is eliminated.
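The two selection rules in S130 can be sketched as below, assuming the conflict hypergraph is a dictionary from constraint name to the set of violation units it covers. The function names and the alphabetical tie-breaking are assumptions added to make the sketch deterministic; the source does not say how ties are resolved, and the negation operation itself is not modelled here.

```python
def select_cover_point(hypergraph):
    """Pick the unit that appears in the most hyperedges.

    Returns the unit and the degree table (unit -> number of hyperedges).
    """
    degree = {}
    for units in hypergraph.values():
        for unit in units:
            degree[unit] = degree.get(unit, 0) + 1
    # Sorting first means ties go to the alphabetically first unit.
    cover_point = max(sorted(degree), key=lambda u: degree[u])
    return cover_point, degree


def select_first_hyperedge(hypergraph, cover_point, degree):
    """Among hyperedges through the cover point, pick the one covering the
    most units that violate no other constraint (degree 1)."""
    candidates = sorted(e for e, units in hypergraph.items() if cover_point in units)
    return max(candidates, key=lambda e: sum(1 for u in hypergraph[e] if degree[u] == 1))


hypergraph = {
    "c1": {"a", "b"},
    "c2": {"a", "c", "d"},
    "c3": {"a", "b"},
}
cover_point, degree = select_cover_point(hypergraph)
first = select_first_hyperedge(hypergraph, cover_point, degree)
```

Here unit `a` lies on all three hyperedges, and `c2` is chosen first because it covers two units (`c` and `d`) that violate nothing else.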

S140: iterating until all hyperedges are eliminated and only the violation unit of the minimum cover point remains, then performing on the minimum cover point a negation operation with respect to all of its current constraint relationships, which completes the repair of all abnormal data.
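The control flow of the iteration can be sketched on a violation table alone, without touching real data. This is a simplified sketch under a strong assumption: eliminating a hyperedge simply makes every unit stop violating that constraint and never introduces new violations, whereas the method described here re-runs the retrieval model after each elimination. The function name `elimination_order` is hypothetical.

```python
def elimination_order(violations):
    """Return the order in which hyperedges (constraints) are eliminated.

    `violations` maps each unit to the set of constraint names it violates.
    """
    violations = {u: set(cs) for u, cs in violations.items() if cs}
    order = []
    while violations:
        # Cover point: the unit violating the most constraints (name breaks ties).
        cover = max(sorted(violations), key=lambda u: len(violations[u]))

        def single_violators(constraint):
            # Units whose only violated constraint is this one.
            return sum(1 for cs in violations.values() if cs == {constraint})

        # First hyperedge: the cover point's constraint with the most such units.
        first = max(sorted(violations[cover]), key=single_violators)
        order.append(first)

        # Stand-in for the repair: every unit stops violating `first`.
        for cs in violations.values():
            cs.discard(first)
        violations = {u: cs for u, cs in violations.items() if cs}
    return order


order = elimination_order({
    "a": {"c1", "c2", "c3"},
    "b": {"c1", "c3"},
    "c": {"c2"},
    "d": {"c2"},
})
```

Each pass removes one hyperedge, so the loop runs at most once per constraint under the stated assumption.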

Wherein the abnormal data retrieval model satisfies formula (1):

where x is any data item in the database that is input to the abnormal data retrieval model, R_i is an association relationship related to the data x, and P_i is a constraint relationship that the data x satisfies; if the input data x does not satisfy formula (1), the data x is judged to be abnormal data, and the unsatisfied constraint relationships are determined at the same time.
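Formula (1) is not reproduced in this text. As an assumption based on the symbols defined around it (data x, relation atoms R_i, predicates P_i) and on the way denial constraints are commonly written in the data cleaning literature, it plausibly has the shape of the first line below; the indices k and m are introduced here for illustration. The second line is a familiar example, not taken from the source: the functional dependency "zip determines city" written as a denial constraint over two tuples t₁ and t₂.

```latex
% Assumed general shape of a denial constraint (not confirmed by the source):
\forall x \;\; \neg\bigl( R_1(x) \land \dots \land R_k(x) \land P_1 \land \dots \land P_m \bigr)

% Illustrative instance: the functional dependency zip -> city
\forall t_1, t_2 \;\; \neg\bigl( t_1.\mathrm{zip} = t_2.\mathrm{zip} \;\land\; t_1.\mathrm{city} \neq t_2.\mathrm{city} \bigr)
```

Read this way, a data item is abnormal exactly when all predicates inside the negation hold at once for it.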

The abnormal data retrieval model is based on denial constraints (DCs). A denial constraint is a first-order formula in which x denotes data cells and R_i ∈ R is a relation atom. Each P_i has the form v₁ θ v₂ with θ ∈ B. For the similarity predicate ≈, the predicate takes effect when the edit distance between two strings is greater than a user-given threshold σ. Single-tuple constraints, functional dependencies, matching dependencies and conditional functional dependencies are all denial constraints over tuples.

Given a database instance I of schema S and a denial constraint φ, we write I ⊨ φ if I satisfies φ. For a set of DCs Σ, we write I ⊨ Σ if and only if I ⊨ φ for every φ ∈ Σ.

A repair I' of a database instance I that contains erroneous data must satisfy the denial constraints Σ on I and have the same set of tuple identifiers as I. Attribute values in I and I' may differ, and different repairs I' exist over the domains of the attributes in R; these repairs are new values for attributes. In a repair, each new value FV of attribute A can be replaced by a value chosen from Dom(A) with reference to Dom^a(A), where Dom^a(A) is a domain of values that satisfy at least one predicate in each denial constraint involving FV. In other words, a new value must satisfy all the constraints on the actual attribute.
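The similarity predicate can be illustrated with a standard Levenshtein edit distance. This is a minimal sketch, assuming "edit distance" means Levenshtein distance (the source does not name a specific variant); the function names are hypothetical, and the comparison direction follows the wording above, where the predicate takes effect when the distance is greater than σ.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def similarity_predicate_fires(a, b, sigma):
    """True when the edit distance exceeds the user-given threshold sigma."""
    return edit_distance(a, b) > sigma


d = edit_distance("kitten", "sitting")
fires = similarity_predicate_fires("Guangzhou", "Guangzhuo", 1)
```

Only two rows of the dynamic programming table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.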

The conflict hypergraph is used to detect violations. Its nodes are violation units, and each hyperedge connects the nodes involved in one violation. It represents the current state of the data under all constraint relationships. From this state, the interactions between different violations, and between different constraints, can be analyzed.

Given a DC d and the violation units V = {v₁, ..., v_n} connected by a hyperedge, each v_i ∈ V has at least one candidate repair v_i θ t, where t relates to the units in V, that is, to all the constraints violated by the violation unit v_i.

Wherein the step of selecting the first hyperedge comprises:

selecting the violated constraint to repair first according to the number of hyperedges in which each violation unit in the conflict hypergraph participates, and repairing the violation units that violate the same constraint together according to the corresponding constraint and repair rule.

When different violation units are enclosed by one hyperedge of the hypergraph, they all violate the constraint relationship corresponding to that hyperedge, so the units within that hyperedge are handled similarly. In cleaning violating data, violation units that belong to the same hyperedge are processed together.
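Processing all units of one hyperedge together can be sketched as applying a single repair rule per constraint. The function name `repair_by_hyperedge` and the rules used here (`abs` for negative ages, `str.strip` for untrimmed names) are hypothetical placeholders; the source does not define concrete repair rules.

```python
def repair_by_hyperedge(values, hypergraph, repair_rules):
    """Apply one repair rule per hyperedge to every unit that hyperedge covers."""
    repaired = dict(values)  # work on a copy so the input stays unchanged
    for constraint, units in hypergraph.items():
        rule = repair_rules[constraint]
        for unit in sorted(units):
            repaired[unit] = rule(repaired[unit])
    return repaired


values = {"t1.age": -5, "t2.age": -17, "t3.name": " bob "}
hypergraph = {
    "age_non_negative": {"t1.age", "t2.age"},
    "name_trimmed": {"t3.name"},
}
# Hypothetical repair rules, one per constraint.
repair_rules = {"age_non_negative": abs, "name_trimmed": str.strip}
repaired = repair_by_hyperedge(values, hypergraph, repair_rules)
```

Because the units of different hyperedges are independent in this example, each hyperedge could be handed to a separate worker, which is where a parallel implementation would gain its speed.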

Although the present invention has been disclosed above by way of preferred embodiments, these are not intended to limit it. Any person skilled in the art may, without departing from the spirit and scope of the present invention, use the methods and technical content disclosed above to make possible variations and modifications to its technical solution. Therefore, any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the present invention, without departing from the content of its technical solution, falls within the scope of protection of the technical solution of the present invention.

Claims

1. A parallel data cleaning method, characterized by comprising:

constructing an abnormal data retrieval model using the association relationships and constraint relationships among all data in a database; wherein the input of the abnormal data retrieval model is each data item in the database, which is compared against the data association relationships and constraint relationships in the model; if the input fails to satisfy at least one of them, it is treated as abnormal data, and the abnormal data together with all the association relationships and constraint relationships it fails to satisfy forms the output of the abnormal data retrieval model;

constructing a hypergraph of the abnormal data according to the abnormal data output by the abnormal data retrieval model and the constraint relationships that each abnormal data item fails to satisfy; wherein each unsatisfied constraint relationship serves as a hyperedge of the hypergraph, and the at least one abnormal data item that fails to satisfy it serves as the violation units covered by that hyperedge;

selecting, as the minimum cover point, the violation unit that fails to satisfy the most constraint relationships serving as hyperedges; among the constraint relationships that the minimum cover point fails to satisfy, finding the one with the most violation units that fail to satisfy only that constraint relationship, and taking it as the first hyperedge; performing, on the violation units covered by the first hyperedge, a negation operation with respect to the constraint relationship of the first hyperedge, so that after negation the violation units that failed to satisfy that constraint relationship become normal data and the first hyperedge is eliminated;

iterating until all hyperedges are eliminated and only the violation unit of the minimum cover point remains, then performing on the minimum cover point a negation operation with respect to all of its current constraint relationships, which completes the repair of all abnormal data.

2. The parallel data cleaning method according to claim 1, characterized in that the abnormal data retrieval model satisfies formula (1):

where x is any data item in the database that is input to the abnormal data retrieval model, R_i is an association relationship related to the data x, and P_i is a constraint relationship that the data x satisfies; if the input data x does not satisfy formula (1), the data x is judged to be abnormal data, and the unsatisfied constraint relationships are determined at the same time.

3. The parallel data cleaning method according to claim 1, characterized in that, in the step of building the conflict hypergraph, a constraint d is given, the violation units V = {v₁, ..., v_n} that fail to satisfy any of its constraint relationships P_i are searched for, and pairing the two yields the conflict hypergraph.

4. The parallel data cleaning method according to claim 1, characterized in that the step of selecting the first hyperedge comprises:

counting, within the subdomains formed by the violation units enclosed by all hyperedges that contain the minimum cover point, the number of violation units that fail to satisfy only the constraint relationship of a single hyperedge;

choosing, among all hyperedges that contain the minimum cover point, the hyperedge whose subdomain has the largest number of violation units failing to satisfy only its constraint relationship, as the first hyperedge.

5. The parallel data cleaning method according to claim 4, characterized in that, after the step of eliminating the first hyperedge is completed, all violation units in the initial conflict hypergraph are input into the abnormal data retrieval model again for retrieval, which excludes the abnormal data corrected to normal data by eliminating the first hyperedge, together with the association relationships and constraint relationships that this data failed to satisfy.

6. The parallel data cleaning method according to claim 5, characterized in that the conflict hypergraph is rebuilt from the newly obtained abnormal data and the corresponding unsatisfied association relationships and constraint relationships, and the iteration continues: the minimum cover point and the first hyperedge are found among the violation units formed by the new abnormal data, and the first hyperedge is eliminated, until only the minimum cover point remains as abnormal data.

7. The parallel data cleaning method according to claim 6, characterized in that, when only the minimum cover point remains as abnormal data, a negation operation is performed with respect to all the constraint relationships that the minimum cover point fails to satisfy, and the data obtained after repairing the minimum cover point is input into the abnormal data retrieval model to judge whether it is abnormal data.

8. The parallel data cleaning method according to claim 7, characterized in that, if the abnormal data retrieval model judges the data obtained after repairing the minimum cover point to be normal data, the repair of all data in the database is complete; if it judges that data to be abnormal data, the data is deleted, which completes the repair of all data in the database.