WO2015181511A1 - Data cleaning methods and systems - Google Patents

Data cleaning methods and systems

Info

Publication number
WO2015181511A1
Authority
WO
WIPO (PCT)
Prior art keywords
holistic
candidate
pattern
data
patterns
Application number
PCT/GB2014/051670
Other languages
French (fr)
Inventor
Nan TANG
Mourad Ouzzani
Paolo Papotti
Ihab Francis Ilyas Kaldas
Xu Chu
Original Assignee
Qatar Foundation
Hoarton, Lloyd
Application filed by Qatar Foundation, Hoarton, Lloyd filed Critical Qatar Foundation
Priority to PCT/GB2014/051670 priority Critical patent/WO2015181511A1/en
Publication of WO2015181511A1 publication Critical patent/WO2015181511A1/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Definitions

  • The preferred embodiment of the present invention takes as input to the holistic pattern module a table T, a KB K, and a number k, and produces top-k holistic patterns.
  • The preferred embodiment uses Method 1.
  • Input: a table T, a KB K, and a number k.
  • Each input list contains either the candidate types for a column or the candidate relationships for a column pair, ranked by support.
  • Two lists are joinable if they agree on one common column, and the score for joining is the coherence score between a type and a relationship.
  • The rank-join algorithm scans the ranked input lists in descending order of their support scores (lines 3-4); holistic patterns are generated incrementally as we move down the input lists. Holistic patterns that cannot be used to produce top-k patterns are pruned (lines 5-6). For each join result, i.e., each holistic pattern φ, the score score(φ) is computed (lines 7-8). We also maintain an upper bound B on the scores of all unseen join results, i.e., unseen holistic patterns (line 10). We terminate the join process if either we have exhaustively scanned every input list, or we have obtained top-k holistic patterns and the score of the k-th holistic pattern is greater than or equal to B (line 11). Since each list is ranked, B can be computed by adding up the support scores at the current positions in the ranked lists, plus the maximum coherence score the relationships involved can achieve.
  • Lines 5-6 in Method 1 check whether a candidate type Tt for column Ai can be pruned, without generating holistic patterns involving Tt, by calling Method 2.
  • Tt is useful if it is more coherent with some relationship P than the previously examined types for Ai.
  • Tt can be pruned if we are sure that the maximum sum of coherence scores it can achieve is still smaller than the current smallest sum of coherence scores involving column Ai (lines 3-6).
  • Example 8: Consider the rank-join graph in Fig. 5 for a table with just two columns B and C, as in (100). Support scores are shown in braces.
  • Method 1 is guaranteed to produce the top-k holistic patterns, since we keep the current top-k patterns in P and terminate only when we are sure that no new holistic pattern with a higher score can be produced. In the worst case, we still have to exhaustively go through all the ranked lists to produce the top-k holistic patterns. However, in most cases the top ranked holistic patterns involve only candidate types/relationships with high support, which are at the top of the lists. Since computing the coherence score for a type and a relationship is an expensive operation that requires set intersection and set union, we compute offline the coherence score for every type and every relationship in K. For each relationship, we also keep the maximum coherence score it can achieve with any type, to efficiently compute the bound B. A simplified sketch of this rank-join loop follows.
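A simplified sketch of the loop, assuming two ranked candidate lists and a single precomputed coherence function (the full Method 1 joins type lists with relationship lists and applies the Method 2 pruning; all names here are illustrative, not the patented implementation):

import heapq

def rank_join_topk(list_a, list_b, coherence, max_coh, k):
    # list_a, list_b: candidates as (support, name), sorted by support
    # descending. coherence(a, b) scores a join result; max_coh bounds
    # the coherence of any pair not yet seen.
    topk = []                                  # min-heap of (score, pair)
    for d in range(max(len(list_a), len(list_b))):
        # Join the new row at depth d against every row seen so far.
        pairs = [(d, j) for j in range(min(d + 1, len(list_b)))]
        pairs += [(i, d) for i in range(min(d, len(list_a)))]
        for i, j in pairs:
            if i >= len(list_a) or j >= len(list_b):
                continue
            sa, a = list_a[i]
            sb, b = list_b[j]
            heapq.heappush(topk, (sa + sb + coherence(a, b), (a, b)))
            if len(topk) > k:
                heapq.heappop(topk)            # keep only the best k
        # Upper bound B on any unseen join result: supports at the
        # current list positions plus the best achievable coherence.
        bound = (list_a[min(d, len(list_a) - 1)][0]
                 + list_b[min(d, len(list_b) - 1)][0] + max_coh)
        if len(topk) == k and topk[0][0] >= bound:
            break                              # early termination
    return sorted(topk, reverse=True)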
  • Once the top-k candidate holistic patterns are generated, they are forwarded to the validation module to be validated.
  • The validation module takes the set P of candidate holistic patterns, a table T, a KB K, and a crowdsourcing framework (e.g., MTurk) as input, and identifies the most appropriate holistic pattern for T w.r.t. K, with the objective of minimizing the number of questions to the crowd.
  • A naive approach to generating crowdsourcing questions is to express the candidate holistic patterns directly as questions to the crowd, who would then select the best one.
  • However, crowd workers can only answer simple questions, and holistic pattern graphs can be hard for crowd users to validate (e.g., 301).
  • We therefore use two simpler kinds of questions: column type validation, i.e., validating the type of a column; and column pair relationship validation, i.e., validating the relationship between two columns.
  • CrowdTypeVal takes as input a table T and a set of candidate types candT(Ai) for a column Ai, and outputs the most appropriate type for this column according to the crowd. Since we require at least a θ percentage of tuples to support a type, the algorithm first generates ⌈θ·n/kt⌉ questions asking about the type of column Ai over θ·n tuples, where n is the number of tuples in T and kt is the number of tuples shown in one question (e.g., kt = 3) (lines 1-2). Votes for each candidate type are collected (lines 3-5), and the candidate type Tt with the maximum number of votes is recorded (line 6).
  • If Tt's votes do not exceed the threshold θ, additional questions are issued for the remaining set of tuples that have not been asked, until Tt's votes exceed θ or there are no more tuples left (lines 8-13). If Tt reaches the minimum vote threshold, it is selected as the type for Ai. A loose sketch of this batching and vote-threshold loop is shown below.
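A loose sketch of the batching and vote-threshold logic (ask_crowd is a hypothetical crowdsourcing callback, and the vote accounting is simplified relative to the full CrowdTypeVal):

import math
from collections import Counter

def crowd_type_val(tuples, column, cand_types, ask_crowd, theta=0.5, kt=3):
    # ask_crowd(values, cand_types) shows up to kt tuple values to a
    # worker and returns the candidate type the worker picks.
    n = len(tuples)
    need = math.ceil(theta * n)      # minimum supporting votes
    votes = Counter()
    asked = 0
    while asked < n:
        batch = [t[column] for t in tuples[asked:asked + kt]]
        votes[ask_crowd(batch, cand_types)] += len(batch)
        asked += len(batch)
        best, best_votes = votes.most_common(1)[0]
        if best_votes >= need:
            return best              # threshold theta reached
    return None                      # no candidate type reached theta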
  • Candidate types and candidate relationships are stored as URIs in KBs; these are not directly consumable by the crowd.
  • For example, the type Capital for column C is stored as http://yago-knowledge.org/resource/wordnet_capital_10851850, and the relationship hasCapital for column pair B and C is stored as http://yago-knowledge.org/resource/hasCapital.
  • We obtain human-readable type and relationship descriptions, e.g., Capital and hasCapital, by querying the KB for the labels of the corresponding URIs. If no label exists, we process the URI itself by removing the text before the last slash and punctuation symbols.
  • Example 9: The five candidate holistic patterns over columns B and C, with their scores and probabilities:

        pattern   type(B)   type(C)   P(B,C)       score   Prob
        φ1        country   Capital   hasCapital   2.8     0.35
        φ2        economy   Capital   hasCapital   2       0.25
        φ3        country   City      locatedIn    2       0.25
        φ4        country   Capital   locatedIn    0.8     0.1
        φ5        state     Capital   hasCapital   0.4     0.05
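  • The Prob column appears to be each pattern's score normalized over the five candidates, e.g., Prob(φ1) = 2.8 / (2.8 + 2 + 2 + 0.8 + 0.4) = 2.8 / 8.0 = 0.35; the remaining probabilities 0.25, 0.25, 0.1 and 0.05 follow in the same way.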
  • Theorem 1: The expected uncertainty reduction after validating a column (or column pair) v is the same as the entropy of the variable.
  • Method 4 describes the overall procedure for pattern validation. At each iteration: (1) we choose the best variable vbest to validate next, based on the expected reduction of uncertainty (lines 4-10); (2) we remove from Pre those holistic patterns that have a different assignment for the variable than the validated value a (lines 11-12); and (3) we renormalize the probability distribution of the remaining holistic patterns in Pre (line 13). We terminate when we are left with only one holistic pattern (line 3). A sketch of this entropy-guided loop is given below.
  • Example 10: To validate the five patterns in Example 9, we first calculate the entropy of every variable; e.g., for the type variable of column C, H(vC) = 0.81.
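A minimal sketch of this entropy-guided loop (ask_crowd is again a hypothetical callback; patterns carries the assignments and probabilities from the table above). On that table, the marginal distribution for the type of column C is {Capital: 0.75, City: 0.25}, whose entropy is about 0.81, matching Example 10:

import math
from collections import defaultdict

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def validate_patterns(patterns, ask_crowd):
    # patterns: {id: (assignment, prob)}, where assignment maps each
    # variable (a column type or a column-pair relationship) to its
    # value in that pattern.
    while len(patterns) > 1:
        # Marginal value distribution of every variable; by Theorem 1
        # its entropy equals the expected uncertainty reduction.
        marg = defaultdict(lambda: defaultdict(float))
        for assign, prob in patterns.values():
            for var, val in assign.items():
                marg[var][val] += prob
        v_best = max(marg, key=lambda v: entropy(marg[v]))
        answer = ask_crowd(v_best, list(marg[v_best]))
        # Drop patterns inconsistent with the answer and renormalize.
        patterns = {pid: (a, p) for pid, (a, p) in patterns.items()
                    if a[v_best] == answer}
        total = sum(p for _, p in patterns.values())
        patterns = {pid: (a, p / total)
                    for pid, (a, p) in patterns.items()}
    (assign, _), = patterns.values()
    return assign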
  • The validated holistic pattern is used to annotate tuples in the data set. Annotated tuples may then be repaired where appropriate.
  • The annotation module annotates tuples into three categories: (i) correct data validated by KBs; (ii) correct data jointly validated by KBs and the crowd; (iii) erroneous data.
  • The annotation module uses the following steps to annotate the tuples in the dataset. We use the table in Fig 1 in the examples that follow.
  • Step 1: Validation by KBs.
  • A SPARQL query checks whether t is fully covered by K. If it is fully covered, the annotation module annotates it as a correct tuple validated by the KB (case (i)). Otherwise, it goes to Step 2.
  • Example 11: Consider tuple t in Fig. 1 and the holistic pattern in (301); a boolean SPARQL query is run to check whether t is fully covered.
  • Step 2: Validation by KBs and Crowd.
  • The annotation module asks the crowd whether the relationship holds between the given two values. If the crowd says yes, the annotation module annotates the tuple as a correct tuple, jointly validated by the KB and the crowd (case (ii)). Otherwise, it is certain that there exist errors in this tuple (case (iii)).
  • Example 12: Consider tuple t2 (resp. t3) in Fig. 1 and the holistic pattern in (301); the crowd will say yes (resp. no) to the corresponding question.
  • The annotation module collects all such new facts and then enriches the given KB (207).
  • For an erroneous tuple, the annotation module may determine that a repair is required and generates a list of candidate repairs.
  • The enriched KB may be used to determine repairs.
  • An instance graph is one possible instantiation of a holistic pattern in a given KB.
  • Example 13: In Figure 6, (601) and (602) are two instance graphs of the holistic pattern of (301) in Yago, for players Andrea Pirlo and Xabi Alonso, respectively.
  • The repair cost of aligning t to an instance graph G w.r.t. φ, denoted cost(t, φ, G), is the number of value changes in t needed to align it with G.
  • Example 14: Consider tuple t3 in Fig. 1, the holistic pattern φs in (301), and the two instance graphs G1 and G2 in (601) and (602). The repair cost from t3 to G2 is 5.
  • The possible repairs are ranked by repair cost in ascending order; e.g., the repair using G1 for t3, with cost 1, is ranked higher than the repair using G2 for t3, with cost 5.
  • Instead of using the top-1 repair to update a value, we provide top-k possible repairs and leave it to the users to pick which repair to apply.
  • Naive Top-k Possible Repair Generation: In one aspect of the preferred embodiment, given a KB K and a holistic pattern φ, we compute offline all instance graphs G in K w.r.t. φ. For each tuple t, a naive solution is to compute the distance between t and each instance graph G in the set G. The k graphs with the smallest repair cost are returned as the top-k possible repairs.
  • Optimized Top-k Possible Repair Generation: In another aspect of the preferred embodiment, we only retrieve instance graphs that can yield reasonable repairs, i.e., the instance graphs whose values overlap with the given erroneous tuple.
  • Each inverted list is a mapping from a key to a set G of instance graphs.
  • Each key is a pair (A, a), where A is an attribute and a is a constant value.
  • Each instance graph G in the set has value a on attribute A.
  • For example, an inverted list w.r.t. G1 in (601) is: (country, Italy) → G1.
  • Step 1: Find all possible repairs by retrieving all inverted lists related to t.
  • Step 2: Count the occurrences of instance graphs across all retrieved inverted lists.
  • Step 3: Return the top-k instance graphs with the highest occurrence counts as the top-k possible repairs.
  • Step 1 gets all possible repairs, since each instance graph in a retrieved inverted list has some value overlap with the given tuple; moreover, every instance graph that has value overlap with t is indexed and is thus retrieved. Step 2 in effect computes the repair cost of each possible repair (more shared values means fewer changes needed), and Step 3 returns the top-k possible repairs. A sketch of these three steps is given below.
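A compact sketch of the three steps (illustrative names; the inverted index maps (attribute, value) keys to instance-graph ids as described above):

from collections import Counter

def topk_repairs(t, inverted_index, k, num_attrs):
    # A graph sharing more values with t needs fewer changes, so
    # repair cost = num_attrs - number of overlapping values.
    occurrences = Counter()
    for attr, value in t.items():                        # Step 1
        for g in inverted_index.get((attr, value), ()):
            occurrences[g] += 1                          # Step 2
    return [(g, num_attrs - c)                           # Step 3
            for g, c in occurrences.most_common(k)]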
  • Example 15: Consider t3 in Fig. 1 and pattern φs in (301). The inverted lists retrieved include: (A, Andrea Pirlo) → G1.
  • The annotation module hence outputs two possible repairs for t3: G1 and G2.
  • The study demonstrates the effectiveness of the preferred embodiment of the present invention using holistic patterns, and shows the efficiency of the holistic pattern and validation modules. Furthermore, the study demonstrates the effectiveness of data annotation and its use to repair and clean a dataset. In the experiments we use three datasets:
  • Country contains countries and capitals with currency and language, obtained from a Wikipedia page (http://en.wikipedia.org/wiki/List_of_countries_and_capitals_with_currency_and_language);
  • The first embodiment, denoted Support, ranks the candidate types and relationships by their support scores.
  • The second embodiment, denoted MaxLike, infers the type of a column and the relationship between a column pair separately, using a maximum likelihood approach.
  • The third embodiment is the preferred embodiment, denoted RankJoin, given in Methods 1 to 4. We denote by RankJoinBase the rank-join method without the type pruning optimization presented in Method 2.
  • Gold standard: For a given table T and a given KB K, we retrieved all candidate types and relationships by issuing SPARQL queries, and manually annotated each candidate as vital, okay, or incorrect. For example, for column B in Fig. 1, country is vital; economy is okay; while state is incorrect.
  • Evaluation: We compared the top-k holistic patterns returned by each embodiment with the labelled candidates in the gold standard. For MaxLike and Support, top-k holistic patterns were constructed based on the maximum likelihood scores and the support scores, respectively. To compute precision, a candidate type or relationship has a score of 1.0 if it was labelled vital; 0.5 if it was labelled okay; and 0 otherwise.
  • Support has the lowest precision because it looks for the candidates that cover the largest number of tuples, which may be overly general types or relationships, such as the root type owl:Thing in Yago.
  • MaxLike does better than Support because it maximizes the likelihood of the values given the candidates.
  • RankJoin is even better than MaxLike because it takes into account the coherence scores between types and relationships, and seeks the most coherent patterns.
  • RankJoin, MaxLike and Support have similar running times, since the dominating computation is retrieving candidates from the KBs, which is common to all embodiments.
  • RankJoinBase is quite expensive, since it needs to enumerate holistic patterns despite the advantage of rank join, showing that type pruning is extremely beneficial.
  • The holistic pattern determined by the holistic pattern module may have variants, depending on the cost required. Either one or all of the holistic pattern variants may be considered as aspects of the present invention; we use the second variant in the preferred embodiment.
  • A first type of pattern contains only the type of each column, denoted Type1.
  • A second type of pattern, used in the embodiments above, contains the type of each column and the relationship of every column pair, denoted Type2.
  • A third type of pattern additionally considers indirect relationships between a column pair, denoted Type3. For example, consider a table with two columns A and B representing pairs of actors who have played in the same movie before; A and B are related only indirectly, through the movie.
  • Pattern size is defined as the sum of the number of candidate types, the number of candidate relationships, and the number of candidate indirect relationships above a minimum support threshold (0.1). Patterns of larger size have richer semantics and are thus more powerful in capturing errors.
  • MUVF performs consistently better than AVI in terms of the number of questions needed, for two reasons. First, MUVF may reduce the number of variables to validate, thus saving a considerable number of questions, e.g., for country and soccer, since validating each variable requires a number of questions linear in the number of tuples.
  • Figure 9 shows data coverage of KBs for three datasets.
  • The left bar in Fig. 9 (901) depicts the percentage of tuples that are fully/partially covered by KBs. It shows that for country, about 40% of tuples are fully covered; for soccer and university, only around 5% of tuples are fully covered.
  • The right bar in Fig. 9 (901) depicts the percentage of relationships that are covered by KBs.
  • For a tuple t and the pattern in Fig. 8 (801), if only two of its relationships appear in a KB, its coverage on relationships is 2/3. This tells us that although for some datasets (e.g., soccer and university) only a small number of tuples are fully covered, around 50% of the relationships between tuples are covered.
  • Figure 9 complements the above experiment by showing, at a finer granularity, the percentage of tuples that are partially covered by KBs.
  • The bar for number 0 (resp. 2) for country means that, of all tuples that are partially covered by KBs, 30% (resp. 50%) have coverage on 0 relationships (resp. 2 relationships).
  • This figure shows that for most tuples the number of relationships covered by KBs is good, e.g., 50% of partially covered tuples have 2 out of 3 relationships covered by KBs.
  • The preferred embodiment generates possible repairs with good recall (e.g., 90% for country, 80% for university, and 40% for soccer).
  • The present invention is a data cleaning system using knowledge bases and crowdsourcing. Correspondence between the possibly dirty database and the KBs is established by determining and validating holistic patterns. Then each tuple in the database is verified using a holistic pattern against a KB, with possible crowd involvement when the KB has no coverage. Experimental results demonstrate that the present invention is able to detect data errors, suggest high quality possible repairs, and repair erroneous tuples.
  • Embodiments of the present invention also encompass systems incorporating a database stored on a tangible computer readable medium which is coupled to the system by a communicative coupling, such as a computer network (e.g. the internet or a local area network).
  • the system may, therefore, include a network interface.

Abstract

An end-to-end system to annotate unknown type data instances using a knowledge base and crowdsourcing. A computer implemented method for cleaning a database instance using a plurality of holistic patterns, the database instance comprising a plurality of dirty tuples with unknown attribute data types, the method comprising: generating a plurality of candidate holistic patterns using the database instance and a knowledge base, the knowledge base comprising data-types and data-type relationships; determining a valid holistic pattern from the plurality of candidate holistic patterns using at least one of: the knowledge base; and a crowd of users which validate the data-types and the data-type relationships; annotating tuples in the database instance using the valid holistic pattern, wherein the method annotates the tuples with annotations indicating at least one of: knowledge base validated; jointly validated, wherein the crowd of users at least partially validate the holistic pattern; or erroneous; and repairing the erroneous annotated tuples to generate a clean database instance.

Description

Title: Data Cleaning Methods and Systems

Description of Invention

Background
As businesses generate and consume more data than ever, enforcing and maintaining data quality is of paramount importance to any organization. For example, in healthcare, inaccurate or incorrect data may threaten patient safety. Data cleaning is thus a key task towards improving data quality and is estimated to cost between 30% and 80% of a typical data warehouse project.
Data cleaning leads to correct information, which in turn results in quick and reliable searches. The performance increase resulting from clean data is well known in the art.
Various constraints are used to check data integrity. The variety of integrity constraints (ICs) includes check constraints, conditional functional dependencies (CFDs), and denial constraints (DCs), which are used to capture data errors in the form of IC violations.
A sample problem uses Fig 1 (Table 1) as a data set in which some tuples have erroneous attribute values. Consider a functional dependency f: B → C defined on the soccer players in Fig 1; it states that column B uniquely determines C. Tuples (t1, t3) violate f since they have the same B-value (Italy) but different C-values (Rome for t1[C] and Madrid for t3[C]). The error (Madrid in t3[C]) is captured by a violation involving the four values {t1[B], t1[C], t3[B], t3[C]}. Automatic repair methods resolve such violations by finding a consistent database with minimum changes; randomly changing one value in {t1[B], t1[C], t3[B], t3[C]} would be such a change. However, it is known in the art that such methods cannot guarantee correct fixes (i.e., the ground truth). To guarantee correct fixes, it is common practice to consult an authority, including trustworthy master data (i.e., knowledge bases, KBs) and domain experts. However, leveraging these two resources in data cleaning faces several limitations. (1) Experts have limited capacity, and thus cannot manually go through all violations and resolve them; (2) the semantics of ICs is hard to understand for people not familiar with databases: in our experience with IC validation, even simple ICs like FDs are not comprehensible to domain experts without a database background; and (3) KBs often have low coverage w.r.t. the database instance.
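For illustration only (this is not the patented method, and the helper names are hypothetical), a violation of f: B → C can be detected by grouping tuples on their B-value and flagging groups whose C-values disagree:

from collections import defaultdict

def fd_violations(tuples, lhs, rhs):
    # Group tuples by their lhs value and report groups whose rhs
    # values disagree, i.e., violations of the FD lhs -> rhs.
    groups = defaultdict(list)
    for t in tuples:
        groups[t[lhs]].append(t)
    return [g for g in groups.values()
            if len({t[rhs] for t in g}) > 1]

rows = [
    {"A": "Paolo Rossi", "B": "Italy", "C": "Rome"},       # t1
    {"A": "Klate", "B": "South Africa", "C": "Pretoria"},  # t2
    {"A": "Andrea Pirlo", "B": "Italy", "C": "Madrid"},    # t3
]
# t1 and t3 share B = Italy but differ on C, so they are reported
# together as one violating group.
print(fd_violations(rows, "B", "C"))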
Suppose a KB states that Italy hasCapital Rome and Spain hasCapital Madrid; this information cannot be directly used to fix the FD violation of t1 and t3, because more evidence is needed to identify which cell is erroneous.
The present invention seeks to provide an improved computer implemented method and a system to clean a database instance. A computer implemented method for cleaning a database instance using a plurality of holistic patterns, the database instance comprising a plurality of dirty tuples with unknown attribute data types, the method comprising:
generating a plurality of candidate holistic patterns using the database instance and a knowledge base, the knowledge base comprising data-types and data-type relationships;
determining a valid holistic pattern from the plurality of candidate holistic patterns using at least one of: the knowledge base; and a crowd of users which validate the data-types and the data-type relationships;
annotating tuples in the database instance using the valid holistic pattern, wherein the method annotates the tuples with annotations indicating at least one of: knowledge base validated; jointly validated, wherein the crowd of users at least partially validate the holistic pattern; or erroneous; and repairing the erroneous annotated tuples to generate a clean database instance.
One embodiment of the present invention addresses the above mentioned issues and is an end-to-end data cleaning system that effectively leverages prevalent trustworthy master data, i.e., KBs, and judiciously involves users to resolve conflicts and remove ambiguities. The resulting clean data instance improves the performance of the database in retrieving data, and reduces the load on the underlying hardware, by reducing the number of incorrect query results returned. We present three main modules which integrate with each other to perform the end-to-end cleaning: a holistic pattern module; a validation module; and an annotation module.
Drawings
So that the present invention may be more readily understood, embodiments of the present invention will now be described, by way of example, with reference to the accompanying drawings, in which:
Figure 1 is a sample dataset table T for soccer players.
Figure 2 is a flow diagram of the end-to-end workflow for updating a table T and enriching a knowledge base KB.
Figure 3 shows examples of holistic patterns derived using the holistic pattern module and example validations.
Figure 4 shows the coverage of attributes in the table T by a knowledge base in a holistic pattern.
Figure 5 shows a schematic of top-k results of various column data-types and their support scores.
Figure 6 shows candidate holistic pattern repairs for updating the tuple attributes in T.
Figure 7 shows results for the precision of holistic patterns using different ranking methods to generate top-k holistic patterns.
Figure 8 shows validated holistic patterns for the different data sets used in the experiments.
Figure 9 shows the data coverage of the knowledge bases of three different data sets, by three aspects of the preferred embodiment.
Figure 10 shows the effectiveness of the possible repairs using recall, defined as the number of correct possible repairs over the number of dirty tuples.
We present an overview of each of the modules which fit into the end-to-end system.
Holistic pattern module overview
We introduce holistic patterns; holistic patterns convey direct mappings from a given data instance to the constructs of a KB. Holistic patterns are used to explain relationships between tables. The coherence of the binary relationships, taken together, produces the pattern. The holistic pattern is more than the sum of its parts and represents a key component in determining the quality of fixes in a given data instance. For example, holistic patterns recognize that the types of columns A, B, and C are Person, Country, and Capital in the KB, respectively, and that two relationships are in place between these columns in the KB, namely, A is related to B via nationality and B is related to C via hasCapital. Given such patterns, we may immediately detect that t3 might be wrong, since it does not conform to the master data, in this case the KB, which states that Italy hasCapital Rome and Pirlo nationality Italy.
To determine holistic patterns, the data values in the database instance I are mapped to those in the KBs. There may be more than one possible mapping for a data value. For example, "Rome" could be either a city, a capital, or a club in the KBs. These ambiguous mappings generate several holistic patterns. The holistic patterns are used to explain table semantics using KBs (e.g., Yago or RxNorm). Each holistic pattern assigns a type to a column and a binary relationship to a pair of columns, up to the coverage provided by the KB.
Attributes in a holistic pattern are connected by binary relationships. This global information can be exploited using several methods; for example, in a preferred embodiment we use a rank-join based method to efficiently discover holistic patterns taking into account their coherence.
At the same time, to verify KB data, or to supplement it where there is inadequate coverage by the KB, the types and binary relationships in a holistic pattern are easily verifiable by humans using crowdsourcing.
Validation module overview
There are several possibilities of matching different types and binary relationships between a data set and a KB. This produces a plurality of candidate holistic patterns. In the present invention we use at least one of: crowdsourcing; and the KB to validate the best holistic pattern. Determining the best holistic pattern is discussed in the detailed description. Preferably, the crowd, users, are used to validate the best holistic pattern from the several candidate holistic patterns determined by the holistic pattern module.
The validation module preferably uses crowd involvement to validate holistic patterns, as well as to validate data when the KBs do not have enough coverage. To effectively involve the crowd, the validation module generates questions, preferably automatically, to ask the crowd. The questions must be easy to understand. Moreover, crowd involvement is optimized to reduce cost, as discussed in the detailed description.
KBs are usually incomplete in terms of covering all values in I, which complicates several tasks. First, it makes it more complex to find correct holistic patterns. Second, in case of a mismatch, it is unclear whether the values in I cannot be found in the KB because they are incorrect or because the KB is incomplete. Third, it makes data repairing harder, since the KB does not have enough correct values to suggest as possible repairs.
The validation module presents the crowd with specific questions to enable the validation module to select a valid holistic pattern from the candidate holistic patterns. Preferably, the validation module minimises the number of questions to ask and dynamically schedules the questions; this can be achieved, for example, using entropy theory.
The validation module maximizes the uncertainty reduction over the candidate holistic patterns and outputs a valid holistic pattern.

Annotation module overview
We annotate data of the dataset with different categories (i) correct data validated by the KB; (ii) correct data jointly validated by the KB and the crowd; and (iii) erroneous data jointly identified by the KB and the crowd. Data in categories (i) and (ii) are traditionally obtained through cumbersome manual work; these annotated correct data are an important yet scarce resource for IC discovery methods as well as machine learning-based data cleaning solutions. A positive side-effect from (ii) is that the crowd provides new facts, which can be used to enrich the KB. The validated holistic pattern and annotated data may be used to generate possible repairs and to further annotate erroneous data identified in (iii).
Using the valid holistic pattern and a given dataset to clean, we distinguish between the mismatches, i.e., data not conforming to the holistic pattern, due to the lack of coverage in the KB, and the actual data errors. In the first case, we enrich the KB, and in the latter we suggest possible fixes for actual errors.
We now discuss embodiments of the invention for each module and preferred aspects.
Referring now to Figure 2, there is shown: a data set with table T (201); a knowledge base KB (202), although this can be a collection of KBs; the holistic pattern module (203); the validation module (204); the annotation module (205); a crowd of users (206); a clean data set T' (207); and an enriched KB K (208), where enrichment comprises updating and/or adding knowledge to the knowledge base.
• The holistic pattern module (203) determines holistic patterns between a table T (201) and a KB (202).
• The validation module (204) leverages crowdsourcing to select the best holistic pattern.
• The annotation module (205) uses the selected holistic pattern and interacts with the KB and the crowd to annotate data. It may also generate possible repairs for erroneous tuples. New facts verified by the crowd are used to enrich the KBs.
To better understand how the present invention works, we use examples based on the above embodiment of the invention.
Example 1: Consider the table T for soccer players in Fig 1. T has no typed table header (101), i.e., each column is untyped, and thus its semantics is completely unknown. A KB K does have type information (e.g., Yago) and contains some information related to T, although it may not completely cover the table T.
The holistic pattern module determines candidate holistic patterns, with each containing the types (or classes) of columns and the relationships between pairs of columns.
A holistic pattern can be represented as a labelled graph (301). A node represents an attribute and its associated type, e.g., "C (Capital)" means that the type of attribute C in KB K is Capital; a directed edge between two nodes represents the relationship between two attributes, e.g., "B hasCapital C" means that the relationship from B to C in KB K is hasCapital. A column could have multiple candidate types, e.g., C could also be of type City. However, knowing that the relationship from B to C is hasCapital indicates that Capital is a better choice for C than City. Note that KBs are often incomplete. Hence, the discovered patterns may not cover all attributes of a table, e.g., attributes G, H of table T are not described by the pattern in (301).
Consider a case where the holistic pattern module finds two similar patterns: the one in (301), and a variant with type location for column C (302). To select the best holistic pattern, we forward the candidate holistic patterns to the validation module and ask the crowd the question "Which type (Capital or location) is more precise for the values (Rome, Pretoria and Madrid)?". If the crowd picks Capital, the holistic pattern in (301) is selected.
The crowd thus validates and selects the holistic pattern in (301). The valid holistic pattern is forwarded to the data annotation module, where each tuple is annotated. In one aspect of this embodiment there are three possible annotations, although additional annotations are possible.
The annotation module annotates the tuples in the data set using one of the following:
(i): Validated to be correct by the KB. For example, by mapping tuple t1 in table T to K, we find a full match, shown in (303). It means that, e.g., Paolo Rossi (resp. Italy) is in K as a Person (resp. country), and the relationship from Paolo Rossi to Italy is nationality. Similarly, all other information for t1 w.r.t. attributes A-F can be found in K. In such a case, we consider t1 to be correct w.r.t. the given pattern in (301), relative to attributes A-F.
(ii): Jointly validated to be correct by the KB and the crowd. The case in (i) is not always found and validated, in which case we annotate the data to be jointly validated by the KB and the crowd. Consider t2 about Klate, whose explanation is depicted in (304). In KB K, we can find that South Africa is a country and Pretoria is a Capital. However, the relationship from South Africa to Pretoria is missing. In such a case, we ask the crowd a boolean question such as "Does South Africa hasCapital Pretoria?". A positive answer from the crowd completes the semantic mapping that is missing from K. In this case, we consider t2 correct and generate a new fact "South Africa hasCapital Pretoria" from the crowd.
The new fact is used to enrich the KBs.
(iii): Erroneous tuple. Similar to case (ii), for tuple t3 about Pirlo, there is no link from Italy to Madrid in K. We then ask the crowd a boolean question such as "Does Italy hasCapital Madrid?". A negative answer from the crowd confirms that there is an error in t3; that is, the problem is not a missing relationship in the KB as in case (ii). At this point, however, we cannot decide which value in t3 is wrong, Italy or Madrid. The annotation module then derives evidence from K related to the relationship and nodes, e.g., Italy hasCapital Rome and Spain hasCapital Madrid, joins them, and generates a ranking of possible repairs for this tuple based on the number of changes. The tuples in T (Fig 1) are then updated using the relevant annotations, and where appropriate the KB is enriched using the answers from the crowd.
Detailed Description

We continue to use the example data set in Fig 1 in the description of embodiments of the present invention.
In a preferred embodiment of the present invention we consider knowledge bases (KBs) as RDF-based data consisting of resources, whose schema is defined using the Resource Description Framework Schema (RDFS), although we are not limited to RDFS.
A resource is an identifier for a real-world entity, e.g., Paolo Rossi, Rome, and Italy. A literal is a string, date, or number, e.g., 1.78. A property is a binary predicate that represents a relationship between two resources or between a resource and a literal. We denote by P(x, y) the property (a.k.a. relationship) between resource x and resource (or literal) y. For example, locatedIn(Milan, Italy) indicates that Milan is located in Italy.
An RDFS ontology distinguishes between classes and instances. A class is a resource that represents a set of objects, e.g., the class of countries. A resource that is a member of a class is called an instance of that class. The type relation associates an instance with a class, e.g., type(Italy) = country.
A more specific class c can be specified as a subclass of a more general class d by using the statement subclassOf(c, d). This means that all instances of c are also instances of d, e.g., subclassOf(Capital, location). Similarly, a property P1 can be a sub-property of a property P2 by the statement subpropertyOf(P1, P2). In this work, we consider KBs that are expressed with RDFS. While it is possible to use more expressive languages such as OWL, which could offer more reasoning opportunities, most KBs in industry, as well as popular ones such as Yago, Freebase, and DBpedia, use RDFS.
A holistic pattern of a relation R may be represented by a labelled directed graph G(V, E) with nodes V and edges E. Each node u ∈ V corresponds to an attribute in R, possibly typed, and each edge (u, v) ∈ E from u to v has a label P, denoting the relationship between the two attributes that u and v represent. We use Ai ∈ R to denote that Ai is an attribute in R. There are two basic semantic annotations on a relational table.
(1) Type of an attribute Ai. The type of an attribute is an annotation that represents the class of the attribute values in Ai. For example, the type of attribute B in (301) is country.
(2) Relationship from attribute Ai to attribute Aj. The relationship between two attributes is an annotation that represents how Ai and Aj are related through a directed binary relationship. For example, the relationship from attribute B to C in (301) is hasCapital.
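A holistic pattern can thus be held in a very small structure; the sketch below (illustrative field names, not taken from the patent) records the typed nodes and labelled edges of the pattern in (301):

from dataclasses import dataclass, field

@dataclass
class HolisticPattern:
    types: dict = field(default_factory=dict)  # attribute -> type
    rels: dict = field(default_factory=dict)   # (attr_u, attr_v) -> property

# Pattern (301), in part: B is a country, C is a Capital,
# and B hasCapital C.
phi = HolisticPattern()
phi.types["B"] = "country"
phi.types["C"] = "Capital"
phi.rels[("B", "C")] = "hasCapital"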
In the preferred embodiment a holistic pattern is a connected graph. When there exist multiple disconnected patterns, i.e., two holistic patterns that do not share any common node, we treat them independently. In the preferred embodiment we focus on discussing the case of a single holistic pattern.
For a holistic pattern φ, we denote by φu a node u in φ, by φ(u,v) an edge in φ, by φV all nodes in φ, and by φE all edges in φ.
Semantics. A tuple t of R matches a holistic pattern φ (with m nodes {v1, ..., vm}) w.r.t. a knowledge base K if there exist m attributes {A1, ..., Am} in R and m resources {x1, ..., xm} in K such that:
1. there is a one-to-one mapping from Ai (and xi) to vi for i ∈ [1, m];
2. Dist(t[Ai], xi) ≤ θ and subclassOf(type(xi), type(vi));
3. for each edge (vi, vj) in φE with property P, there exists a property P' between the corresponding resources xi and xj in K such that subpropertyOf(P', P).
If t matches φ, each corresponding attribute value of t maps to a resource r in the KB under a given distance function, and r is of a (sub-)type of the type given in φ (conditions 1 and 2). Moreover, for each relationship P that appears in the pattern, the property between the two corresponding resources must be P or one of its sub-properties (condition 3).
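A sketch of checking these three conditions, reusing the HolisticPattern structure above; kb is a hypothetical interface over the knowledge base (not a real library API), and theta is the distance threshold of condition 2:

def matches(t, phi, kb, theta=0.0):
    # kb is assumed to expose: lookup(label, max_dist) returning the
    # closest resource, type_of(resource), is_subclass(c, d) over the
    # transitive subclassOf relation, is_subproperty(p, q), and
    # properties_between(x, y).
    res = {}
    for attr, T in phi.types.items():          # conditions 1 and 2
        x = kb.lookup(t[attr], max_dist=theta)
        if x is None or not kb.is_subclass(kb.type_of(x), T):
            return False
        res[attr] = x
    for (u, v), P in phi.rels.items():         # condition 3
        if not any(kb.is_subproperty(p, P)
                   for p in kb.properties_between(res[u], res[v])):
            return False
    return True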
Example 2: Consider tuple t1 in Fig 1 and the pattern φs in (301). We have that t1 matches φs since, as depicted in (303), for each attribute value (e.g., t1[A] = Paolo Rossi and t1[B] = Italy) there is a resource in K that has a similar value with the corresponding type (e.g., Person for Paolo Rossi and country for Italy), satisfying conditions 1 and 2, and the property nationality holds from Paolo Rossi to Italy in K (condition 3). Similarly, conditions 1-3 hold for the other attribute values in t1. Hence, t1 matches φs.
We say that a tuple t of R partially matches a holistic pattern φ w.r.t. a knowledge base K if at least one of the conditions 1-3 does not hold.
Example 3: Consider t2 in Fig 1 and φs in (301). We say that t2 partially matches φs, since the property hasCapital from t2[B] = South Africa to t2[C] = Pretoria does not exist in K, i.e., condition 3 does not hold.
Given a table T (201), a KB K (202), and a holistic pattern φ between them, Fig. 4 shows how T's attributes and tuples are covered by the holistic pattern:
(1) Attributes covered by K (401). Attributes A-F in Fig. 4 are covered by the pattern in (301). We consider two cases for the tuples.
1. Fully covered by K. We annotate such tuples as semantically correct w.r.t. φ and K (303).
2. Partially covered by K. We use crowdsourcing to verify whether the uncovered information is caused by the incompleteness of K (304) or by actual data errors (305).
(2) Attributes not covered by K (402). Attributes G and H in Fig. 4 are not covered by the pattern in (301). In this case, these attributes are not annotated, due to the missing information in K.
For the uncovered attributes, we ask the crowd open-ended questions, such as " what are the possible relationships between Paolo Rossi and Striker?".
Candidate holistic patterns
The holistic pattern module determines candidate holistic patterns of a dataset, for example table T, from a KB K in the preferred embodiment. We first describe how to generate candidate types and candidate relationships. We then discuss the scoring function used to rank candidate holistic patterns. We use a rank-join method in the preferred embodiment to efficiently compute the top-k candidate holistic patterns from the candidate types/relationships.
For generality of determining a holistic pattern, we use an instance-based approach that does not require the availability of meaningful labels for the columns. For each column Ai of table T and for each value t[Ai] of a tuple t, we can map this value to several resources in the KB K whose type can be extracted. We issue the following SPARQL query, which returns the types and supertypes of entities whose label (i.e., value) is t[Ai]:

select ?ci where { ?xi rdfs:label t[Ai] . ?xi rdf:type/rdfs:subClassOf* ?ci }
Similarly, the relationships between two values t[Ai] and t[Aj] can be retrieved from a KB K via the two following SPARQL queries:

Q_res: select ?Pij where { ?xi rdfs:label t[Ai] . ?xj rdfs:label t[Aj] . ?xi ?Pij/rdfs:subPropertyOf* ?xj }

Q_lit: select ?Pij where { ?xi rdfs:label t[Ai] . ?xi ?Pij/rdfs:subPropertyOf* t[Aj] }

Query Q_res retrieves relationships where the second attribute is a resource in the KBs; Q_lit retrieves relationships where the second attribute is a literal value, i.e., untyped in the KBs.
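As one possible way to issue such queries against an RDF endpoint, here is a sketch using the SPARQLWrapper Python library; the endpoint URL is an assumption, and the naive string substitution stands in for proper query parameterization:

from SPARQLWrapper import SPARQLWrapper, JSON

def candidate_types(value, endpoint="http://localhost:8890/sparql"):
    # Return the types and supertypes of resources labelled `value`.
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery("""
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?c WHERE {
          ?x rdfs:label "%s" .
          ?x rdf:type/rdfs:subClassOf* ?c .
        }""" % value)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return [r["c"]["value"] for r in rows]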
We use Example 4 to describe an embodiment of the holistic pattern module using the above queries. Both Italy and Rome are stored as resources in the KBs, thus their relationship hasCapital can be discovered by query Q_res; numerical values such as 1.78 are stored as literals in the KBs, thus the relationship between Paolo Rossi and 1.78, if available, would be discovered by query Q_lit. In addition, for two values t[Ai] and t[Aj], we consider them as an ordered pair, thus in total four queries are issued between two values.
Given table T (201) and KB K (202), the support of a candidate type Tt for column Ai, denoted sup(Tt, Ai), is the percentage of tuples whose values on Ai are of type Tt. Similarly, the support of a candidate relationship P for column pair Ai and Aj, denoted sup(P, Ai, Aj), is the percentage of tuples whose two values on attributes Ai and Aj abide by the relationship P.
The candidate types and relationships are ranked according to their support. When two candidate types have the same support, we rank the more discriminative type higher, i.e., the one with fewer instances in K.
Example 5: Consider attribute B in Fig. 1. The top-3 candidate types for B in Yago (not shown) are economy, country and state, with support 1.0, 1.0 and 0.66, respectively. While economy and country have the same support, the former is ranked higher: economy has 356 resources in Yago, which makes it more discriminative than country, which has 5504 resources.
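A direct way to compute the type support scores, assuming a types_of(v) helper that returns the candidate types a value maps to in the KB (a sketch, not the patented implementation; relationship support is computed analogously over value pairs):

from collections import Counter

def type_support(column_values, types_of):
    # sup(T, A): fraction of tuples whose value in column A maps to
    # type T in the KB.
    counts = Counter()
    for v in column_values:
        for T in set(types_of(v)):
            counts[T] += 1
    n = len(column_values)
    return {T: c / n for T, c in counts.items()}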
A holistic pattern contains types of attributes and properties between attributes. The space of all candidate holistic patterns is huge (up to the Cartesian product of all possible types and relationships), which is expensive for human verification. Since not all candidate patterns make sense in practice, we only generate top-k holistic patterns, ranked by a meaningful scoring function. An example of such a function is the naive scoring model, which scores a candidate holistic pattern φ, consisting of a type Ti for each column Ai and a relationship Pij for each column pair Ai and Aj, by simply adding up all support scores of the candidate types and relationships in φ:

naiveScore(φ) = Σi sup(Ti, Ai) + Σi,j sup(Pij, Ai, Aj)
However, columns are not independent of each other. The choice of the type for a column Ai affects the choice of the relationship for column pair Ai and Aj, and vice versa.
Example 6: Consider the two columns B, C in Fig. 1: B has candidate types economy, country and state; C has candidate types City and Capital; and B, C has candidate relationship hasCapital. Intuitively, country as a candidate type for column B is more compatible with hasCapital than economy, because capitals are associated with countries, not economies. In addition, Capital is more compatible with hasCapital than City, because not all cities are capitals.
To quantify the "compatibility" between a type T and a relationship P, where T serves as the type for the resources appearing as subjects of the relationship P, we introduce the coherence score subSC(T, P). Similarly, to quantify the "compatibility" between a type T and a relationship P, where T serves as the type for the entities appearing as objects of the relationship P, we introduce the coherence score objSC(T, P).
Let resource(T) denote the set of resources in K that are of type T, subResource(P) the set of resources in K that participate as subject resources in a triple whose predicate is P, and objResource(P) the set of resources in K that participate as object resources in a triple whose predicate is P. Formally, for a given type T and a given relationship P in the KB K, we define the semantic coherence of T as a subject for P as

subSC(T, P) = |resource(T) ∩ subResource(P)| / |resource(T) ∪ subResource(P)|

and the semantic coherence of T as an object for P as

objSC(T, P) = |resource(T) ∩ objResource(P)| / |resource(T) ∪ objResource(P)|

Intuitively, the above scores reflect how suitable a type T is for all those resources that appear as the subject or the object of P.
Example 7: We show some example coherence scores computed from Yago. As one can see, the scores reflect our intuition in Example 6: country is more suitable than economy to act as a type for the subject resources of hasCapital, and Capital is more suitable than City to act as a type for the object resources of hasCapital:

subSC(economy, hasCapital) < subSC(country, hasCapital) = 0.17
objSC(City, hasCapital) < objSC(Capital, hasCapital)

However, as one can see, the coherence scores are relatively small. This is because KBs are usually incomplete. For example, if every country appearing in the KB had its capital stored, then subSC(country, hasCapital) would be 1.0.
We can add a weight factor wc to boost the contribution of the coherence scores. In the experiments we used a factor of two; alternative embodiments may use any suitable factor.
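As an illustration, a minimal Python sketch of the two coherence scores as set-overlap (Jaccard) computations is given below; the resource sets are assumed to have been materialized from K beforehand, which is an assumption of the sketch.

    def jaccard(s1, s2):
        """|s1 ∩ s2| / |s1 ∪ s2|, with an empty union scored as 0."""
        union = s1 | s2
        return len(s1 & s2) / len(union) if union else 0.0

    def sub_sc(type_resources, prop_subjects):
        """subSC(T, P): overlap between resource(T) and subResource(P)."""
        return jaccard(type_resources, prop_subjects)

    def obj_sc(type_resources, prop_objects):
        """objSC(T, P): overlap between resource(T) and objResource(P)."""
        return jaccard(type_resources, prop_objects)

    # Example 7 in code: subSC(country, hasCapital) would be 1.0 if every
    # country in K had its capital stored; KB incompleteness lowers it.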
Taking the coherence of types and relationships into account, we define the score of a holistic pattern φ as:

score(φ) = Σi sup(Ti, Ai) + Σi,j sup(Pij, Ai, Aj) + wc · Σi,j (subSC(Ti, Pij) + objSC(Tj, Pij))

Using this scoring model of holistic patterns, in an embodiment we retrieve the top-k holistic patterns without having to enumerate all candidate ones. We formulate this as a rank-join problem.
The preferred embodiment of the present invention takes as input to the holistic pattern module a table T, a KB K, and a number k, and produces the top-k holistic patterns using Method 1.
Method 1
Input: A table T, a KB K, and a number k.
Output: Top-k holistic patterns based on their scores
1: types(Ai) ← get a ranked list of candidate types for Ai
2: properties(Ai, Aj) ← get a ranked list of candidate relationships for Ai and Aj
3: let P be the top-k holistic patterns, initialized empty
4: for all Ti ∈ types(Ai), and Pij ∈ properties(Ai, Aj) do
5:   if |P| ≥ k and TypePruning(Ti) then
6:     continue
7:   generate all holistic patterns P′
8:   compute the score for each holistic pattern φ in P′
9:   update P using P′
10:  compute the upper bound B of unseen patterns, and let φk ∈ P be the holistic pattern with the lowest score
11:  halt when score(φk) ≥ B
12: return P
To start, each input list, i.e., candidate types for a column and candidate relationships for a column pair, is ordered according to the respective supports (lines 1-2). Two lists are joinable if they agree on one common column, and the scoring for joining is the coherence score between a type and a relationship. The rank-join algorithm scans the ranked input lists in descending order of their support scores (lines 3-4); holistic patterns are generated incrementally as we move down the input lists. Holistic patterns that cannot be used to produce top-k patterns are pruned (lines 5-6). For each join result, i.e., each holistic pattern φ, the score score(φ) is computed (lines 7-8). We also maintain an upper bound B on the scores of all unseen join results, i.e., holistic patterns (line 10). We terminate the join process if either we have exhaustively scanned every input list, or we have obtained top-k holistic patterns and the score of the kth holistic pattern is greater than or equal to B (line 11). Since each list is ranked, B can be computed by adding up the support scores at the current positions in the ranked lists, plus the maximum coherence scores of all relationships.
Lines 5-6 in Method 1 check whether a candidate type Ti for column Ai can be pruned, without generating holistic patterns involving Ti, by calling Method 2.
The intuition behind type pruning (Method 2) is that a candidate type Ti is useful only if it is more coherent with some relationship Pij than previously examined types for Ai. We first calculate the current minimum sum of coherence scores involving column Ai in the current top-k patterns (line 1). We then calculate the maximum possible sum of coherence scores involving type Ti (line 2). Ti can be pruned if we are sure that this maximum sum is still smaller than the current minimum sum of coherence scores involving column Ai (lines 3-6).

Example 8: Consider the rank-join graph in Fig. 5 for a table with just two columns B and C as in (100). Support scores are shown in the braces. Consider k = 2. Suppose we are currently examining type state for column B. We do not need to generate holistic patterns involving state, since the maximum coherence between state and hasCapital or locatedIn is less than the current minimum coherence score between the type of column B and the relationship between B, C in the current top-2 patterns. Suppose we are currently examining type whole for column C, and we have reached type state for B and hasCapital for relationship B, C. The bound score B = 0.7 + 0.9 + 0.5 + 0.18 + 0.12 = 2.4, where 0.18 (0.12) is the maximum coherence score between any type in types(B) (types(C)) and any relationship in properties(B, C). B is smaller than score(φ2) = 1.0 + 0.9 + 0.9 + 0.09 + 0.12 = 3.01. Therefore, we terminate early without having to go over each list.
Method 2
Input: current top-k holistic patterns P, candidate type Ti.
Output: a boolean value; true/false means that Ti can/cannot be pruned
1: curMinCohSum(Ai) ← minimum sum of all coherence scores involving column Ai in the current top-k P
2: maxCohSum(Ti) ← maximum sum of all coherence scores if the type of column Ai is Ti
3: if maxCohSum(Ti) < curMinCohSum(Ai) then
4:   return true
5: else
6:   return false
Method 1 is guaranteed to produce the top-k holistic patterns, since we keep the current top-k patterns in P and terminate only when we are sure that the join will not produce any new holistic pattern with a higher score. In the worst case, we still have to exhaustively go through all the ranked lists to produce the top-k holistic patterns. However, in most cases the top ranked holistic patterns involve only candidate types/relationships with high support, which are at the top of the lists. Since computing the coherence score for a type and a relationship is an expensive operation that requires set intersection and set union, we compute offline the coherence score for every type and every relationship in K. For each relationship, we also keep the maximum coherence score it can achieve with any type, to efficiently compute the bound B.
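The following Python sketch illustrates the rank-join of Method 1, simplified to a two-column table and omitting the type pruning of Method 2; the sorted input lists and the precomputed coherence maps are assumptions of the sketch, and a weight factor of wc = 2 is used as in the experiments.

    import heapq
    import itertools

    def rank_join_top_k(types_b, types_c, rels, sub_sc, obj_sc, k, wc=2.0):
        """A sketch of Method 1 for a two-column table. types_b, types_c and
        rels are lists of (candidate, support) sorted by descending support;
        sub_sc/obj_sc map (type, relationship) pairs to coherence scores."""
        max_sub = max(sub_sc.values(), default=0.0)
        max_obj = max(obj_sc.values(), default=0.0)
        top, seen = [], set()          # min-heap of (score, pattern)
        depth = max(len(types_b), len(types_c), len(rels))
        for d in range(depth):
            prefix_b, prefix_c, prefix_r = types_b[:d+1], types_c[:d+1], rels[:d+1]
            for (tb, sb), (tc, sc), (p, sp) in itertools.product(
                    prefix_b, prefix_c, prefix_r):
                pat = (tb, tc, p)
                if pat in seen:
                    continue
                seen.add(pat)
                # score(φ) = supports + wc * (subSC + objSC)
                s = sb + sc + sp + wc * (sub_sc.get((tb, p), 0.0)
                                         + obj_sc.get((tc, p), 0.0))
                heapq.heappush(top, (s, pat))
                if len(top) > k:
                    heapq.heappop(top)   # keep only the k highest scores
            # Upper bound B on any unseen pattern: frontier supports plus the
            # best coherence any type/relationship pair can achieve (line 10).
            bound = (prefix_b[-1][1] + prefix_c[-1][1] + prefix_r[-1][1]
                     + wc * (max_sub + max_obj))
            if len(top) == k and top[0][0] >= bound:
                break                    # early termination (line 11)
        return sorted(top, reverse=True)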
Once the top-k candidate holistic patterns are generated, they are forwarded to the validation module to be validated.
Validation module
The validation module takes the set P of candidate holistic patterns, a table T, a KB K, and a crowdsourcing framework (e.g., MTurk) as input, and identifies the most appropriate holistic pattern for T w.r.t. K, with the objective of minimizing the number of questions posed to the crowd.
A naive approach to generating crowdsourcing questions is to express the candidate holistic patterns as questions to the crowd, who would then select the best one. However, crowd workers can only answer simple questions, and holistic pattern graphs can be hard for crowd users to validate (e.g., 301). Hence, we decompose holistic patterns into two simple tasks: (1) column type validation, i.e., validating the type of a column; and (2) column pair relationship validation, i.e., validating the relationship between two columns.
In column type validation we use a set of candidate types candT(Ai) for column Ai; one type Ti ∈ candT(Ai) needs to be chosen. We consider a type Ti to be the correct type for Ai if the crowd has confirmed that at least an ε fraction of the tuples in T have type Ti in column Ai.
We formulate the following question to the crowd about the type of a column: What is the most accurate type of the highlighted column?, along with kt randomly chosen tuples from T and all candidate types from candT(Ai). A sample question is given below.
Q1: What is the most accurate type of the highlighted ( ) column? (A, (B), C, D, E, F, ...)
(Paolo Rossi, (Italy), Rome, Hellas Verona, Italian, Proto, ...)
(Andrea Pirlo, (Italy), Madrid, Juventus, Italian, Flero, ...)
o country
o economy
o state
Method 3, referred to as CrowdTypeVal, takes as input a table T and a set of candidate types candT(Ai) for a column Ai, and outputs the most appropriate type for this column according to the crowd. Since we require at least an ε fraction of tuples supporting a type, the algorithm first generates ε|T|/kt questions asking about the type of column Ai in ε|T| tuples, where |T| is the number of tuples in T and kt is the number of tuples shown in one question (e.g., kt = 3 in Q1) (lines 1-2). Votes for each candidate type are collected (lines 3-5), and the candidate type Ti with the maximum number of votes is recorded (line 6). If Ti's votes do not exceed ε|T|, additional questions are issued for the remaining set of tuples that have not been asked, until Ti's votes exceed ε|T| or there are no more tuples left (lines 8-12). If Ti reaches the minimum vote threshold, it is selected as the type for Ai (lines 13-14). Otherwise, none of the candidates in candT(Ai) has the minimum number of votes; the type of Ai is not validated, and we remove the type of Ai from all candidate holistic patterns (lines 15-16). Thus, the total number of questions asked is at most ⌈|T|/kt⌉, since no tuple is shown in more than one question.
Method 3
Input: Table T of relation R(A1, ..., An), and candT(Ai)
Output: One type Ti for column Ai
1: let Vote(x) = 0, for each x ∈ candT(Ai)
2: generate ε|T|/kt questions Q
3: for all questions qi ∈ Q do
4:   x ← the answer of qi
5:   Vote(x) ← Vote(x) + kt
6: let Ti ∈ candT(Ai) be the type with maximum Vote
7: let Ir be the remaining set of tuples that have not been asked
8: while Vote(Ti) < ε|T| and Ir ≠ ∅ do
9:   generate another question q
10:  x ← the answer of q
11:  Vote(x) ← Vote(x) + kt
12:  update Ti and Ir
13: if Vote(Ti) ≥ ε|T| then
14:  return Ti
15: else
16:  return null
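A compact Python sketch of Method 3 follows; ask_crowd is a hypothetical callback that posts one question (showing kt tuples together with the candidate types) and returns the type the workers selected.

    import math
    import random
    from collections import Counter

    def crowd_type_val(tuples, cand_types, ask_crowd, eps=0.1, kt=3):
        """CrowdTypeVal (Method 3): pick the type for a column once it is
        supported by at least eps*|T| tuples' worth of votes, or return None
        if no candidate reaches the threshold."""
        votes = Counter({t: 0 for t in cand_types})
        remaining = list(tuples)
        random.shuffle(remaining)
        threshold = eps * len(tuples)

        def ask_one():
            nonlocal remaining
            sample, remaining = remaining[:kt], remaining[kt:]
            votes[ask_crowd(sample, cand_types)] += kt

        for _ in range(math.ceil(threshold / kt)):     # lines 1-5
            if remaining:
                ask_one()
        best = max(votes, key=votes.get)               # line 6
        while votes[best] < threshold and remaining:   # lines 8-12
            ask_one()
            best = max(votes, key=votes.get)
        return best if votes[best] >= threshold else None   # lines 13-16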
We validate the relationship for column pairs in a similar fashion, where the candidate type Ti ∈ candT(Ai) is replaced with a candidate relationship for the column pair Ai and Aj. The details are omitted due to space constraints. We give an example question for relationship validation as follows.
Q2: What is the most accurate relationship for the highlighted columns? (A, (B, C), D, E, F, ...)
(Paolo Rossi, (Italy, Rome), Hellas Verona, Italian, Proto, ...)
(Andrea Pirlo, (Italy, Madrid), Juventus, Italian, Flero, ...)
o B hasCapital C
o C locatedIn B
Candidate types and candidate relationships are stored as URIs in KBs ; these are not directly consumable by the crowd. For example, the type capital for Column C is stored as http://yago- knowledge.org/resource/wordnet_capital_10851850, and the relationship hasCapital for column pair B and C is stored as http://yago- knowledge.org/resource/hasCapital. We look for types and relationship descriptions, e.g., Capital and hasCapital , by querying the KB for the labels of the corresponding URIs. If no label exists, we process the URI itself by removing the text before the last slash and punctuation symbols.
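A rough Python sketch of this URI fallback is shown below; the exact punctuation handling is an assumption, since the source only states that the text before the last slash and punctuation symbols are removed.

    import re

    def uri_to_description(uri):
        """Fallback when no rdfs:label exists: keep the text after the last
        slash and strip punctuation/underscores and numeric id fragments,
        e.g. '.../wordnet_capital_10851850' -> 'wordnet capital' (roughly)."""
        tail = uri.rsplit('/', 1)[-1]
        words = re.split(r'[\W_]+', tail)
        return ' '.join(w for w in words if w and not w.isdigit())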
For each question, we also expose some contextual attribute values that help crowd workers better understand the question. For example, we expose the values for attributes C, D and E in question Q1 when we validate the type of attribute B. If the table has a small number of attributes, we simply put all attributes in the question; otherwise, we identify attributes that are related to the ones in the question.
The validation module minimises the number of column type and relationship validation questions. We minimise the total number of questions needed to reach the correct holistic pattern by scheduling which column (resp. column pair) to validate first. Recall that once a column type (resp. relationship) is validated, we can prune from P all holistic patterns that have a different type (resp. relationship) for that column (resp. column pair). Therefore, a natural choice is to choose those columns (resp. column pairs) that result in the maximum uncertainty reduction. Consider φ as a variable, which takes values from P = {φ1, φ2, ..., φn}. We translate the score associated with each holistic pattern to a probability by normalizing the scores, i.e.,

Pr(φ = φi) = score(φi) / Σj score(φj)

We define the uncertainty of φ w.r.t. P as the entropy:

HP(φ) = −Σφi∈P Pr(φ = φi) log2 Pr(φ = φi)
Example 9: Consider an input list of five holistic patterns P = {φ1, ..., φ5} as follows, with the normalized probability of each holistic pattern shown in the last column.

     type(B)   type(C)   P(B,C)       score   Prob
φ1   country   Capital   hasCapital   2.8     0.35
φ2   economy   Capital   hasCapital   2       0.25
φ3   country   City      locatedIn    2       0.25
φ4   country   Capital   locatedIn    0.8     0.1
φ5   state     Capital   hasCapital   0.4     0.05
We use variables vAi and vAiAj to denote the type of column Ai and the relationship between Ai and Aj, respectively. The set of all variables is denoted V. In Example 9,

V = {vB, vC, vBC}, vB ∈ {country, economy, state}, vC ∈ {Capital, City}, vBC ∈ {hasCapital, locatedIn}.
The probability of an assignment of a variable v = a is obtained by aggregating the probabilities of those holistic patterns that have that assignment for v. For example,

Pr(vB = country) = Pr(φ1) + Pr(φ3) + Pr(φ4) = 0.35 + 0.25 + 0.1 = 0.7,
Pr(vB = economy) = 0.25, and Pr(vB = state) = 0.05.
After validating a variable v to have value a, we remove from P those holistic patterns that have a different assignment for v. The set of remaining holistic patterns is denoted Pv=a. Suppose column B is validated to be of type country; then PvB=country = {φ1, φ3, φ4}. Since we do not know what value a variable will take, we measure the expected reduction of uncertainty of variable φ after validating variable v, formally defined as:

E(ΔH(φ))(v) = HP(φ) − Σa Pr(v = a) HPv=a(φ)
In each iteration, we choose the variable v (column or column pair) with the maximum expected uncertainty reduction, i.e., E(ΔH(φ))(v). Finding such a variable v would require us to iterate over all variables in V and calculate their E(ΔH(φ))(v), which has a complexity of O(|V||P|^2). The following theorem simplifies the calculation to a complexity of O(|V||P|).
Theorem 1. The expected uncertainty reduction after validating a column (or column pair) v is the same as the entropy of the variable:

E(ΔH(φ))(v) = H(v), where H(v) = −Σa Pr(v = a) log2 Pr(v = a).
Method 4 describes the overall procedure for pattern validation. At each iteration: (1) we choose the best variable vbest to validate next, based on the expected reduction of uncertainty of φ (lines 4-10); (2) we remove from Pre those holistic patterns whose assignment for the validated variable differs from the validated value a (lines 11-12); and (3) we renormalize the probability distribution of the remaining holistic patterns in Pre (line 13). We terminate when we are left with only one holistic pattern (line 3).
Method 4
Input: a set of holistic patterns P
Output: one holistic pattern φ ∈ P
1: let Pre be the remaining holistic patterns, initialized to P
2: initialize all variables V, representing columns or column pairs, and calculate their probability distributions
3: while |Pre| > 1 do
4:   Ebest ← 0
5:   vbest ← null
6:   for all v ∈ V do
7:     compute the entropy H(v)
8:     if H(v) > Ebest then
9:       vbest ← v
10:      Ebest ← H(v)
11:  validate the variable vbest; suppose the result is a; let Pv=a be the set of holistic patterns with v = a
12:  Pre ← Pv=a
13:  normalize the probability distribution of the holistic patterns in Pre
14: return the only holistic pattern φ in Pre
Example 10: To validate the five patterns in Example 9, we first calculate the entropy of every variable:

H(vB) = −0.7 log2 0.7 − 0.25 log2 0.25 − 0.05 log2 0.05 = 1.07, H(vC) = 0.81, and H(vBC) = 0.93.

Thus column B is validated; say the answer is country. The remaining set of holistic patterns, with their normalized probabilities, is:

φ1 (country, Capital, hasCapital): 2.8 / 5.6 = 0.5
φ3 (country, City, locatedIn): 2 / 5.6 ≈ 0.36
φ4 (country, Capital, locatedIn): 0.8 / 5.6 ≈ 0.14

Now Pre = {φ1, φ3, φ4}. The new entropies for all variables are: H(vB) = 0, H(vC) = 0.93 and H(vBC) = 1. Therefore, column pair B, C is chosen; say the answer is hasCapital. We are now left with only one pattern, φ1, thus we return φ1.
Thus, we do not need to validate vC under our scheduling strategy. Furthermore, after validating certain variables, other variables may become less uncertain, thus requiring fewer questions to validate.
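The entropy computation and the most-uncertain-variable-first loop can be sketched in Python as follows; validate is a hypothetical callback that runs the crowd validation of Method 3 (or its relationship analogue) for one variable and returns the validated value.

    import math
    from collections import defaultdict

    def entropy(patterns, var):
        """H(v): entropy of a column or column-pair variable. patterns is a
        list of (assignment, probability) pairs, where assignment maps each
        variable to its value in that pattern. By Theorem 1 this equals the
        expected uncertainty reduction of validating var."""
        mass = defaultdict(float)
        for assignment, prob in patterns:
            mass[assignment[var]] += prob
        return -sum(p * math.log2(p) for p in mass.values() if p > 0)

    def muvf(patterns, variables, validate):
        """Most-uncertain-variable-first scheduling (a sketch of Method 4)."""
        while len(patterns) > 1:
            v = max(variables, key=lambda u: entropy(patterns, u))
            a = validate(v)                                     # line 11
            patterns = [(asg, pr) for asg, pr in patterns if asg[v] == a]
            z = sum(pr for _, pr in patterns)
            patterns = [(asg, pr / z) for asg, pr in patterns]  # line 13
        return patterns[0][0]

    # Example 9 in this representation:
    # P = [({'vB': 'country', 'vC': 'Capital', 'vBC': 'hasCapital'}, 0.35), ...]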
Annotation module
Once a holistic pattern has been validated, the validated holistic pattern is used to annotate tuples in the data set. Annotated tuples may then be repaired where appropriate.
In the preferred embodiment, the annotation module annotates tuples into three categories: (i) correct data validated by KBs; (ii) correct data jointly validated by KBs and the crowd; and (iii) erroneous data. In the preferred embodiment the annotation module uses the following steps to annotate the tuples in the dataset. We use the table in Fig. 1 in the examples that follow.
Step 1: Validation by KBs. For each tuple t and a holistic pattern φ, a SPARQL query checks whether t is fully covered by K. If it is fully covered, the annotation module annotates it as a correct tuple validated by the KB (case (i)). Otherwise, it goes to Step 2.
Example 11: Consider tuple t1 in Fig. 1 and the holistic pattern in (301); the following boolean SPARQL query is run:
ASK {
(Italy) rdf:type/rdfs:subClassOf* (country) .
(Paolo Rossi) rdf:type/rdfs:subClassOf* (Person) .
(Hellas Verona) rdf:type/rdfs:subClassOf* (footballclub) .
(Rome) rdf:type/rdfs:subClassOf* (Capital) .
(Italian) rdf:type/rdfs:subClassOf* (language) .
(Proto) rdf:type/rdfs:subClassOf* (City) .
(Italy) (hasCapital) (Rome) .
(Paolo Rossi) (hasNationality) (Italy) .
(Paolo Rossi) (hasClub) (Hellas Verona) .
(Paolo Rossi) (bornIn) (Proto) .
(Proto) (locatedIn) (Italy) .
(Italy) (hasOfficialLanguage) (Italian) . }
Assume that the above query returns true, which validates that the KB has full coverage of tuple t1, i.e., t1 is annotated as correct w.r.t. the given holistic pattern.
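A minimal sketch of issuing such a boolean query from Python is shown below, assuming the KB is served over a SPARQL HTTP endpoint (e.g., a local Jena Fuseki instance at a hypothetical URL) and using the SPARQLWrapper library; both the endpoint and the use of SPARQLWrapper are assumptions of the sketch.

    from SPARQLWrapper import SPARQLWrapper, JSON

    def tuple_fully_covered(ask_query,
                            endpoint="http://localhost:3030/kb/sparql"):
        """Step 1 of the annotation module: run a SPARQL ASK query against the
        KB endpoint; a True result means the tuple is fully covered and is
        annotated as correct (case (i)). The endpoint URL is hypothetical."""
        sparql = SPARQLWrapper(endpoint)
        sparql.setQuery(ask_query)
        sparql.setReturnFormat(JSON)
        result = sparql.query().convert()
        return bool(result.get("boolean", False))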
Step 2: Validation by KBs and Crowd. For each node (i.e., type) and edge (i.e., relationship) that is missing from K, the annotation module asks the crowd whether the relationship holds between the given two values. If the crowd says yes, the annotation module annotates the tuple as a correct tuple, jointly validated by the KB and the crowd (case (ii)). Otherwise, it is certain that this tuple contains errors (case (iii)).
Example 12: Consider tuple t2 (resp. t3) in Fig. 1 and the holistic pattern in (301), which has been validated. The information about whether Pretoria (resp. Madrid) is the capital of South Africa (resp. Italy) is not in the KB. To verify this information, we issue a boolean question Qt2 (resp. Qt3) to the crowd:
Qt2: Does South Africa hasCapital Pretoria?
o Yes
o No

Qt3: Does Italy hasCapital Madrid?
o Yes
o No
In this case, the crowd will say yes (resp. no) to question Qt2 (resp. Qt3).
Each question to which the crowd says yes (e.g., Qt2 in Example 12) yields a new fact that is not in the current KB. The annotation module collects all these facts and then enriches the given KB (207).
The annotation module may determine that a repair is required for the tuple and generates a list of candidate repairs. The enriched KB may be used to determine repairs.
We start by introducing two tools that are necessary to explain our approach to possible repair generation. Given a KB K and a holistic pattern G(V, E), an instance graph GI(VI, EI) is a graph with nodes VI and edges EI, such that (i) each node vI ∈ VI is a resource in K; (ii) each edge eI ∈ EI is a property in K; (iii) there is a one-to-one correspondence f from each node v ∈ V to a node vI ∈ VI, i.e., f(v) = vI; and (iv) for each edge (u, v) ∈ E, there is an edge (f(u), f(v)) ∈ EI with the same property.
An instance graph is one possible instantiation of a holistic pattern in a given KB.
Example 13: Figure 6, (601) and (602), shows two instance graphs of the holistic pattern of (301) in Yago, for players Andrea Pirlo and Xabi Alonso, respectively. Given an instance graph G, a tuple t, and a holistic pattern φ, the repair cost of aligning t to G w.r.t. φ, denoted cost(t, φ, G), is the number of value changes in t needed to align it with G.
The lower the repair cost, the closer the updated tuple is to the original tuple, and hence the more likely the repair is to be correct.
Example 14: Consider tuple t3 in Fig. 1, the holistic pattern φs in (301), and the two instance graphs G1 and G2 in (601) and (602). One can verify that the repair cost from t3 to G1 is 1, i.e., cost(t3, φs, G1) = 1, by updating t3[C] from Madrid to Rome. Similarly, the repair cost from t3 to G2 is 5, i.e., cost(t3, φs, G2) = 5.
Note that the possible repairs are ranked by repair cost in ascending order, e.g., the repair using G1 for t3 with cost 1 is ranked higher than the repair using G2 for t3 with cost 5. In this work, instead of using the top-1 repair to update a value, we provide top-k possible repairs and leave it to the users to pick which repair to apply. In the following, we describe embodiments that generate top-k repairs for every tuple identified as erroneous.
Naive Top-k Possible Repair Generation. In one aspect of the preferred embodiment, given a KB K and a holistic pattern φ, we compute offline all instance graphs G in K w.r.t. φ. For each tuple t, a naive solution is to compute the distance between t and each instance graph G in the set G. The k graphs with the smallest repair costs are returned as the top-k possible repairs.
Optimized Top-k Possible Repair Generation. In another aspect of the preferred embodiment, we only retrieve instance graphs that can yield reasonable repairs, i.e., the instance graphs whose values overlap with a given erroneous tuple.

We use inverted lists to perform the above optimization for top-k possible repairs. Each inverted list is a mapping from a key to a set G of instance graphs. Each key is a pair (A, a), where A is an attribute and a is a constant value; each instance graph G in the set G has value a on attribute A. For example, an inverted list w.r.t. G1 in (601) is: (country, Italy) → G1. For a tuple t, we optimize using the following steps.
Step 1. Find all possible repairs by retrieving all inverted lists related to t.
Step 2. Count the occurrences of instance graphs in all retrieved inverted lists.
Step 3. Return the top-k instance graphs with the highest occurrences.
Intuitively, Step 1 gets all possible repairs, since each instance graph in the retrieved inverted lists has some value overlap with the given tuple; moreover, every instance graph that has value overlap with t is indexed and is thus retrieved. Step 2 computes the repair cost of each possible repair, and Step 3 returns the top-k possible repairs.
Example 15: Consider t3 in Fig. 1 and pattern φs in (301). The inverted lists retrieved are given below.

(A, Andrea Pirlo) → G1
(B, Italy) → G1
(C, Madrid) → G2
(D, Juventus) → G1
(E, Italian) → G1
(F, Flero) → G1

It is easy to see that the number of occurrences of instance graph G1 is 5 and of G2 is 1. In other words, the repair cost of t3 w.r.t. G1 is 6 − 5 = 1, while the repair cost of t3 w.r.t. G2 is 6 − 1 = 5. Hence, the top-2 possible repairs for t3 are G1 and G2.
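The inverted-list optimization can be sketched in Python as follows; representing instance graphs simply as attribute-to-value dictionaries is an assumption of this sketch.

    from collections import Counter, defaultdict

    def build_index(instance_graphs):
        """Index every instance graph under each of its (attribute, value)
        pairs; instance_graphs maps a graph id to its {attribute: value}
        assignment."""
        index = defaultdict(set)
        for gid, assignment in instance_graphs.items():
            for key in assignment.items():
                index[key].add(gid)
        return index

    def top_k_repairs(t, index, k):
        """Steps 1-3: retrieve the lists keyed by t's values (Step 1), count
        graph occurrences (Step 2), and return the k graphs with the lowest
        repair cost, i.e., |t| minus the overlap (Step 3)."""
        overlap = Counter()
        for key in t.items():
            for gid in index.get(key, ()):
                overlap[gid] += 1
        ranked = sorted((len(t) - c, gid) for gid, c in overlap.items())
        return ranked[:k]

    # Example 15: with G1 overlapping t3 on 5 of 6 attributes and G2 on 1,
    # top_k_repairs returns [(1, 'G1'), (5, 'G2')].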
The annotation module hence outputs two possible repairs for t3: G1 and G2.

Experimental Study
The study demonstrates the effectiveness of the preferred embodiment of the present invention using holistic patterns, and shows the efficiency of the holistic pattern and validation modules. Furthermore, the study demonstrates the effectiveness of data annotation and its use to repair and clean a dataset. In the experiments we use three datasets:
1. Country contains countries and capitals with currency and language, obtained from a Wikipedia page (http://en.wikipedia.org/wiki/List_of_countries_and_capitals_with_currency_and_language);
2. Soccer contains information about soccer players and their clubs, scraped from the Web (http://www.premierleague.com/en-gb.html, http://www.legaseriea.it/en/); and
3. University contains all universities in the US with their addresses (http://ope.ed.gov/accreditation/GetDownLoadFile.aspx).
Knowledge bases. We used two KBs, Yago and DBpedia, transformed to Jena format (https://jena.apache.org/). We ran all datasets on both KBs, and chose for each dataset the KB with the better coverage: DBpedia was used for soccer and university, and Yago was used for country.

Pattern determination
We use three different embodiments to determine holistic patterns. The first embodiment, denoted Support, ranks the candidate types and relationships solely on their support scores. The second embodiment, denoted MaxLike, infers the type of a column and the relationship between a column pair separately, using a maximum likelihood approach. The third embodiment is the preferred embodiment, denoted RankJoin, described in Methods 1 to 4. We denote by RankJoinBase the rank-join method without the type pruning optimization presented in Method 2.
Gold standard. For a given table T and a given KB K, we retrieved all candidate types and relationships by issuing SPARQL queries, and manually annotated each candidate as vital, okay, or incorrect. For example, for column B in Fig. 1, country is vital; economy is okay; and state is incorrect.

Evaluation. We compared the top-k holistic patterns returned by each embodiment with the labelled candidates in the gold standard. For MaxLike and Support, top-k holistic patterns were constructed based on the maximum likelihood scores and the support scores, respectively. To compute precision, a candidate type or relationship has a score of 1.0 if it was labelled vital; 0.5 if it was labelled okay; and 0 otherwise.
As shown in Fig. 7 (701 to 703), Support has the lowest precision, because it looks for the candidates that cover the largest number of tuples, which tend to be overly general types or relationships, such as the root type owl:Thing in Yago. MaxLike does better than Support because it maximizes the likelihood of the values given the candidates. RankJoin is better still, because it takes into account the coherence scores between types and relationships and seeks the most coherent patterns.
As shown in Fig. 7 (704), RankJoin, MaxLike and Support have similar running times, since the dominating computation is retrieving candidates from the KBs, which is common to all embodiments. RankJoinBase is quite expensive, since it needs to enumerate holistic patterns despite the advantage of the rank join, which shows that type pruning is extremely beneficial.
It is evident that the more expressive the pattern language is, the more errors it can possibly capture. On the other hand, allowing more expressive patterns may incur a high cost in pattern discovery and data annotation.
The holistic pattern module may produce holistic pattern variants depending on the cost required; either one or all holistic pattern variants may be considered as aspects of the present invention. We use the second aspect in the preferred embodiment. In a first aspect, a first type of pattern contains only the type of each column, denoted Type1.
In a second aspect, a second type of pattern, used in the embodiments above, contains the type of each column and the relationship of every column pair, denoted Type2.
In a third aspect, a third type of pattern also considers indirect relationships between a column pair, denoted Type3. As an example of the third type of pattern, consider a table with two columns A and B, representing pairs of actors that have played in the same movie. The pattern connects Column A of type Actor and Column B of type Actor via an intermediate node X of type Movie.
We measured the running time for generating candidate patterns, as well as the size of the patterns. Pattern size is defined as the sum of the number of candidate types, the number of candidate relationships, and the number of candidate indirect relationships above a minimum support threshold (0.1). Patterns with a larger size have richer semantics, and are thus more powerful in capturing errors.
(Table: running time and pattern size for Type1, Type2 and Type3 on the three datasets.)
The above measurements show that our current second aspect allows us to capture the semantics of the table and to validate and repair the data efficiently. Moreover, more expressive pattern languages, such as those allowing indirect relationships between two columns in the pattern, considerably increase the cost of pattern discovery without much gain in capturing more errors in the data. Furthermore, validating data using Type3 is much harder for the crowd (e.g., "does there exist a movie that actor 1 and actor 2 are both in?").
Validation module
Methods. To evaluate the savings in crowdsourcing pattern validation achieved by our scheduling method, we compared our method (denoted MUVF, short for most-uncertain-variable-first) to a baseline (denoted AVI) that validates every variable independently.
Crowd. We validated the country and university datasets using an internal crowd, i.e., a group of researchers from QCRI and the University of Waterloo; and we validated soccer using Amazon Mechanical Turk (MTurk), a widely used crowdsourcing platform. Each question was asked three times, and the majority answer was taken.
We required at least 10% of tuples to validate a type or a relationship. For each dataset, we compared the number of variables that needed to be validated until only one holistic pattern was left. We also report the total number of questions each method asked to validate all variables.
(Table: number of validated variables and total questions for MUVF and AVI on each dataset.)
Considering the above data, MUVF performs consistently better than AVI in terms of the number of questions needed, for two reasons. First, MUVF may reduce the number of variables to validate, thus saving a considerable number of questions (e.g., for country and soccer), since validating each variable requires a number of questions linear in the number of tuples. Second, even when the number of variables to validate is the same, MUVF still wins (e.g., for university), since after validating certain variables, other variables become less uncertain, thus requiring fewer questions to reach the minimum support threshold. Figure 8 (801 to 803) shows the validated patterns for the three datasets.
Annotation module
We evaluated data annotation in three aspects: information covered by KBs, accuracy of data validation via crowdsourcing, and effectiveness of possible repairs.
Data validated via KBs. Figure 9 shows the data coverage of the KBs for the three datasets. For each dataset, the left bar in Fig. 9 (901) depicts the percentage of tuples that are fully/partially covered by the KBs. It shows that for country, about 40% of tuples are fully covered; for soccer and university, only around 5% of tuples are fully covered. For each dataset, the right bar in Fig. 9 (901) depicts the percentage of relationships that are covered by the KBs. For example, for a tuple t and the pattern in Fig. 8 (801), if only two relationships appear in a KB, its coverage on relationships is 2/3. This shows that although for some datasets (e.g., soccer and university) only a small number of tuples are fully covered, around 50% of the relationships between tuples are covered.
Figure 9 (902) complements the above experiment by showing, at a finer granularity, the percentage of tuples that are partially covered by the KBs. For example, the bar for number 0 (resp. 2) for country means that, of all tuples that are partially covered by the KBs, 30% (resp. 50%) have 0 relationships (resp. 2 relationships) covered. This figure shows that for most tuples the number of relationships covered by the KBs is good, e.g., 50% of partially covered tuples have 2 out of 3 relationships covered by the KBs.
             True relationships   False relationships
country      98%                  88%
soccer       97%                  100%
university   99%                  100%
Data validated via KBs and crowd. In order to quantify data validation using the crowd, we did it in a controlled way, i.e., we needed to know the ground truth to evaluate the reliability of the crowd. We first manually examined each dataset to identify 37 tuples for country, 67 tuples for soccer, and 65 tuples for university that are ensured to be correct. We then randomly picked 100 relationships that are ensured to be correct for each dataset, shown as "True relationships" in the above table. Moreover, we injected errors from the active domain into these selected tuples, and picked 100 questions for each dataset that are ensured to be wrong, shown as "False relationships" in the above table. We mixed the two categories of questions and posted them to our internal crowd. The results in the above table show that an internal crowd is good at answering questions containing both true and false relationships.
Possible repairs. This experiment validates the effectiveness of the possible repairs. We first generated top-k repairs for each dirty tuple. If the ground truth of a dirty tuple falls within its top-k repairs, we consider these possible repairs correct. Here, recall is defined as the number of correct possible repairs over the number of dirty tuples. The results for the three datasets are given in Fig. 10, where the x-axis is the parameter k and the y-axis is the recall. For example, when k = 1, 60% of university tuples' ground truth is among the generated possible repairs. Figure 10 shows that even when k is small (e.g., 3), the preferred embodiment generates possible repairs with good recall (e.g., 90% for country, 80% for university, and 40% for soccer).
The present invention is a data cleaning system using knowledge bases and crowdsourcing. Correspondence between the possibly dirty database and the KBs is established by the modules that determine and validate the holistic patterns. Each tuple in the database is then verified using a holistic pattern against a KB, with possible crowd involvement when the KB has no coverage. Experimental results demonstrate that the present invention is able to detect data errors, suggest high-quality possible repairs, and repair erroneous tuples.
Embodiments of the present invention also encompass systems incorporating a database stored on a tangible computer readable medium which is coupled to the system by a communicative coupling (such as a computer network (e.g. the internet or a local area network)). The system may, therefore, include a network interface.
When used in this specification and claims, the terms "comprises" and
"comprising" and variations thereof mean that the specified features, steps or integers are included. The terms are not to be interpreted to exclude the presence of other features, steps or components.
The features disclosed in the foregoing description, or the following claims, or the accompanying drawings, expressed in their specific forms or in terms of a means for performing the disclosed function, or a method or process for attaining the disclosed result, as appropriate, may, separately, or in any combination of such features, be utilised for realising the invention in diverse forms thereof. Techniques for Implementing Aspects of Embodiments of the Invention:
[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[2] P. Bohannon, W. Fan, M. Flaster, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD, 2005.
[3] A. Calì, G. Gottlob, and A. Pieris. Advanced processing for ontological queries. PVLDB, 2010.
[4] F. Chiang and R. J. Miller. Discovering data quality rules. PVLDB, 2008.
[5] X. Chu, I. F. Ilyas, and P. Papotti. Discovering denial constraints. PVLDB, 2013.
[6] X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, 2013.
[7] G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: Consistency and accuracy. In VLDB, 2007.
[8] D. Deng, Y. Jiang, G. Li, J. Li, and C. Yu. Scalable column concept determination for web tables using large knowledge bases. PVLDB, 2013.
[9] O. Deshpande, D. S. Lamba, M. Tourn, S. Das, S. Subramaniam, A. Rajaraman, V. Harinarayan, and A. Doan. Building, maintaining, and using knowledge bases: a report from the trenches. In SIGMOD Conference, 2013.
[10] W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Interaction between record matching and data repairing. In SIGMOD, 2011.
[11] W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. VLDB J., 21(2), 2012.
[12] W. Fan. Dependencies revisited for improving data quality. In PODS, 2008.
[13] I. F. Ilyas, W. G. Aref, and A. K. Elmagarmid. Supporting top-k join queries in relational databases. VLDB J., 13(3), 2004.
[14] I. F. Ilyas, V. Markl, P. J. Haas, P. Brown, and A. Aboulnaga. CORDS: Automatic discovery of correlations and soft functional dependencies. In SIGMOD, 2004.
[15] K. Kerr, T. Norris, and R. Stockdale. Data quality information and decision making: a healthcare case study. In ACIS, 2007.
[16] G. Limaye, S. Sarawagi, and S. Chakrabarti. Annotating and searching web tables using entities, types and relationships. PVLDB, 2010.
[17] A. Poggi, D. Lembo, D. Calvanese, G. De Giacomo, M. Lenzerini, and R. Rosati. Linking data to ontologies. J. Data Semantics, 10, 2008.
[18] V. Raman and J. M. Hellerstein. Potter's Wheel: An interactive data cleaning system. In VLDB, 2001.
[19] B. Trushkowsky, T. Kraska, M. J. Franklin, and P. Sarkar. Crowdsourced enumeration queries. In ICDE, 2013.
[20] P. Venetis, A. Y. Halevy, J. Madhavan, M. Pasca, W. Shen, F. Wu, G. Miao, and C. Wu. Recovering semantics of tables on the web. PVLDB, 2011.
[21] M. White. Enterprise information portals. The Electronic Library, 18(5), 2000.
[22] M. Yakout, L. Berti-Equille, and A. K. Elmagarmid. Don't be SCAREd: use SCalable Automatic REpairing with maximal likelihood and bounded changes. In SIGMOD, 2013.
[23] M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. PVLDB, 2011.
[24] C. J. Zhang, L. Chen, H. V. Jagadish, and C. C. Cao. Reducing uncertainty of schema matching via crowdsourcing. PVLDB, 2013.

Claims

1. A computer implemented method for cleaning a database instance using a plurality of holistic patterns, the database instance comprising a plurality of dirty tuples with unknown attribute data types, the method comprising:
generating a plurality of candidate holistic patterns using the database instance and a knowledge base, the knowledge base comprising data-types and data-type relationships;
determining a valid holistic pattern from the plurality of candidate holistic patterns using at least one of: the knowledge base; and a crowd of users which validate the data-types and the data-type relationships;
annotating tuples in the database instance using the valid holistic pattern, wherein the method annotates the tuples with annotations indicating at least one of: knowledge base validated; jointly validated, wherein the crowd of users at least partially validates the holistic pattern; or erroneous; and
repairing the erroneous annotated tuples to generate a clean database instance.
2. The method of claim 1 further comprising: matching data-types in the knowledge base to unknown attribute data-types in the database instance using the attribute values.

3. The method of claim 2 further comprising: assigning a binary relationship value between the matched attribute data-types.
4. The method of any preceding claim wherein the method further comprises updating the knowledge base using at least one of: the valid holistic pattern; and the crowd of users.
5. The method of claim 1 wherein generating candidate holistic patterns comprises determining support values of a candidate type and a candidate relationship and ranking the candidate holistic patterns using the support values.
6. The method of claim 5 wherein the plurality of holistic patterns further comprise determining coherence values using the support values.
7. The method of claim 6 wherein the holistic pattern module further comprises determining a score of the candidate holistic pattern.
8. The method of claim 7 wherein the score of the candidate holistic pattern is determined offline.
9. The method of any preceding claim further comprising dynamically generating questions to ask the crowd of users using the plurality of candidate holistic patterns.
10. The method of claim 9 further comprising aggregating answers to the questions and determining the valid holistic pattern.

11. The method of any preceding claim further comprising:
generating a plurality of instance graphs of the erroneous tuples using the knowledge base and the valid holistic pattern; and
determining a repair cost of the erroneous tuple based on each instance graph and valid holistic pattern combination.
12. The method of claim 11 wherein the method comprises ranking candidate repair tuples computed from an inverted repair list based on repair cost.

13. A tangible machine readable storage medium comprising instructions that, when executed, cause a machine to perform the method of any one of claims 1 to 12.
14. A system for cleaning a database instance using a plurality of holistic patterns, the database instance comprising a plurality of dirty tuples with unknown attribute data types, the system comprising:
a holistic pattern module configured to generate a plurality of candidate holistic patterns using the database instance and a knowledge base, the knowledge base comprising the data-types and data-type relationships;
a validation module configured to determine a valid holistic pattern from the plurality of candidate holistic patterns using at least one of: the knowledge base; and a crowd of users which validate the data-types and the data-type relationships;
an annotation module configured to annotate tuples in the database instance using the valid holistic pattern, wherein the annotated tuples indicate at least one of: knowledge base validated; jointly validated, wherein the crowd of users at least partially validates the holistic pattern; or erroneous, and
a repair module operable to generate a clean database instance using the annotated erroneous tuples.
15. The system of claim 14 further comprising: the holistic pattern module operable to match data-types in the knowledge base to unknown attribute data-types in the database instance using the attribute values.
16. The system of claim 15 further comprising: the holistic pattern module operable to assign a binary relationship value between the matched attribute data-types.
17. The system of any one of claims 14 to 16 wherein the system is further operable to update the knowledge base using at least one of: the valid holistic pattern; and the crowd of users.
18. The system of claim 14 wherein the holistic pattern module is operable to generate candidate holistic patterns having a support value for: candidate data type; and candidate relationship, and ranking the candidate holistic patterns using the support values.
19. The system of claim 18 wherein the plurality of holistic patterns further comprise determining coherence values using the support values.
20. The system of claim 19 wherein the holistic pattern module further comprises determining a score of the candidate holistic pattern.
21 . The system of claim 20 wherein the score of the candidate holistic pattern is determined offline.
22. The system of any one of claims 14 to 21 further comprising the validation module operable to dynamically generate questions to ask the crowd of users using at least one of: the plurality of candidate holistic patterns; and the data instance.

23. The system of claim 22 further comprising aggregating answers to the questions and determining the valid holistic pattern.

24. The system of any one of claims 14 to 22 further comprising:
generating a plurality of instance graphs of the erroneous tuples using the knowledge base and the valid holistic pattern; and
determining a repair cost of the erroneous tuple based on each instance graph and valid holistic pattern combination.
25. The system of claim 24 wherein the system comprises ranking candidate repair tuples in an inverted repair list based on repair cost.
PCT/GB2014/051670 2014-05-30 2014-05-30 Data cleaning methods and systems WO2015181511A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/GB2014/051670 WO2015181511A1 (en) 2014-05-30 2014-05-30 Data cleaning methods and systems


Publications (1)

Publication Number Publication Date
WO2015181511A1 true WO2015181511A1 (en) 2015-12-03

Family

ID=50942701


Country Status (1)

Country Link
WO (1) WO2015181511A1 (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120323866A1 (en) * 2011-02-28 2012-12-20 International Machines Corporation Efficient development of a rule-based system using crowd-sourcing
GB2502768A (en) * 2012-04-12 2013-12-11 Qatar Foundation Correcting database errors



Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107203593A (en) * 2017-04-26 2017-09-26 深圳市华傲数据技术有限公司 A kind of data cleaning method and system
US10585933B2 (en) 2017-08-16 2020-03-10 International Business Machines Corporation System and method for classification of low relevance records in a database using instance-based classifiers and machine learning
US11227003B2 (en) 2017-08-16 2022-01-18 International Business Machines Corporation System and method for classification of low relevance records in a database using instance-based classifiers and machine learning
US11200452B2 (en) 2018-01-30 2021-12-14 International Business Machines Corporation Automatically curating ground truth data while avoiding duplication and contradiction
CN110543634A (en) * 2019-09-02 2019-12-06 北京邮电大学 corpus data set processing method and device, electronic equipment and storage medium
CN110543634B (en) * 2019-09-02 2021-03-02 北京邮电大学 Corpus data set processing method and device, electronic equipment and storage medium


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 14730180; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 14730180; Country of ref document: EP; Kind code of ref document: A1)