WO2015181511A1 - Data cleaning methods and systems - Google Patents

Data cleaning methods and systems

Info

Publication number
WO2015181511A1
Authority
WO
WIPO (PCT)
Prior art keywords
holistic
candidate
pattern
data
patterns
Application number
PCT/GB2014/051670
Other languages
French (fr)
Inventor
Nan TANG
Mourad Ouzzani
Paolo Papotti
Ihab Francis Ilyas Kaldas
Xu Chu
Original Assignee
Qatar Foundation
Hoarton, Lloyd
Application filed by Qatar Foundation, Hoarton, Lloyd filed Critical Qatar Foundation
Priority to PCT/GB2014/051670 priority Critical patent/WO2015181511A1/en
Publication of WO2015181511A1 publication Critical patent/WO2015181511A1/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Definitions

  • The preferred embodiment of the present invention takes as input to the holistic pattern module a table T, a KB K, and a number k, and produces top-k holistic patterns.
  • The preferred embodiment uses Method 1.
  • Input: a table T, a KB K, and a number k.
  • Each input list contains either the candidate types for a column or the candidate relationships for a column pair, ranked by support.
  • Two lists are joinable if they agree on one common column, and the score for joining is the coherence score between a type and a relationship.
  • The rank-join algorithm scans the ranked input lists in descending order of their support scores (lines 3-4); holistic patterns are generated incrementally as we move down the input lists. Holistic patterns that cannot be used to produce top-k patterns are pruned (lines 5-6). For each join result, i.e., each holistic pattern φ, the score score(φ) is computed (lines 7-8). We also maintain an upper bound B on the scores of all unseen join results, i.e., unseen holistic patterns (line 10). We terminate the join process if either we have exhaustively scanned every input list, or we have obtained top-k holistic patterns and the score of the k-th holistic pattern is greater than or equal to B (line 11). Since each list is ranked, B can be computed by adding up the support scores at the current positions in the ranked lists, plus the maximum coherence score the relationships involved can achieve.
  • Lines 5-6 in Method 1 check whether a candidate type Tt for column Ai can be pruned, without generating holistic patterns involving Tt, by calling Method 2.
  • Tt is useful if it is more coherent with some relationship P than the previously examined types for Ai.
  • Tt can be pruned if we are sure that the maximum sum of coherence scores it can achieve is still smaller than the current smallest sum of coherence scores involving column Ai (lines 3-6).
  • Example 8: Consider the rank-join graph in Fig. 5 for a table with just two columns B and C, as in (100). Support scores are shown in braces.
  • Method 1 is guaranteed to produce the top-k holistic patterns, since we keep the current top-k patterns in P and terminate only when we are sure that no new holistic pattern with a higher score can be produced. In the worst case, we still have to exhaustively go through all the ranked lists to produce the top-k holistic patterns. However, in most cases the top ranked holistic patterns involve only candidate types/relationships with high support, which are at the top of the lists. Since computing the coherence score for a type and a relationship is an expensive operation that requires set intersection and set union, we compute offline the coherence score for every type and every relationship in K. For each relationship, we also keep the maximum coherence score it can achieve with any type, to efficiently compute the bound B. A simplified sketch of this rank-join loop follows.
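A simplified sketch of the loop, assuming two ranked candidate lists and a single precomputed coherence function (the full Method 1 joins type lists with relationship lists and applies the Method 2 pruning; all names here are illustrative, not the patented implementation):

import heapq

def rank_join_topk(list_a, list_b, coherence, max_coh, k):
    # list_a, list_b: candidates as (support, name), sorted by support
    # descending. coherence(a, b) scores a join result; max_coh bounds
    # the coherence of any pair not yet seen.
    topk = []                                  # min-heap of (score, pair)
    for d in range(max(len(list_a), len(list_b))):
        # Join the new row at depth d against every row seen so far.
        pairs = [(d, j) for j in range(min(d + 1, len(list_b)))]
        pairs += [(i, d) for i in range(min(d, len(list_a)))]
        for i, j in pairs:
            if i >= len(list_a) or j >= len(list_b):
                continue
            sa, a = list_a[i]
            sb, b = list_b[j]
            heapq.heappush(topk, (sa + sb + coherence(a, b), (a, b)))
            if len(topk) > k:
                heapq.heappop(topk)            # keep only the best k
        # Upper bound B on any unseen join result: supports at the
        # current list positions plus the best achievable coherence.
        bound = (list_a[min(d, len(list_a) - 1)][0]
                 + list_b[min(d, len(list_b) - 1)][0] + max_coh)
        if len(topk) == k and topk[0][0] >= bound:
            break                              # early termination
    return sorted(topk, reverse=True)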
  • Once the top-k candidate holistic patterns are generated, they are forwarded to the validation module to be validated.
  • The validation module takes the set P of candidate holistic patterns, a table T, a KB K, and a crowdsourcing framework (e.g., MTurk) as input, and identifies the most appropriate holistic pattern for T w.r.t. K, with the objective of minimizing the number of questions to the crowd.
  • A naive approach to generating crowdsourcing questions is to express the candidate holistic patterns directly as questions to the crowd, who would then select the best one.
  • However, crowd workers can only answer simple questions, and holistic pattern graphs can be hard for crowd users to validate (e.g., 301).
  • We therefore use two simpler kinds of questions: column type validation, i.e., validating the type of a column; and column pair relationship validation, i.e., validating the relationship between two columns.
  • CrowdTypeVal takes as input a table T and a set of candidate types candT(Ai) for a column Ai, and outputs the most appropriate type for this column according to the crowd. Since we require at least a θ percentage of tuples to support a type, the algorithm first generates ⌈θ·n/kt⌉ questions asking about the type of column Ai over θ·n tuples, where n is the number of tuples in T and kt is the number of tuples shown in one question (e.g., kt = 3) (lines 1-2). Votes for each candidate type are collected (lines 3-5), and the candidate type Tt with the maximum number of votes is recorded (line 6).
  • If Tt's votes do not exceed the threshold θ, additional questions are issued for the remaining set of tuples that have not been asked, until Tt's votes exceed θ or there are no more tuples left (lines 8-13). If Tt reaches the minimum vote threshold, it is selected as the type for Ai. A loose sketch of this batching and vote-threshold loop is shown below.
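A loose sketch of the batching and vote-threshold logic (ask_crowd is a hypothetical crowdsourcing callback, and the vote accounting is simplified relative to the full CrowdTypeVal):

import math
from collections import Counter

def crowd_type_val(tuples, column, cand_types, ask_crowd, theta=0.5, kt=3):
    # ask_crowd(values, cand_types) shows up to kt tuple values to a
    # worker and returns the candidate type the worker picks.
    n = len(tuples)
    need = math.ceil(theta * n)      # minimum supporting votes
    votes = Counter()
    asked = 0
    while asked < n:
        batch = [t[column] for t in tuples[asked:asked + kt]]
        votes[ask_crowd(batch, cand_types)] += len(batch)
        asked += len(batch)
        best, best_votes = votes.most_common(1)[0]
        if best_votes >= need:
            return best              # threshold theta reached
    return None                      # no candidate type reached theta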
  • Candidate types and candidate relationships are stored as URIs in KBs; these are not directly consumable by the crowd.
  • For example, the type Capital for column C is stored as http://yago-knowledge.org/resource/wordnet_capital_10851850, and the relationship hasCapital for column pair B and C is stored as http://yago-knowledge.org/resource/hasCapital.
  • We obtain human-readable type and relationship descriptions, e.g., Capital and hasCapital, by querying the KB for the labels of the corresponding URIs. If no label exists, we process the URI itself by removing the text before the last slash and punctuation symbols.
  • Example 9: The five candidate holistic patterns over columns B and C, with their scores and probabilities:

        pattern   type(B)   type(C)   P(B,C)       score   Prob
        φ1        country   Capital   hasCapital   2.8     0.35
        φ2        economy   Capital   hasCapital   2       0.25
        φ3        country   City      locatedIn    2       0.25
        φ4        country   Capital   locatedIn    0.8     0.1
        φ5        state     Capital   hasCapital   0.4     0.05
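  • The Prob column appears to be each pattern's score normalized over the five candidates, e.g., Prob(φ1) = 2.8 / (2.8 + 2 + 2 + 0.8 + 0.4) = 2.8 / 8.0 = 0.35; the remaining probabilities 0.25, 0.25, 0.1 and 0.05 follow in the same way.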
  • Theorem 1: The expected uncertainty reduction after validating a column (or column pair) v is the same as the entropy of the variable.
  • Method 4 describes the overall procedure for pattern validation. At each iteration: (1) we choose the best variable vbest to validate next, based on the expected reduction of uncertainty (lines 4-10); (2) we remove from Pre those holistic patterns that have a different assignment for the variable than the validated value a (lines 11-12); and (3) we renormalize the probability distribution of the remaining holistic patterns in Pre (line 13). We terminate when we are left with only one holistic pattern (line 3). A sketch of this entropy-guided loop is given below.
  • Example 10: To validate the five patterns in Example 9, we first calculate the entropy of every variable; e.g., for the type variable of column C, H(vC) = 0.81.
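A minimal sketch of this entropy-guided loop (ask_crowd is again a hypothetical callback; patterns carries the assignments and probabilities from the table above). On that table, the marginal distribution for the type of column C is {Capital: 0.75, City: 0.25}, whose entropy is about 0.81, matching Example 10:

import math
from collections import defaultdict

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def validate_patterns(patterns, ask_crowd):
    # patterns: {id: (assignment, prob)}, where assignment maps each
    # variable (a column type or a column-pair relationship) to its
    # value in that pattern.
    while len(patterns) > 1:
        # Marginal value distribution of every variable; by Theorem 1
        # its entropy equals the expected uncertainty reduction.
        marg = defaultdict(lambda: defaultdict(float))
        for assign, prob in patterns.values():
            for var, val in assign.items():
                marg[var][val] += prob
        v_best = max(marg, key=lambda v: entropy(marg[v]))
        answer = ask_crowd(v_best, list(marg[v_best]))
        # Drop patterns inconsistent with the answer and renormalize.
        patterns = {pid: (a, p) for pid, (a, p) in patterns.items()
                    if a[v_best] == answer}
        total = sum(p for _, p in patterns.values())
        patterns = {pid: (a, p / total)
                    for pid, (a, p) in patterns.items()}
    (assign, _), = patterns.values()
    return assign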
  • The validated holistic pattern is used to annotate tuples in the data set. Annotated tuples may then be repaired where appropriate.
  • The annotation module annotates tuples into three categories: (i) correct data validated by KBs; (ii) correct data jointly validated by KBs and the crowd; (iii) erroneous data.
  • The annotation module uses the following steps to annotate the tuples in the dataset. We use the table in Fig 1 in the examples that follow.
  • Step 1: Validation by KBs.
  • A SPARQL query checks whether t is fully covered by K. If it is fully covered, the annotation module annotates it as a correct tuple validated by the KB (case (i)). Otherwise, it goes to Step 2.
  • Example 11: Consider tuple t in Fig. 1 and the holistic pattern in (301); a boolean SPARQL query is run to check whether t is fully covered.
  • Step 2: Validation by KBs and Crowd.
  • The annotation module asks the crowd whether the relationship holds between the given two values. If the crowd says yes, the annotation module annotates the tuple as a correct tuple, jointly validated by the KB and the crowd (case (ii)). Otherwise, it is certain that there exist errors in this tuple (case (iii)).
  • Example 12: Consider tuple t2 (resp. t3) in Fig. 1 and the holistic pattern in (301); the crowd will say yes (resp. no) to the corresponding question.
  • The annotation module collects all such new facts and then enriches the given KB (207).
  • For an erroneous tuple, the annotation module may determine that a repair is required and generates a list of candidate repairs.
  • The enriched KB may be used to determine repairs.
  • An instance graph is one possible instantiation of a holistic pattern in a given KB.
  • Example 13: In Figure 6, (601) and (602) are two instance graphs of the holistic pattern of (301) in Yago, for players Andrea Pirlo and Xabi Alonso, respectively.
  • The repair cost of aligning t to an instance graph G w.r.t. φ, denoted cost(t, φ, G), is the number of value changes in t needed to align it with G.
  • Example 14: Consider tuple t3 in Fig. 1, the holistic pattern φs in (301), and the two instance graphs G1 and G2 in (601) and (602). The repair cost from t3 to G2 is 5.
  • The possible repairs are ranked by repair cost in ascending order; e.g., the repair using G1 for t3, with cost 1, is ranked higher than the repair using G2 for t3, with cost 5.
  • Instead of using the top-1 repair to update a value, we provide top-k possible repairs and leave it to the users to pick which repair to apply.
  • Naive Top-k Possible Repair Generation: In one aspect of the preferred embodiment, given a KB K and a holistic pattern φ, we compute offline all instance graphs G in K w.r.t. φ. For each tuple t, a naive solution is to compute the distance between t and each instance graph G in the set G. The k graphs with the smallest repair cost are returned as the top-k possible repairs.
  • Optimized Top-k Possible Repair Generation: In another aspect of the preferred embodiment, we only retrieve instance graphs that can yield reasonable repairs, i.e., the instance graphs whose values overlap with the given erroneous tuple.
  • Each inverted list is a mapping from a key to a set G of instance graphs.
  • Each key is a pair (A, a), where A is an attribute and a is a constant value.
  • Each instance graph G in the set has value a on attribute A.
  • For example, an inverted list w.r.t. G1 in (601) is: (country, Italy) → G1.
  • Step 1: Find all possible repairs by retrieving all inverted lists related to t.
  • Step 2: Count the occurrences of instance graphs across all retrieved inverted lists.
  • Step 3: Return the top-k instance graphs with the highest occurrence counts as the top-k possible repairs.
  • Step 1 gets all possible repairs, since each instance graph in a retrieved inverted list has some value overlap with the given tuple; moreover, every instance graph that has value overlap with t is indexed and is thus retrieved. Step 2 in effect computes the repair cost of each possible repair (more shared values means fewer changes needed), and Step 3 returns the top-k possible repairs. A sketch of these three steps is given below.
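A compact sketch of the three steps (illustrative names; the inverted index maps (attribute, value) keys to instance-graph ids as described above):

from collections import Counter

def topk_repairs(t, inverted_index, k, num_attrs):
    # A graph sharing more values with t needs fewer changes, so
    # repair cost = num_attrs - number of overlapping values.
    occurrences = Counter()
    for attr, value in t.items():                        # Step 1
        for g in inverted_index.get((attr, value), ()):
            occurrences[g] += 1                          # Step 2
    return [(g, num_attrs - c)                           # Step 3
            for g, c in occurrences.most_common(k)]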
  • Example 15: Consider t3 in Fig. 1 and pattern φs in (301). The inverted lists retrieved include: (A, Andrea Pirlo) → G1.
  • The annotation module hence outputs two possible repairs for t3: G1 and G2.
  • The study demonstrates the effectiveness of the preferred embodiment of the present invention using holistic patterns, and shows the efficiency of the holistic pattern and validation modules. Furthermore, the study demonstrates the effectiveness of data annotation and its use to repair and clean a dataset. In the experiments we use three datasets:
  • Country contains countries and capitals with currency and language, obtained from a Wikipedia page (http://en.wikipedia.org/wiki/List_of_countries_and_capitals_with_currency_and_language);
  • The first embodiment, denoted Support, ranks the candidate types and relationships by their support scores.
  • The second embodiment, denoted MaxLike, infers the type of a column and the relationship between a column pair separately, using a maximum likelihood approach.
  • The third embodiment is the preferred embodiment, denoted RankJoin, given in Methods 1 to 4. We denote by RankJoinBase the rank-join method without the type pruning optimization presented in Method 2.
  • Gold standard: For a given table T and a given KB K, we retrieved all candidate types and relationships by issuing SPARQL queries, and manually annotated each candidate as vital, okay, or incorrect. For example, for column B in Fig. 1, country is vital; economy is okay; while state is incorrect.
  • Evaluation: We compared the top-k holistic patterns returned by each embodiment with the labelled candidates in the gold standard. For MaxLike and Support, top-k holistic patterns were constructed based on the maximum likelihood scores and the support scores, respectively. To compute precision, a candidate type or relationship has a score of 1.0 if it was labelled vital; 0.5 if it was labelled okay; and 0 otherwise.
  • Support has the lowest precision because it looks for the candidates that cover the largest number of tuples, which may be overly general types or relationships, such as the root type owl:Thing in Yago.
  • MaxLike does better than Support because it maximizes the likelihood of the values given the candidates.
  • RankJoin is even better than MaxLike because it takes into account the coherence scores between types and relationships, and seeks the most coherent patterns.
  • RankJoin, MaxLike and Support have similar running times, since the dominating computation is retrieving candidates from the KBs, which is common to all embodiments.
  • RankJoinBase is quite expensive, since it needs to enumerate holistic patterns despite the advantage of rank join, showing that type pruning is extremely beneficial.
  • The holistic pattern determined by the holistic pattern module may have variants, depending on the cost required. Either one or all of the holistic pattern variants may be considered as aspects of the present invention; we use the second variant in the preferred embodiment.
  • A first type of pattern contains only the type of each column, denoted Type1.
  • A second type of pattern, used in the embodiments above, contains the type of each column and the relationship of every column pair, denoted Type2.
  • A third type of pattern additionally considers indirect relationships between a column pair, denoted Type3. For example, consider a table with two columns A and B representing pairs of actors who have played in the same movie before; A and B are related only indirectly, through the movie.
  • Pattern size is defined as the sum of the number of candidate types, the number of candidate relationships, and the number of candidate indirect relationships above a minimum support threshold (0.1). Patterns of larger size have richer semantics and are thus more powerful in capturing errors.
  • MUVF performs consistently better than AVI in terms of the number of questions needed, for two reasons. First, MUVF may reduce the number of variables to validate, thus saving a considerable number of questions, e.g., for country and soccer, since validating each variable requires a number of questions linear in the number of tuples.
  • Figure 9 shows data coverage of KBs for three datasets.
  • The left bar in Fig. 9 (901) depicts the percentage of tuples that are fully/partially covered by KBs. It shows that for country, about 40% of tuples are fully covered; for soccer and university, only around 5% of tuples are fully covered.
  • The right bar in Fig. 9 (901) depicts the percentage of relationships that are covered by KBs.
  • For a tuple t and the pattern in Fig. 8 (801), if only two of its relationships appear in a KB, its coverage on relationships is 2/3. This tells us that although for some datasets (e.g., soccer and university) only a small number of tuples are fully covered, around 50% of the relationships between tuples are covered.
  • Figure 9 complements the above experiment by showing, at a finer granularity, the percentage of tuples that are partially covered by KBs.
  • The bar for number 0 (resp. 2) for country means that, of all tuples that are partially covered by KBs, 30% (resp. 50%) have coverage on 0 relationships (resp. 2 relationships).
  • This figure shows that for most tuples the number of relationships covered by KBs is good, e.g., 50% of partially covered tuples have 2 out of 3 relationships covered by KBs.
  • The preferred embodiment generates possible repairs with good recall (e.g., 90% for country, 80% for university, and 40% for soccer).
  • The present invention is a data cleaning system using knowledge bases and crowdsourcing. Correspondence between the possibly dirty database and the KBs is established by determining and validating holistic patterns. Then each tuple in the database is verified using a holistic pattern against a KB, with possible crowd involvement when the KB has no coverage. Experimental results demonstrate that the present invention is able to detect data errors, suggest high quality possible repairs, and repair erroneous tuples.
  • Embodiments of the present invention also encompass systems incorporating a database stored on a tangible computer readable medium which is coupled to the system by a communicative coupling, such as a computer network (e.g. the internet or a local area network).
  • the system may, therefore, include a network interface.

Abstract

An end-to-end system to annotate unknown type data instances using a knowledge base and crowdsourcing. A computer implemented method for cleaning a database instance using a plurality of holistic patterns, the database instance comprising a plurality of dirty tuples with unknown attribute data types, the method comprising: generating a plurality of candidate holistic patterns using the database instance and a knowledge base, the knowledge base comprising data-types and data-type relationships; determining a valid holistic pattern from the plurality of candidate holistic patterns using at least one of: the knowledge base; and a crowd of users which validate the data-types and the data-type relationships; annotating tuples in the database instance using the valid holistic pattern, wherein the method annotates the tuples with annotations indicating at least one of: knowledge base validated; jointly validated, wherein the crowd of users at least partially validate the holistic pattern; or erroneous; and repairing the erroneous annotated tuples to generate a clean database instance.

Description

Title: Data Cleaning Methods and Systems

Description of Invention

Background
As businesses generate and consume more data than ever, enforcing and maintaining data quality is of paramount importance to any organization. For example, in healthcare, inaccurate or incorrect data may threaten patient safety. Data cleaning is thus a key task towards improving data quality and is estimated to cost between 30% and 80% of a typical data warehouse project.
Data cleaning leads to correct information, which in turn results in quick and reliable searches. The performance increase resulting from clean data is well known in the art.
Various constraints are used to check data integrity. The variety of integrity constraints (ICs) includes check constraints, conditional functional dependencies (CFDs), and denial constraints (DCs), which are used to capture data errors in the form of IC violations.
A sample problem uses Fig 1 (Table 1) as a data set in which some tuples have erroneous attribute values. Consider a functional dependency f: B → C defined on the soccer players in Fig 1; it states that column B uniquely determines C. Tuples (t1, t3) violate f since they have the same B-value (Italy) but different C-values (Rome for t1[C] and Madrid for t3[C]). The error (Madrid in t3[C]) is captured by a violation involving the four values {t1[B], t1[C], t3[B], t3[C]}. Automatic repair methods resolve such violations by finding a consistent database with minimum changes; randomly changing one value in {t1[B], t1[C], t3[B], t3[C]} would be such a change. However, it is known in the art that such methods cannot guarantee correct fixes (i.e., the ground truth). To guarantee correct fixes, it is common practice to consult an authority, including trustworthy master data (i.e., knowledge bases, KBs) and domain experts. However, leveraging these two resources in data cleaning faces several limitations. (1) Experts have limited capacity, and thus cannot manually go through all violations and resolve them; (2) the semantics of ICs is hard to understand for people not familiar with databases: in our experience with IC validation, even simple ICs like FDs are not comprehensible to domain experts without a database background; and (3) KBs often have low coverage w.r.t. the database instance.
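For illustration only (this is not the patented method, and the helper names are hypothetical), a violation of f: B → C can be detected by grouping tuples on their B-value and flagging groups whose C-values disagree:

from collections import defaultdict

def fd_violations(tuples, lhs, rhs):
    # Group tuples by their lhs value and report groups whose rhs
    # values disagree, i.e., violations of the FD lhs -> rhs.
    groups = defaultdict(list)
    for t in tuples:
        groups[t[lhs]].append(t)
    return [g for g in groups.values()
            if len({t[rhs] for t in g}) > 1]

rows = [
    {"A": "Paolo Rossi", "B": "Italy", "C": "Rome"},       # t1
    {"A": "Klate", "B": "South Africa", "C": "Pretoria"},  # t2
    {"A": "Andrea Pirlo", "B": "Italy", "C": "Madrid"},    # t3
]
# t1 and t3 share B = Italy but differ on C, so they are reported
# together as one violating group.
print(fd_violations(rows, "B", "C"))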
Suppose a KB states that Italy hasCapital Rome and Spain hasCapital Madrid; this information cannot be directly used to fix the FD violation of t1 and t3, because more evidence is needed to identify which cell is erroneous.
The present invention seeks to provide an improved computer implemented method and a system to clean a database instance. A computer implemented method for cleaning a database instance using a plurality of holistic patterns, the database instance comprising a plurality of dirty tuples with unknown attribute data types, the method comprising:
generating a plurality of candidate holistic patterns using the database instance and a knowledge base, the knowledge base comprising data-types and data-type relationships;
determining a valid holistic pattern from the plurality of candidate holistic patterns using at least one of: the knowledge base; and a crowd of users which validate the data-types and the data-type relationships;
annotating tuples in the database instance using the valid holistic pattern, wherein the method annotates the tuples with annotations indicating at least one of: knowledge base validated; jointly validated, wherein the crowd of users at least partially validate the holistic pattern; or erroneous; and repairing the erroneous annotated tuples to generate a clean database instance.
One embodiment of the present invention addresses the above mentioned issues and is an end-to-end data cleaning system that effectively leverages prevalent trustworthy master data, i.e., KBs, and judiciously involves users to resolve conflicts and remove ambiguities. The resulting clean data instance improves the performance of the database in retrieving data, and reduces the load on the underlying hardware, by reducing the number of incorrect query results returned. We present three main modules which integrate with each other to perform the end-to-end cleaning: a holistic pattern module; a validation module; and an annotation module.
Drawings
So that the present invention may be more readily understood, embodiments of the present invention will now be described, by way of example, with reference to the accompanying drawings, in which:
Figure 1 is a sample dataset table T for soccer players.
Figure 2 is a flow diagram of the end-to-end workflow for updating a table T and enriching a knowledge base KB.
Figure 3 shows examples of holistic patterns derived using the holistic pattern module and example validations.
Figure 4 shows the coverage of attributes in the table T by a knowledge base in a holistic pattern.
Figure 5 shows a schematic of top-k results of various column data-types and their support scores.
Figure 6 shows candidate holistic pattern repairs for updating the tuple attributes in T.
Figure 7 shows results for the precision of holistic patterns using different ranking methods to generate top-k holistic patterns.
Figure 8 shows validated holistic patterns for the different data sets used in the experiments.
Figure 9 shows the data coverage of the knowledge bases of three different data sets, by three aspects of the preferred embodiment.
Figure 10 shows the effectiveness of the possible repairs using recall, defined as the number of correct possible repairs over the number of dirty tuples.
We present an overview of each of the modules which fit into the end-to-end system.
Holistic pattern module overview
We introduce holistic patterns; holistic patterns convey direct mappings from a given data instance to the constructs of a KB. Holistic patterns are used to explain relationships between tables. The coherence of the binary relationships, taken together, produces the pattern. The holistic pattern is more than the sum of its parts and represents a key component in determining the quality of fixes in a given data instance. For example, holistic patterns recognize that the types of columns A, B, and C are Person, Country, and Capital in the KB, respectively, and that two relationships are in place between these columns in the KB, namely, A is related to B via nationality and B is related to C via hasCapital. Given such patterns, we may immediately detect that t3 might be wrong, since it does not conform to the master data, in this case the KB, which states that Italy hasCapital Rome and Pirlo nationality Italy.
To determine holistic patterns, the data values in the database instance I are mapped to those in the KBs. There may be more than one possible mapping for a data value. For example, "Rome" could be either a city, a capital, or a club in the KBs. These ambiguous mappings generate several holistic patterns. The holistic patterns are used to explain table semantics using KBs (e.g., Yago or RxNorm). Each holistic pattern assigns a type to a column and a binary relationship to a pair of columns, up to the coverage provided by the KB.
Attributes in a holistic pattern are connected by binary relationships. This global information can be exploited using several methods; for example, in a preferred embodiment we use a rank-join based method to efficiently discover holistic patterns taking into account their coherence.
At the same time, to verify KB data, or to supplement it where there is inadequate coverage by the KB, the types and binary relationships in a holistic pattern are easily verifiable by humans using crowdsourcing.
Validation module overview
There are several possibilities of matching different types and binary relationships between a data set and a KB. This produces a plurality of candidate holistic patterns. In the present invention we use at least one of: crowdsourcing; and the KB to validate the best holistic pattern. Determining the best holistic pattern is discussed in the detailed description. Preferably, the crowd, users, are used to validate the best holistic pattern from the several candidate holistic patterns determined by the holistic pattern module.
The validation module preferably uses crowd involvement to validate holistic patterns, as well as to validate data when the KBs do not have enough coverage. To effectively involve the crowd, the validation module generates questions, preferably automatically, to ask the crowd. The questions must be easy to understand. Moreover, crowd involvement is optimized to reduce cost, as discussed in the detailed description.
KBs are usually incomplete in terms of covering all values in I, which complicates several tasks. First, it makes it more complex to find correct holistic patterns. Second, in case of a mismatch, it is unclear whether the values in I cannot be found in the KB because they are incorrect or because the KB is incomplete. Third, it makes data repairing harder, since the KB does not have enough correct values to suggest as possible repairs.
The validation module presents the crowd with specific questions to enable the validation module to select a valid holistic pattern from the candidate holistic patterns. Preferably, the validation module minimises the number of questions to ask and dynamically schedules the questions; this can be achieved, for example, using entropy theory.
The validation module maximizes the uncertainty reduction over the candidate holistic patterns and outputs a valid holistic pattern.

Annotation module overview
We annotate data of the dataset with different categories (i) correct data validated by the KB; (ii) correct data jointly validated by the KB and the crowd; and (iii) erroneous data jointly identified by the KB and the crowd. Data in categories (i) and (ii) are traditionally obtained through cumbersome manual work; these annotated correct data are an important yet scarce resource for IC discovery methods as well as machine learning-based data cleaning solutions. A positive side-effect from (ii) is that the crowd provides new facts, which can be used to enrich the KB. The validated holistic pattern and annotated data may be used to generate possible repairs and to further annotate erroneous data identified in (iii).
Using the valid holistic pattern and a given dataset to clean, we distinguish between the mismatches, i.e., data not conforming to the holistic pattern, due to the lack of coverage in the KB, and the actual data errors. In the first case, we enrich the KB, and in the latter we suggest possible fixes for actual errors.
We now discuss embodiments of the invention for each module and preferred aspects.
Referring now to Figure 2, there is shown: a data set with table T (201); a knowledge base KB (202), although this can be a collection of KBs; the holistic pattern module (203); the validation module (204); the annotation module (205); a crowd of users (206); a clean data set T' (207); and an enriched KB K (208), where enrichment comprises updating and/or adding knowledge to the knowledge base.
• The holistic pattern module (203) determines holistic patterns between a table T (201) and a KB (202).
• The validation module (204) leverages crowdsourcing to select the best holistic pattern.
• The annotation module (205) uses the selected holistic pattern and interacts with the KB and the crowd to annotate data. It may also generate possible repairs for erroneous tuples. New facts verified by the crowd are used to enrich the KBs.
To better understand how the present invention works, we use examples based on the above embodiment of the invention.
Example 1: Consider the table T for soccer players in Fig 1. T has no typed table header (101), i.e., each column is untyped, and thus its semantics is completely unknown. A KB K does have type information (e.g., Yago) and contains some information related to T, although it may not completely cover the table T.
The holistic pattern module determines candidate holistic patterns, with each containing the types (or classes) of columns and the relationships between pairs of columns.
A holistic pattern can be represented as a labelled graph (301). A node represents an attribute and its associated type, e.g., "C (Capital)" means that the type of attribute C in KB K is Capital; a directed edge between two nodes represents the relationship between two attributes, e.g., "B hasCapital C" means that the relationship from B to C in KB K is hasCapital. A column could have multiple candidate types, e.g., C could also be of type City. However, knowing that the relationship from B to C is hasCapital indicates that Capital is a better choice for C than City. Note that KBs are often incomplete. Hence, the discovered patterns may not cover all attributes of a table, e.g., attributes G, H of table T are not described by the pattern in (301).
Consider a case where the holistic pattern module finds two similar patterns: the one in (301), and a variant with type location for column C (302). To select the best holistic pattern, we forward the candidate holistic patterns to the validation module and ask the crowd the question "Which type (Capital or location) is more precise for the values (Rome, Pretoria and Madrid)?". If the crowd picks Capital, the holistic pattern in (301) is selected.
The crowd thus validates and selects the holistic pattern in (301). The valid holistic pattern is forwarded to the data annotation module, where each tuple is annotated. In one aspect of this embodiment there are three possible annotations, although additional annotations are possible.
The annotation module annotates the tuples in the data set using one of the following:
(i): Validated to be correct by the KB. For example, by mapping tuple t1 in table T to K, we find a full match, shown in (303). It means that, e.g., Paolo Rossi (resp. Italy) is in K as a Person (resp. country), and the relationship from Paolo Rossi to Italy is nationality. Similarly, all other information for t1 w.r.t. attributes A-F can be found in K. In such a case, we consider t1 to be correct w.r.t. the given pattern in (301), relative to attributes A-F.
(ii): Jointly validated to be correct by the KB and the crowd. The case in (i) is not always found and validated, in which case we annotate the data to be jointly validated by the KB and the crowd. Consider t2 about Klate, whose explanation is depicted in (304). In KB K, we can find that South Africa is a country and Pretoria is a Capital. However, the relationship from South Africa to Pretoria is missing. In such a case, we ask the crowd a boolean question such as "Does South Africa hasCapital Pretoria?". A positive answer from the crowd completes the semantic mapping that is missing from K. In this case, we consider t2 correct and generate a new fact "South Africa hasCapital Pretoria" from the crowd.
The new fact is used to enrich the KBs.
(iii): Erroneous tuple. Similar to case (ii), for tuple t3 about Pirlo, there is no link from Italy to Madrid in K. We then ask the crowd a boolean question such as "Does Italy hasCapital Madrid?". A negative answer from the crowd confirms that there is an error in t3; that is, the problem is not a missing relationship in the KB as in case (ii). At this point, however, we cannot decide which value in t3 is wrong, Italy or Madrid. The annotation module then derives evidence from K related to the relationship and nodes, e.g., Italy hasCapital Rome and Spain hasCapital Madrid, joins them, and generates a ranking of possible repairs for this tuple based on the number of changes. The tuples in T (Fig 1) are then updated using the relevant annotations, and where appropriate the KB is enriched using the answers from the crowd.
Detailed Description

We continue to use the example data set in Fig 1 in the description of embodiments of the present invention.
In a preferred embodiment of the present invention we consider knowledge bases (KBs) as RDF-based data consisting of resources, whose schema is defined using the Resource Description Framework Schema (RDFS), although we are not limited to RDFS.
A resource is an identifier for a real-world entity, e.g., Paolo Rossi, Rome, and Italy. A literal is a string, date, or number, e.g., 1.78. A property is a binary predicate that represents a relationship between two resources or between a resource and a literal. We denote by P(x, y) the property (a.k.a. relationship) between resource x and resource (or literal) y. For example, locatedIn(Milan, Italy) indicates that Milan is located in Italy.
An RDFS ontology distinguishes between classes and instances. A class is a resource that represents a set of objects, e.g., the class of countries. A resource that is a member of a class is called an instance of that class. The type relation associates an instance with a class, e.g., type(Italy) = country.
A more specific class c can be specified as a subclass of a more general class d by using the statement subclassOf(c, d). This means that all instances of c are also instances of d, e.g., subclassOf(Capital, location). Similarly, a property P1 can be a sub-property of a property P2 by the statement subpropertyOf(P1, P2). In this work, we consider KBs that are expressed with RDFS. While it is possible to use more expressive languages such as OWL, which could offer more reasoning opportunities, most KBs in industry, as well as popular ones such as Yago, Freebase, and DBpedia, use RDFS.
A holistic pattern of a relation R may be represented by a labelled directed graph G(V, E) with nodes V and edges E. Each node u ∈ V corresponds to an attribute in R, possibly typed, and each edge (u, v) ∈ E from u to v has a label P, denoting the relationship between the two attributes that u and v represent. We use Ai ∈ R to denote that Ai is an attribute in R. There are two basic semantic annotations on a relational table.
(1) Type of an attribute Ai. The type of an attribute is an annotation that represents the class of the attribute values in Ai. For example, the type of attribute B in (301) is country.
(2) Relationship from attribute Ai to attribute Aj. The relationship between two attributes is an annotation that represents how Ai and Aj are related through a directed binary relationship. For example, the relationship from attribute B to C in (301) is hasCapital.
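A holistic pattern can thus be held in a very small structure; the sketch below (illustrative field names, not taken from the patent) records the typed nodes and labelled edges of the pattern in (301):

from dataclasses import dataclass, field

@dataclass
class HolisticPattern:
    types: dict = field(default_factory=dict)  # attribute -> type
    rels: dict = field(default_factory=dict)   # (attr_u, attr_v) -> property

# Pattern (301), in part: B is a country, C is a Capital,
# and B hasCapital C.
phi = HolisticPattern()
phi.types["B"] = "country"
phi.types["C"] = "Capital"
phi.rels[("B", "C")] = "hasCapital"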
In the preferred embodiment a holistic pattern is a connected graph. When there exist multiple disconnected patterns, i.e., two holistic patterns that do not share any common node, we treat them independently. In the preferred embodiment we focus on discussing the case of a single holistic pattern.
For a holistic pattern φ, we denote by φu a node u in φ, by φ(u,v) an edge in φ, by φV all nodes in φ, and by φE all edges in φ.
Semantics. A tuple t of R matches a holistic pattern φ (with m nodes {v1, ..., vm}) w.r.t. a knowledge base K if there exist m attributes {A1, ..., Am} in R and m resources {x1, ..., xm} in K such that:
1. there is a one-to-one mapping from Ai (and xi) to vi for i ∈ [1, m];
2. Dist(t[Ai], xi) ≤ θ and subclassOf(type(xi), type(vi));
3. for each edge (vi, vj) in φE with property P, there exists a property P' between the corresponding resources xi and xj in K such that subpropertyOf(P', P).
If t matches φ, each corresponding attribute value of t maps to a resource r in the KB under a given distance function, and r is of a (sub-)type of the type given in φ (conditions 1 and 2). Moreover, for each relationship P that appears in the pattern, the property between the two corresponding resources must be P or one of its sub-properties (condition 3).
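A sketch of checking these three conditions, reusing the HolisticPattern structure above; kb is a hypothetical interface over the knowledge base (not a real library API), and theta is the distance threshold of condition 2:

def matches(t, phi, kb, theta=0.0):
    # kb is assumed to expose: lookup(label, max_dist) returning the
    # closest resource, type_of(resource), is_subclass(c, d) over the
    # transitive subclassOf relation, is_subproperty(p, q), and
    # properties_between(x, y).
    res = {}
    for attr, T in phi.types.items():          # conditions 1 and 2
        x = kb.lookup(t[attr], max_dist=theta)
        if x is None or not kb.is_subclass(kb.type_of(x), T):
            return False
        res[attr] = x
    for (u, v), P in phi.rels.items():         # condition 3
        if not any(kb.is_subproperty(p, P)
                   for p in kb.properties_between(res[u], res[v])):
            return False
    return True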
Example 2: Consider tuple t1 in Fig 1 and the pattern φs in (301). We have that t1 matches φs since, as depicted in (303), for each attribute value (e.g., t1[A] = Paolo Rossi and t1[B] = Italy) there is a resource in K that has a similar value with the corresponding type (e.g., Person for Paolo Rossi and country for Italy), satisfying conditions 1 and 2, and the property nationality holds from Paolo Rossi to Italy in K (condition 3). Similarly, conditions 1-3 hold for the other attribute values in t1. Hence, t1 matches φs.
We say that a tuple t of R partially matches a holistic pattern φ w.r.t. a knowledge base K if at least one of the conditions 1-3 does not hold.
Example 3: Consider t2 in Fig 1 and φs in (301). We say that t2 partially matches φs, since the property hasCapital from t2[B] = South Africa to t2[C] = Pretoria does not exist in K, i.e., condition 3 does not hold.
Given a table T (201), a KB K (202), and a holistic pattern φ between them, Fig. 4 shows how T's attributes and tuples are covered by the holistic pattern:
(1) Attributes covered by K (401). Attributes A-F in Fig. 4 are covered by the pattern in (301). We consider two cases for the tuples.
1. Fully covered by K. We annotate such tuples as semantically correct w.r.t. φ and K (303).
2. Partially covered by K. We use crowdsourcing to verify whether the uncovered information is caused by the incompleteness of K (304) or by actual data errors (305).
(2) Attributes not covered by K (402). Attributes G and H in Fig. 4 are not covered by the pattern in (301). In this case, these attributes are not annotated, due to the missing information in K.
For the uncovered attributes, we ask the crowd open-ended questions, such as " what are the possible relationships between Paolo Rossi and Striker?".
Candidate holistic patterns
The holistic pattern module determines candidate holistic patterns of a dataset, for example table T, from a KB K in the preferred embodiment. We first describe how to generate candidate types and candidate relationships. We then discuss the scoring function used to rank candidate holistic patterns. We use a rank-join method in the preferred embodiment to efficiently compute the top-k candidate holistic patterns from the candidate types/relationships.
For generality of determining a holistic pattern, we use an instance-based approach that does not require the availability of meaningful labels for the columns. For each column Ai of table T and for each value t[Ai] of a tuple t, we can map this value to several resources in the KB K whose type can be extracted. We issue the following SPARQL query, which returns the types and supertypes of entities whose label (i.e., value) is t[Ai]:

select ?ci where { ?xi rdfs:label t[Ai] . ?xi rdf:type/rdfs:subClassOf* ?ci }
Similarly, the relationships between two values t[Ai] and t[Aj] can be retrieved from a KB K via the two following SPARQL queries:

Q_res: select ?Pij where { ?xi rdfs:label t[Ai] . ?xj rdfs:label t[Aj] . ?xi ?Pij/rdfs:subPropertyOf* ?xj }

Q_lit: select ?Pij where { ?xi rdfs:label t[Ai] . ?xi ?Pij/rdfs:subPropertyOf* t[Aj] }

Query Q_res retrieves relationships where the second attribute is a resource in the KBs; Q_lit retrieves relationships where the second attribute is a literal value, i.e., untyped in the KBs.
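As one possible way to issue such queries against an RDF endpoint, here is a sketch using the SPARQLWrapper Python library; the endpoint URL is an assumption, and the naive string substitution stands in for proper query parameterization:

from SPARQLWrapper import SPARQLWrapper, JSON

def candidate_types(value, endpoint="http://localhost:8890/sparql"):
    # Return the types and supertypes of resources labelled `value`.
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery("""
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?c WHERE {
          ?x rdfs:label "%s" .
          ?x rdf:type/rdfs:subClassOf* ?c .
        }""" % value)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return [r["c"]["value"] for r in rows]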
We use Example 4 to describe an embodiment of the holistic pattern module using the above queries. Both Italy and Rome are stored as resources in the KBs, thus their relationship hasCapital can be discovered by query Q_res; numerical values such as 1.78 are stored as literals in the KBs, thus the relationship between Paolo Rossi and 1.78, if available, would be discovered by query Q_lit. In addition, for two values t[Ai] and t[Aj], we consider them as an ordered pair, thus in total four queries are issued between two values.
Given table T (201) and KB K (202), the support of a candidate type Tt for column Ai, denoted sup(Tt, Ai), is the percentage of tuples whose values on Ai are of type Tt. Similarly, the support of a candidate relationship P for column pair Ai and Aj, denoted sup(P, Ai, Aj), is the percentage of tuples whose two values on attributes Ai and Aj abide by the relationship P.
The candidate types and relationships are ranked according to their support. When two candidate types have the same support, we rank the more discriminative type higher, i.e., the one with fewer instances in K.
Example 5: Consider attribute B in Fig. 1. The top-3 candidate types for B in Yago (not shown) are economy, country and state, with support 1.0, 1.0 and 0.66, respectively. While economy and country have the same support, the former is ranked higher: economy has 356 resources in Yago, which makes it more discriminative than country, which has 5504 resources.
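A direct way to compute the type support scores, assuming a types_of(v) helper that returns the candidate types a value maps to in the KB (a sketch, not the patented implementation; relationship support is computed analogously over value pairs):

from collections import Counter

def type_support(column_values, types_of):
    # sup(T, A): fraction of tuples whose value in column A maps to
    # type T in the KB.
    counts = Counter()
    for v in column_values:
        for T in set(types_of(v)):
            counts[T] += 1
    n = len(column_values)
    return {T: c / n for T, c in counts.items()}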
A holistic pattern contains types of attributes and properties between attributes. The space of all candidate holistic patterns is huge (up to the Cartesian product of all possible types and relationships), which is expensive for human verification. Since not all candidate patterns make sense in practice, we only generate top-k holistic patterns, ranked by a meaningful scoring function. An example of such a function is the naive scoring model, which scores a candidate holistic pattern φ, consisting of a type Ti for each column Ai and a relationship Pij for each column pair Ai and Aj, by simply adding up all support scores of the candidate types and relationships in φ:

naiveScore(φ) = Σi sup(Ti, Ai) + Σi,j sup(Pij, Ai, Aj)
However, columns are not independent of each other. The choice of the type for a column Ai affects the choice of the relationship for column pair Ai and Aj, and vice versa.
Example 6: Consider the two columns B, C in Fig. 1: B has candidate types economy, country and state; C has candidate types City and Capital; and B, C has candidate relationship hasCapital. Intuitively, country as a candidate type for column B is more compatible with hasCapital than economy, because capitals are associated with countries, not economies. In addition, Capital is more compatible with hasCapital than City, because not all cities are capitals.
To quantify the "compatibility" between a type T and a relationship P, where T serves as the type for the resources appearing as subjects of the relationship P, we introduce the coherence score subSC(T, P). Similarly, to quantify the "compatibility" between a type T and a relationship P, where T serves as the type for the entities appearing as objects of the relationship P, we introduce the coherence score objSC(T, P).
Let resource(T) denote the set of resources in K that are of type T, subResource(P) the set of resources in K that participate as subject resources in a triple whose predicate is P, and objResource(P) the set of resources in K that participate as object resources in a triple whose predicate is P. Formally, for a given type T and a given relationship P in the KB K, we define the semantic coherence of T as a subject for P as

subSC(T, P) = |resource(T) ∩ subResource(P)| / |resource(T) ∪ subResource(P)|

and the semantic coherence of T as an object for P as

objSC(T, P) = |resource(T) ∩ objResource(P)| / |resource(T) ∪ objResource(P)|

Intuitively, the above scores reflect how suitable a type T is for all those resources that appear as the subject or the object of P.
Example 7: We show some example coherence scores computed from Yago. As one can see, the scores reflect our intuition in Example 6: country is more suitable than economy to act as a type for the subject resources of hasCapital, and Capital is more suitable than City to act as a type for the object resources of hasCapital:

subSC(economy, hasCapital) < subSC(country, hasCapital) = 0.17
objSC(City, hasCapital) < objSC(Capital, hasCapital)

However, as one can see, the coherence scores are relatively small. This is because KBs are usually incomplete. For example, if every country appearing in the KB had its capital stored, then subSC(country, hasCapital) would be 1.0.
We can add a weight factor wc to boost the contribution of the coherence scores. In the experiments we used a factor of two; alternative embodiments may use any suitable factor.
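As an illustration, a minimal Python sketch of the two coherence scores as set-overlap (Jaccard) computations is given below; the resource sets are assumed to have been materialized from K beforehand, which is an assumption of the sketch.

    def jaccard(s1, s2):
        """|s1 ∩ s2| / |s1 ∪ s2|, with an empty union scored as 0."""
        union = s1 | s2
        return len(s1 & s2) / len(union) if union else 0.0

    def sub_sc(type_resources, prop_subjects):
        """subSC(T, P): overlap between resource(T) and subResource(P)."""
        return jaccard(type_resources, prop_subjects)

    def obj_sc(type_resources, prop_objects):
        """objSC(T, P): overlap between resource(T) and objResource(P)."""
        return jaccard(type_resources, prop_objects)

    # Example 7 in code: subSC(country, hasCapital) would be 1.0 if every
    # country in K had its capital stored; KB incompleteness lowers it.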
Taking the coherence of types and relationships into account, we define the score of a holistic pattern φ as:

score(φ) = Σi sup(Ti, Ai) + Σi,j sup(Pij, Ai, Aj) + wc · Σi,j (subSC(Ti, Pij) + objSC(Tj, Pij))

Using this scoring model of holistic patterns, in an embodiment we retrieve the top-k holistic patterns without having to enumerate all candidate ones. We formulate this as a rank-join problem.
The preferred embodiment of the present invention takes as input to the holistic pattern module a table T, a KB K, and a number k, and produces the top-k holistic patterns using Method 1.
Method 1
Input: A table T, a KB K, and a number k.
Output: Top-k holistic patterns based on their scores
1: types(Ai) ← get a ranked list of candidate types for Ai
2: properties(Ai, Aj) ← get a ranked list of candidate relationships for Ai and Aj
3: let P be the top-k holistic patterns, initialized empty
4: for all Ti ∈ types(Ai), and Pij ∈ properties(Ai, Aj) do
5:   if |P| ≥ k and TypePruning(Ti) then
6:     continue
7:   generate all holistic patterns P′
8:   compute the score for each holistic pattern φ in P′
9:   update P using P′
10:  compute the upper bound B of unseen patterns, and let φk ∈ P be the holistic pattern with the lowest score
11:  halt when score(φk) ≥ B
12: return P
To start, each input list, i.e., candidate types for a column and candidate relationships for a column pair, is ordered according to the respective supports (lines 1-2). Two lists are joinable if they agree on one common column, and the scoring for joining is the coherence score between a type and a relationship. The rank-join algorithm scans the ranked input lists in descending order of their support scores (lines 3-4); holistic patterns are generated incrementally as we move down the input lists. Holistic patterns that cannot be used to produce top-k patterns are pruned (lines 5-6). For each join result, i.e., each holistic pattern φ, the score score(φ) is computed (lines 7-8). We also maintain an upper bound B on the scores of all unseen join results, i.e., holistic patterns (line 10). We terminate the join process if either we have exhaustively scanned every input list, or we have obtained top-k holistic patterns and the score of the kth holistic pattern is greater than or equal to B (line 11). Since each list is ranked, B can be computed by adding up the support scores at the current positions in the ranked lists, plus the maximum coherence scores of all relationships.
Lines 5-6 in Method 1 check whether a candidate type Ti for column Ai can be pruned, without generating holistic patterns involving Ti, by calling Method 2.
The intuition behind type pruning (Method 2) is that a candidate type Ti is useful only if it is more coherent with some relationship Pij than previously examined types for Ai. We first calculate the current minimum sum of coherence scores involving column Ai in the current top-k patterns (line 1). We then calculate the maximum possible sum of coherence scores involving type Ti (line 2). Ti can be pruned if we are sure that this maximum sum is still smaller than the current minimum sum of coherence scores involving column Ai (lines 3-6).

Example 8: Consider the rank-join graph in Fig. 5 for a table with just two columns B and C as in (100). Support scores are shown in the braces. Consider k = 2. Suppose we are currently examining type state for column B. We do not need to generate holistic patterns involving state, since the maximum coherence between state and hasCapital or locatedIn is less than the current minimum coherence score between the type of column B and the relationship between B, C in the current top-2 patterns. Suppose we are currently examining type whole for column C, and we have reached type state for B and hasCapital for relationship B, C. The bound score B = 0.7 + 0.9 + 0.5 + 0.18 + 0.12 = 2.4, where 0.18 (0.12) is the maximum coherence score between any type in types(B) (types(C)) and any relationship in properties(B, C). B is smaller than score(φ2) = 1.0 + 0.9 + 0.9 + 0.09 + 0.12 = 3.01. Therefore, we terminate early without having to go over each list.
Method 2
Input: current top-k holistic patterns P, candidate type Ti.
Output: a boolean value; true/false means that Ti can/cannot be pruned
1: curMinCohSum(Ai) ← minimum sum of all coherence scores involving column Ai in the current top-k P
2: maxCohSum(Ti) ← maximum sum of all coherence scores if the type of column Ai is Ti
3: if maxCohSum(Ti) < curMinCohSum(Ai) then
4:   return true
5: else
6:   return false
Method 1 is guaranteed to produce the top-k holistic patterns, since we keep the current top-k patterns in P and terminate only when we are sure that the join will not produce any new holistic pattern with a higher score. In the worst case, we still have to exhaustively go through all the ranked lists to produce the top-k holistic patterns. However, in most cases the top ranked holistic patterns involve only candidate types/relationships with high support, which are at the top of the lists. Since computing the coherence score for a type and a relationship is an expensive operation that requires set intersection and set union, we compute offline the coherence score for every type and every relationship in K. For each relationship, we also keep the maximum coherence score it can achieve with any type, to efficiently compute the bound B.
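The following Python sketch illustrates the rank-join of Method 1, simplified to a two-column table and omitting the type pruning of Method 2; the sorted input lists and the precomputed coherence maps are assumptions of the sketch, and a weight factor of wc = 2 is used as in the experiments.

    import heapq
    import itertools

    def rank_join_top_k(types_b, types_c, rels, sub_sc, obj_sc, k, wc=2.0):
        """A sketch of Method 1 for a two-column table. types_b, types_c and
        rels are lists of (candidate, support) sorted by descending support;
        sub_sc/obj_sc map (type, relationship) pairs to coherence scores."""
        max_sub = max(sub_sc.values(), default=0.0)
        max_obj = max(obj_sc.values(), default=0.0)
        top, seen = [], set()          # min-heap of (score, pattern)
        depth = max(len(types_b), len(types_c), len(rels))
        for d in range(depth):
            prefix_b, prefix_c, prefix_r = types_b[:d+1], types_c[:d+1], rels[:d+1]
            for (tb, sb), (tc, sc), (p, sp) in itertools.product(
                    prefix_b, prefix_c, prefix_r):
                pat = (tb, tc, p)
                if pat in seen:
                    continue
                seen.add(pat)
                # score(φ) = supports + wc * (subSC + objSC)
                s = sb + sc + sp + wc * (sub_sc.get((tb, p), 0.0)
                                         + obj_sc.get((tc, p), 0.0))
                heapq.heappush(top, (s, pat))
                if len(top) > k:
                    heapq.heappop(top)   # keep only the k highest scores
            # Upper bound B on any unseen pattern: frontier supports plus the
            # best coherence any type/relationship pair can achieve (line 10).
            bound = (prefix_b[-1][1] + prefix_c[-1][1] + prefix_r[-1][1]
                     + wc * (max_sub + max_obj))
            if len(top) == k and top[0][0] >= bound:
                break                    # early termination (line 11)
        return sorted(top, reverse=True)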
Once the top-k candidate holistic patterns are generated, they are forwarded to the validation module to be validated.
Validation module
The validation module takes the set P of candidate holistic patterns, a table T, a KB K, and a crowdsourcing framework (e.g., MTurk) as input, and identifies the most appropriate holistic pattern for T w.r.t. K, with the objective of minimizing the number of questions posed to the crowd.
A naive approach to generating crowdsourcing questions is to express the candidate holistic patterns as questions to the crowd, who would then select the best one. However, crowd workers can only answer simple questions, and holistic pattern graphs can be hard for crowd users to validate (e.g., 301). Hence, we decompose holistic patterns into two simple tasks: (1) column type validation, i.e., validating the type of a column; and (2) column pair relationship validation, i.e., validating the relationship between two columns.
In column type validation we use a set of candidate types candT(Ai) for column Ai; one type Ti ∈ candT(Ai) needs to be chosen. We consider a type Ti to be the correct type for Ai if the crowd has confirmed that at least an ε fraction of the tuples in T have type Ti in column Ai.
We formulate the following question to the crowd about the type of a column: What is the most accurate type of the highlighted column?, along with kt randomly chosen tuples from T and all candidate types from candT(Ai). A sample question is given below.
Q1: What is the most accurate type of the highlighted ( ) column? (A, (B), C, D, E, F, ...)
(Paolo Rossi, (Italy), Rome, Hellas Verona, Italian, Proto, ...)
(Andrea Pirlo, (Italy), Madrid, Juventus, Italian, Flero, ...)
o country
o economy
o state
Method 3, referred to as CrowdTypeVal, takes as input a table T and a set of candidate types candT(Ai) for a column Ai, and outputs the most appropriate type for this column according to the crowd. Since we require at least an ε fraction of tuples supporting a type, the algorithm first generates ε|T|/kt questions asking about the type of column Ai in ε|T| tuples, where |T| is the number of tuples in T and kt is the number of tuples shown in one question (e.g., kt = 3 in Q1) (lines 1-2). Votes for each candidate type are collected (lines 3-5), and the candidate type Ti with the maximum number of votes is recorded (line 6). If Ti's votes do not exceed ε|T|, additional questions are issued for the remaining set of tuples that have not been asked, until Ti's votes exceed ε|T| or there are no more tuples left (lines 8-12). If Ti reaches the minimum vote threshold, it is selected as the type for Ai (lines 13-14). Otherwise, none of the candidates in candT(Ai) has the minimum number of votes; the type of Ai is not validated, and we remove the type of Ai from all candidate holistic patterns (lines 15-16). Thus, the total number of questions asked is at most ⌈|T|/kt⌉, since no tuple is shown in more than one question.
Method 3
Input: Table T of relation R(A1, ..., An), and candT(Ai)
Output: One type Ti for column Ai
1: let Vote(x) = 0, for each x ∈ candT(Ai)
2: generate ε|T|/kt questions Q
3: for all questions qi ∈ Q do
4:   x ← the answer of qi
5:   Vote(x) ← Vote(x) + kt
6: let Ti ∈ candT(Ai) be the type with maximum Vote
7: let Ir be the remaining set of tuples that have not been asked
8: while Vote(Ti) < ε|T| and Ir ≠ ∅ do
9:   generate another question q
10:  x ← the answer of q
11:  Vote(x) ← Vote(x) + kt
12:  update Ti and Ir
13: if Vote(Ti) ≥ ε|T| then
14:  return Ti
15: else
16:  return null
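A compact Python sketch of Method 3 follows; ask_crowd is a hypothetical callback that posts one question (showing kt tuples together with the candidate types) and returns the type the workers selected.

    import math
    import random
    from collections import Counter

    def crowd_type_val(tuples, cand_types, ask_crowd, eps=0.1, kt=3):
        """CrowdTypeVal (Method 3): pick the type for a column once it is
        supported by at least eps*|T| tuples' worth of votes, or return None
        if no candidate reaches the threshold."""
        votes = Counter({t: 0 for t in cand_types})
        remaining = list(tuples)
        random.shuffle(remaining)
        threshold = eps * len(tuples)

        def ask_one():
            nonlocal remaining
            sample, remaining = remaining[:kt], remaining[kt:]
            votes[ask_crowd(sample, cand_types)] += kt

        for _ in range(math.ceil(threshold / kt)):     # lines 1-5
            if remaining:
                ask_one()
        best = max(votes, key=votes.get)               # line 6
        while votes[best] < threshold and remaining:   # lines 8-12
            ask_one()
            best = max(votes, key=votes.get)
        return best if votes[best] >= threshold else None   # lines 13-16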
We validate the relationship for column pairs in a similar fashion, where the candidate type Ti ∈ candT(Ai) is replaced with a candidate relationship for the column pair Ai and Aj. The details are omitted due to space constraints. We give an example question for relationship validation as follows.
Q2: What is the most accurate relationship for the highlighted columns? (A, (B, C), D, E, F, ...)
(Paolo Rossi, (Italy, Rome), Hellas Verona, Italian, Proto, ...)
(Andrea Pirlo, (Italy, Madrid), Juventus, Italian, Flero, ...)
o B hasCapital C
o C locatedIn B
Candidate types and candidate relationships are stored as URIs in KBs ; these are not directly consumable by the crowd. For example, the type capital for Column C is stored as http://yago- knowledge.org/resource/wordnet_capital_10851850, and the relationship hasCapital for column pair B and C is stored as http://yago- knowledge.org/resource/hasCapital. We look for types and relationship descriptions, e.g., Capital and hasCapital , by querying the KB for the labels of the corresponding URIs. If no label exists, we process the URI itself by removing the text before the last slash and punctuation symbols.
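A rough Python sketch of this URI fallback is shown below; the exact punctuation handling is an assumption, since the source only states that the text before the last slash and punctuation symbols are removed.

    import re

    def uri_to_description(uri):
        """Fallback when no rdfs:label exists: keep the text after the last
        slash and strip punctuation/underscores and numeric id fragments,
        e.g. '.../wordnet_capital_10851850' -> 'wordnet capital' (roughly)."""
        tail = uri.rsplit('/', 1)[-1]
        words = re.split(r'[\W_]+', tail)
        return ' '.join(w for w in words if w and not w.isdigit())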
For each question, we also expose some contextual attribute values that help crowd workers better understand the question. For example, we expose the values for attributes C, D and E in question Q1 when we validate the type of attribute B. If the table has a small number of attributes, we simply put all attributes in the question; otherwise, we identify attributes that are related to the ones in the question.
The validation module minimises the number of column type and relationship validation questions. We minimise the total number of questions needed to reach the correct holistic pattern by scheduling which column (resp. column pair) to validate first. Recall that once a column type (resp. relationship) is validated, we can prune from P all holistic patterns that have a different type (resp. relationship) for that column (resp. column pair). Therefore, a natural choice is to choose those columns (resp. column pairs) that result in the maximum uncertainty reduction. Consider φ as a variable, which takes values from P = {φ1, φ2, ..., φn}. We translate the score associated with each holistic pattern to a probability by normalizing the scores, i.e.,

Pr(φ = φi) = score(φi) / Σj score(φj)

We define the uncertainty of φ w.r.t. P as the entropy:

HP(φ) = −Σφi∈P Pr(φ = φi) log2 Pr(φ = φi)
Example 9: Consider an input list of five holistic patterns P = {φ1, ..., φ5} as follows, with the normalized probability of each holistic pattern shown in the last column.

     type(B)   type(C)   P(B,C)       score   Prob
φ1   country   Capital   hasCapital   2.8     0.35
φ2   economy   Capital   hasCapital   2       0.25
φ3   country   City      locatedIn    2       0.25
φ4   country   Capital   locatedIn    0.8     0.1
φ5   state     Capital   hasCapital   0.4     0.05
We use variables vAi and vAiAj to denote the type of column Ai and the relationship between Ai and Aj, respectively. The set of all variables is denoted V. In Example 9,

V = {vB, vC, vBC}, vB ∈ {country, economy, state}, vC ∈ {Capital, City}, vBC ∈ {hasCapital, locatedIn}.
The probability of an assignment of a variable v = a is obtained by aggregating the probabilities of those holistic patterns that have that assignment for v. For example,

Pr(vB = country) = Pr(φ1) + Pr(φ3) + Pr(φ4) = 0.35 + 0.25 + 0.1 = 0.7,
Pr(vB = economy) = 0.25, and Pr(vB = state) = 0.05.
After validating a variable v to have value a, we remove from P those holistic patterns that have a different assignment for v. The set of remaining holistic patterns is denoted Pv=a. Suppose column B is validated to be of type country; then PvB=country = {φ1, φ3, φ4}. Since we do not know what value a variable will take, we measure the expected reduction of uncertainty of variable φ after validating variable v, formally defined as:

E(ΔH(φ))(v) = HP(φ) − Σa Pr(v = a) HPv=a(φ)
In each iteration, we choose the variable v (column or column pair) with the maximum expected uncertainty reduction, i.e., E(ΔH(φ))(v). Finding such a variable v would require us to iterate over all variables in V and calculate their E(ΔH(φ))(v), which has a complexity of O(|V||P|^2). The following theorem simplifies the calculation to a complexity of O(|V||P|).
Theorem 1. The expected uncertainty reduction after validating a column (or column pair) v is the same as the entropy of the variable:

E(ΔH(φ))(v) = H(v), where H(v) = −Σa Pr(v = a) log2 Pr(v = a).
Method 4 describes the overall procedure for pattern validation. At each iteration: (1) we choose the best variable vbest to validate next, based on the expected reduction of uncertainty of φ (lines 4-10); (2) we remove from Pre those holistic patterns whose assignment for the validated variable differs from the validated value a (lines 11-12); and (3) we renormalize the probability distribution of the remaining holistic patterns in Pre (line 13). We terminate when we are left with only one holistic pattern (line 3).
Method 4
Input: a set of holistic patterns P
Output: one holistic pattern φ ∈ P
1: let Pre be the remaining holistic patterns, initialized to P
2: initialize all variables V, representing columns or column pairs, and calculate their probability distributions
3: while |Pre| > 1 do
4:   Ebest ← 0
5:   vbest ← null
6:   for all v ∈ V do
7:     compute the entropy H(v)
8:     if H(v) > Ebest then
9:       vbest ← v
10:      Ebest ← H(v)
11:  validate the variable vbest; suppose the result is a; let Pv=a be the set of holistic patterns with v = a
12:  Pre ← Pv=a
13:  normalize the probability distribution of the holistic patterns in Pre
14: return the only holistic pattern φ in Pre
Example 10: To validate the five patterns in Example 9, we first calculate the entropy of every variable:

H(vB) = −0.7 log2 0.7 − 0.25 log2 0.25 − 0.05 log2 0.05 = 1.07, H(vC) = 0.81, and H(vBC) = 0.93.

Thus column B is validated; say the answer is country. The remaining set of holistic patterns, with their normalized probabilities, is:

φ1 (country, Capital, hasCapital): 2.8 / 5.6 = 0.5
φ3 (country, City, locatedIn): 2 / 5.6 ≈ 0.36
φ4 (country, Capital, locatedIn): 0.8 / 5.6 ≈ 0.14

Now Pre = {φ1, φ3, φ4}. The new entropies for all variables are: H(vB) = 0, H(vC) = 0.93 and H(vBC) = 1. Therefore, column pair B, C is chosen; say the answer is hasCapital. We are now left with only one pattern, φ1, thus we return φ1.
Thus, we do not need to validate vC under our scheduling strategy. Furthermore, after validating certain variables, other variables may become less uncertain, thus requiring fewer questions to validate.
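The entropy computation and the most-uncertain-variable-first loop can be sketched in Python as follows; validate is a hypothetical callback that runs the crowd validation of Method 3 (or its relationship analogue) for one variable and returns the validated value.

    import math
    from collections import defaultdict

    def entropy(patterns, var):
        """H(v): entropy of a column or column-pair variable. patterns is a
        list of (assignment, probability) pairs, where assignment maps each
        variable to its value in that pattern. By Theorem 1 this equals the
        expected uncertainty reduction of validating var."""
        mass = defaultdict(float)
        for assignment, prob in patterns:
            mass[assignment[var]] += prob
        return -sum(p * math.log2(p) for p in mass.values() if p > 0)

    def muvf(patterns, variables, validate):
        """Most-uncertain-variable-first scheduling (a sketch of Method 4)."""
        while len(patterns) > 1:
            v = max(variables, key=lambda u: entropy(patterns, u))
            a = validate(v)                                     # line 11
            patterns = [(asg, pr) for asg, pr in patterns if asg[v] == a]
            z = sum(pr for _, pr in patterns)
            patterns = [(asg, pr / z) for asg, pr in patterns]  # line 13
        return patterns[0][0]

    # Example 9 in this representation:
    # P = [({'vB': 'country', 'vC': 'Capital', 'vBC': 'hasCapital'}, 0.35), ...]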
Annotation module
Once a holistic pattern has been validated, the validated holistic pattern is used to annotate tuples in the data set. Annotated tuples may then be repaired where appropriate.
In the preferred embodiment, the annotation module annotates tuples into three categories: (i) correct data validated by KBs; (ii) correct data jointly validated by KBs and the crowd; and (iii) erroneous data. In the preferred embodiment the annotation module uses the following steps to annotate the tuples in the dataset. We use the table in Fig. 1 in the examples that follow.
Step 1: Validation by KBs. For each tuple t and a holistic pattern φ, a SPARQL query checks whether t is fully covered by K. If it is fully covered, the annotation module annotates it as a correct tuple validated by the KB (case (i)). Otherwise, it goes to Step 2.
Example 11: Consider tuple t1 in Fig. 1 and the holistic pattern in (301); the following boolean SPARQL query is run:
ASK {
(Italy) rdf:type/rdfs:subClassOf* (country) .
(Paolo Rossi) rdf:type/rdfs:subClassOf* (Person) .
(Hellas Verona) rdf:type/rdfs:subClassOf* (footballclub) .
(Rome) rdf:type/rdfs:subClassOf* (Capital) .
(Italian) rdf:type/rdfs:subClassOf* (language) .
(Proto) rdf:type/rdfs:subClassOf* (City) .
(Italy) (hasCapital) (Rome) .
(Paolo Rossi) (hasNationality) (Italy) .
(Paolo Rossi) (hasClub) (Hellas Verona) .
(Paolo Rossi) (bornIn) (Proto) .
(Proto) (locatedIn) (Italy) .
(Italy) (hasOfficialLanguage) (Italian) . }
Assume that the above query returns true, which validates that the KB has full coverage of tuple t1, i.e., t1 is annotated as correct w.r.t. the given holistic pattern.
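A minimal sketch of issuing such a boolean query from Python is shown below, assuming the KB is served over a SPARQL HTTP endpoint (e.g., a local Jena Fuseki instance at a hypothetical URL) and using the SPARQLWrapper library; both the endpoint and the use of SPARQLWrapper are assumptions of the sketch.

    from SPARQLWrapper import SPARQLWrapper, JSON

    def tuple_fully_covered(ask_query,
                            endpoint="http://localhost:3030/kb/sparql"):
        """Step 1 of the annotation module: run a SPARQL ASK query against the
        KB endpoint; a True result means the tuple is fully covered and is
        annotated as correct (case (i)). The endpoint URL is hypothetical."""
        sparql = SPARQLWrapper(endpoint)
        sparql.setQuery(ask_query)
        sparql.setReturnFormat(JSON)
        result = sparql.query().convert()
        return bool(result.get("boolean", False))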
Step 2: Validation by KBs and Crowd. For each node (i.e., type) and edge (i.e., relationship) that is missing from K, the annotation module asks the crowd whether the relationship holds between the given two values. If the crowd says yes, the annotation module annotates the tuple as a correct tuple, jointly validated by the KB and the crowd (case (ii)). Otherwise, it is certain that this tuple contains errors (case (iii)).
Example 12: Consider tuple t2 (resp. t3) in Fig. 1 and the holistic pattern in (301), which has been validated. The information about whether Pretoria (resp. Madrid) is the capital of South Africa (resp. Italy) is not in the KB. To verify this information, we issue a boolean question Qt2 (resp. Qt3) to the crowd:
Qt2: Does South Africa hasCapital Pretoria?
o Yes
o No

Qt3: Does Italy hasCapital Madrid?
o Yes
o No
In this case, the crowd will say yes (resp. no) to question Qt2 (resp. Qt3).
Each question to which the crowd says yes (e.g., Qt2 in Example 12) yields a new fact that is not in the current KB. The annotation module collects all these facts and then enriches the given KB (207).
The annotation module may determine that a repair is required for the tuple and generates a list of candidate repairs. The enriched KB may be used to determine repairs.
We start by introducing two tools that are necessary to explain our approach to possible repair generation. Given a KB K and a holistic pattern G(V, E), an instance graph GI(VI, EI) is a graph with nodes VI and edges EI, such that (i) each node vI ∈ VI is a resource in K; (ii) each edge eI ∈ EI is a property in K; (iii) there is a one-to-one correspondence f from each node v ∈ V to a node vI ∈ VI, i.e., f(v) = vI; and (iv) for each edge (u, v) ∈ E, there is an edge (f(u), f(v)) ∈ EI with the same property.
An instance graph is one possible instantiation of a holistic pattern in a given KB.
Example 13: Figure 6, (601) and (602), shows two instance graphs of the holistic pattern of (301) in Yago, for players Andrea Pirlo and Xabi Alonso, respectively. Given an instance graph G, a tuple t, and a holistic pattern φ, the repair cost of aligning t to G w.r.t. φ, denoted cost(t, φ, G), is the number of value changes in t needed to align it with G.
The lower the repair cost, the closer the updated tuple is to the original tuple, and hence the more likely the repair is to be correct.
Example 14: Consider tuple t3 in Fig. 1, the holistic pattern φs in (301), and the two instance graphs G1 and G2 in (601) and (602). One can verify that the repair cost from t3 to G1 is 1, i.e., cost(t3, φs, G1) = 1, by updating t3[C] from Madrid to Rome. Similarly, the repair cost from t3 to G2 is 5, i.e., cost(t3, φs, G2) = 5.
Note that the possible repairs are ranked by repair cost in ascending order, e.g., the repair using G1 for t3 with cost 1 is ranked higher than the repair using G2 for t3 with cost 5. In this work, instead of using the top-1 repair to update a value, we provide top-k possible repairs and leave it to the users to pick which repair to apply. In the following, we describe embodiments that generate top-k repairs for every tuple identified as erroneous.
Naive Top-k Possible Repair Generation. In one aspect of the preferred embodiment, given a KB K and a holistic pattern φ, we compute offline all instance graphs G in K w.r.t. φ. For each tuple t, a naive solution is to compute the distance between t and each instance graph G in the set G. The k graphs with the smallest repair costs are returned as the top-k possible repairs.
Optimized Top-k Possible Repair Generation. In another aspect of the preferred embodiment, we only retrieve instance graphs that can yield reasonable repairs, i.e., the instance graphs whose values overlap with a given erroneous tuple.

We use inverted lists to perform the above optimization for top-k possible repairs. Each inverted list is a mapping from a key to a set G of instance graphs. Each key is a pair (A, a), where A is an attribute and a is a constant value; each instance graph G in the set G has value a on attribute A. For example, an inverted list w.r.t. G1 in (601) is: (country, Italy) → G1. For a tuple t, we optimize using the following steps.
Step 1. Find all possible repairs by retrieving all inverted lists related to t.
Step 2. Count the occurrences of instance graphs in all retrieved inverted lists.
Step 3. Return the top-k instance graphs with the highest occurrences.
Intuitively, Step 1 gets all possible repairs, since each instance graph in the retrieved inverted lists has some value overlap with the given tuple; moreover, every instance graph that has value overlap with t is indexed and is thus retrieved. Step 2 computes the repair cost of each possible repair, and Step 3 returns the top-k possible repairs.
Example 15: Consider t3 in Fig. 1 and pattern φs in (301). The inverted lists retrieved are given below.

(A, Andrea Pirlo) → G1
(B, Italy) → G1
(C, Madrid) → G2
(D, Juventus) → G1
(E, Italian) → G1
(F, Flero) → G1

It is easy to see that the number of occurrences of instance graph G1 is 5 and of G2 is 1. In other words, the repair cost of t3 w.r.t. G1 is 6 − 5 = 1, while the repair cost of t3 w.r.t. G2 is 6 − 1 = 5. Hence, the top-2 possible repairs for t3 are G1 and G2.
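The inverted-list optimization can be sketched in Python as follows; representing instance graphs simply as attribute-to-value dictionaries is an assumption of this sketch.

    from collections import Counter, defaultdict

    def build_index(instance_graphs):
        """Index every instance graph under each of its (attribute, value)
        pairs; instance_graphs maps a graph id to its {attribute: value}
        assignment."""
        index = defaultdict(set)
        for gid, assignment in instance_graphs.items():
            for key in assignment.items():
                index[key].add(gid)
        return index

    def top_k_repairs(t, index, k):
        """Steps 1-3: retrieve the lists keyed by t's values (Step 1), count
        graph occurrences (Step 2), and return the k graphs with the lowest
        repair cost, i.e., |t| minus the overlap (Step 3)."""
        overlap = Counter()
        for key in t.items():
            for gid in index.get(key, ()):
                overlap[gid] += 1
        ranked = sorted((len(t) - c, gid) for gid, c in overlap.items())
        return ranked[:k]

    # Example 15: with G1 overlapping t3 on 5 of 6 attributes and G2 on 1,
    # top_k_repairs returns [(1, 'G1'), (5, 'G2')].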
The annotation module hence outputs two possible repairs for t3: G1 and G2.

Experimental Study
The study demonstrates the effectiveness of the preferred embodiment of the present invention using holistic patterns, and shows the efficiency of the holistic pattern and validation modules. Furthermore, the study demonstrates the effectiveness of data annotation and its use to repair and clean a dataset. In the experiments we use three datasets:
1. Country contains countries and capitals with currency and language, obtained from a Wikipedia page (http://en.wikipedia.org/wiki/List_of_countries_and_capitals_with_currency_and_language);
2. Soccer contains information about soccer players and their clubs, scraped from the Web (http://www.premierleague.com/en-gb.html, http://www.legaseriea.it/en/); and
3. University contains all universities in the US with their addresses (http://ope.ed.gov/accreditation/GetDownLoadFile.aspx).
Knowledge bases. We used two KBs, Yago and DBpedia, transformed to Jena format (https://jena.apache.org/). We ran all datasets on both KBs, and chose for each dataset the KB with the better coverage: DBpedia was used for soccer and university, and Yago was used for country.

Pattern determination
We use three different embodiments to determine holistic patterns. The first embodiment, denoted Support, ranks the candidate types and relationships solely on their support scores. The second embodiment, denoted MaxLike, infers the type of a column and the relationship between a column pair separately, using a maximum likelihood approach. The third embodiment is the preferred embodiment, denoted RankJoin, described in Methods 1 to 4. We denote by RankJoinBase the rank-join method without the type pruning optimization presented in Method 2.
Gold standard. For a given table T and a given KB K, we retrieved all candidate types and relationships by issuing SPARQL queries, and manually annotated each candidate as vital, okay, or incorrect. For example, for column B in Fig. 1, country is vital; economy is okay; and state is incorrect.

Evaluation. We compared the top-k holistic patterns returned by each embodiment with the labelled candidates in the gold standard. For MaxLike and Support, top-k holistic patterns were constructed based on the maximum likelihood scores and the support scores, respectively. To compute precision, a candidate type or relationship has a score of 1.0 if it was labelled vital; 0.5 if it was labelled okay; and 0 otherwise.
As shown in Fig. 7 (701 to 703), Support has the lowest precision, because it looks for the candidates that cover the largest number of tuples, which tend to be overly general types or relationships, such as the root type owl:Thing in Yago. MaxLike does better than Support because it maximizes the likelihood of the values given the candidates. RankJoin is better still, because it takes into account the coherence scores between types and relationships and seeks the most coherent patterns.
As shown in Fig. 7 (704), RankJoin, MaxLike and Support have similar running times, since the dominating computation is retrieving candidates from the KBs, which is common to all embodiments. RankJoinBase is quite expensive, since it needs to enumerate holistic patterns despite the advantage of the rank join, which shows that type pruning is extremely beneficial.
It is evident that the more expressive the pattern language is, the more errors it can possibly capture. On the other hand, allowing more expressive patterns may incur a high cost in pattern discovery and data annotation.
The holistic pattern module may produce holistic pattern variants depending on the cost required; either one or all holistic pattern variants may be considered as aspects of the present invention. We use the second aspect in the preferred embodiment. In a first aspect, a first type of pattern contains only the type of each column, denoted Type1.
In a second aspect, a second type of pattern, used in the embodiments above, contains the type of each column and the relationship of every column pair, denoted Type2.
In a third aspect, a third type of pattern also considers indirect relationships between a column pair, denoted Type3. As an example of the third type of pattern, consider a table with two columns A and B, representing pairs of actors that have played in the same movie. The pattern connects Column A of type Actor and Column B of type Actor via an intermediate node X of type Movie.
We measured the running time for generating candidate patterns, as well as the size of the patterns. Pattern size is defined as the sum of the number of candidate types, the number of candidate relationships, and the number of candidate indirect relationships above a minimum support threshold (0.1). Patterns with a larger size have richer semantics, and are thus more powerful in capturing errors.
(Table: running time and pattern size for Type1, Type2 and Type3 on the three datasets.)
The above measurements show that our current second aspect allows us to capture the semantics of the table and to validate and repair the data efficiently. Moreover, more expressive pattern languages, such as those allowing indirect relationships between two columns in the pattern, considerably increase the cost of pattern discovery without much gain in capturing more errors in the data. Furthermore, validating data using Type3 is much harder for the crowd (e.g., "does there exist a movie that actor 1 and actor 2 are both in?").
Validation module
Methods. To evaluate the savings in crowdsourcing pattern validation achieved by our scheduling method, we compared our method (denoted MUVF, short for most-uncertain-variable-first) to a baseline (denoted AVI) that validates every variable independently.
Crowd. We validated the country and university datasets using an internal crowd, i.e., a group of researchers from QCRI and the University of Waterloo; and we validated soccer using Amazon Mechanical Turk (MTurk), a widely used crowdsourcing platform. Each question was asked three times, and the majority answer was taken.
We required at least 10% of tuples to validate a type or a relationship. For each dataset, we compared the number of variables that needed to be validated until only one holistic pattern was left. We also report the total number of questions each method asked to validate all variables.
(Table: number of validated variables and total questions for MUVF and AVI on each dataset.)
Considering the above data, MUVF performs consistently better than AVI in terms of the number of questions needed, for two reasons. First, MUVF may reduce the number of variables to validate, thus saving a considerable number of questions (e.g., for country and soccer), since validating each variable requires a number of questions linear in the number of tuples. Second, even when the number of variables to validate is the same, MUVF still wins (e.g., for university), since after validating certain variables, other variables become less uncertain, thus requiring fewer questions to reach the minimum support threshold. Figure 8 (801 to 803) shows the validated patterns for the three datasets.
Annotation module
We evaluated data annotation in three aspects: information covered by KBs, accuracy of data validation via crowdsourcing, and effectiveness of possible repairs.
Data validated via KBs. Figure 9 shows the data coverage of the KBs for the three datasets. For each dataset, the left bar in Fig. 9 (901) depicts the percentage of tuples that are fully/partially covered by the KBs. It shows that for country, about 40% of tuples are fully covered; for soccer and university, only around 5% of tuples are fully covered. For each dataset, the right bar in Fig. 9 (901) depicts the percentage of relationships that are covered by the KBs. For example, for a tuple t and the pattern in Fig. 8 (801), if only two relationships appear in a KB, its coverage on relationships is 2/3. This shows that although for some datasets (e.g., soccer and university) only a small number of tuples are fully covered, around 50% of the relationships between tuples are covered.
Figure 9 (902) complements the above experiment by showing, at a finer granularity, the percentage of tuples that are partially covered by the KBs. For example, the bar for number 0 (resp. 2) for country means that, of all tuples that are partially covered by the KBs, 30% (resp. 50%) have 0 relationships (resp. 2 relationships) covered. This figure shows that for most tuples the number of relationships covered by the KBs is good, e.g., 50% of partially covered tuples have 2 out of 3 relationships covered by the KBs.
             True relationships   False relationships
country      98%                  88%
soccer       97%                  100%
university   99%                  100%
Data validated via KBs and crowd. In order to quantify data validation using the crowd, we did it in a controlled way, i.e., we needed to know the ground truth to evaluate the reliability of the crowd. We first manually examined each dataset to identify 37 tuples for country, 67 tuples for soccer, and 65 tuples for university that are ensured to be correct. We then randomly picked 100 relationships that are ensured to be correct for each dataset, shown as "True relationships" in the above table. Moreover, we injected errors from the active domain into these selected tuples, and picked 100 questions for each dataset that are ensured to be wrong, shown as "False relationships" in the above table. We mixed the two categories of questions and posted them to our internal crowd. The results in the above table show that an internal crowd is good at answering questions containing both true and false relationships.
Possible repairs. This experiment validates the effectiveness of the possible repairs. We first generated top-k repairs for each dirty tuple. If the ground truth of a dirty tuple falls within its top-k repairs, we consider these possible repairs correct. Here, recall is defined as the number of correct possible repairs over the number of dirty tuples. The results for the three datasets are given in Fig. 10, where the x-axis is the parameter k and the y-axis is the recall. For example, when k = 1, 60% of university tuples' ground truth is among the generated possible repairs. Figure 10 shows that even when k is small (e.g., 3), the preferred embodiment generates possible repairs with good recall (e.g., 90% for country, 80% for university, and 40% for soccer).
The present invention is a data cleaning system using knowledge bases and crowdsourcing. Correspondence between the possibly dirty database and the KBs is established by the modules that determine and validate the holistic patterns. Each tuple in the database is then verified using a holistic pattern against a KB, with possible crowd involvement when the KB has no coverage. Experimental results demonstrate that the present invention is able to detect data errors, suggest high-quality possible repairs, and repair erroneous tuples.
Embodiments of the present invention also encompass systems incorporating a database stored on a tangible computer readable medium which is coupled to the system by a communicative coupling (such as a computer network (e.g. the internet or a local area network)). The system may, therefore, include a network interface.
When used in this specification and claims, the terms "comprises" and
"comprising" and variations thereof mean that the specified features, steps or integers are included. The terms are not to be interpreted to exclude the presence of other features, steps or components.
The features disclosed in the foregoing description, or the following claims, or the accompanying drawings, expressed in their specific forms or in terms of a means for performing the disclosed function, or a method or process for attaining the disclosed result, as appropriate, may, separately, or in any combination of such features, be utilised for realising the invention in diverse forms thereof. Techniques for Implementing Aspects of Embodiments of the Invention:
[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[2] P. Bohannon, W. Fan, M. Flaster, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD, 2005.
[3] A. Calì, G. Gottlob, and A. Pieris. Advanced processing for ontological queries. PVLDB, 2010.
[4] F. Chiang and R. J. Miller. Discovering data quality rules. PVLDB, 2008.
[5] X. Chu, I. F. Ilyas, and P. Papotti. Discovering denial constraints. PVLDB, 2013.
[6] X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, 2013.
[7] G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: Consistency and accuracy. In VLDB, 2007.
[8] D. Deng, Y. Jiang, G. Li, J. Li, and C. Yu. Scalable column concept determination for web tables using large knowledge bases. PVLDB, 2013.
[9] O. Deshpande, D. S. Lamba, M. Tourn, S. Das, S. Subramaniam, A. Rajaraman, V. Harinarayan, and A. Doan. Building, maintaining, and using knowledge bases: a report from the trenches. In SIGMOD Conference, 2013.
[10] W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Interaction between record matching and data repairing. In SIGMOD, 2011.
[11] W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. VLDB J., 21(2), 2012.
[12] W. Fan. Dependencies revisited for improving data quality. In PODS, 2008.
[13] I. F. Ilyas, W. G. Aref, and A. K. Elmagarmid. Supporting top-k join queries in relational databases. VLDB J., 13(3), 2004.
[14] I. F. Ilyas, V. Markl, P. J. Haas, P. Brown, and A. Aboulnaga. CORDS: Automatic discovery of correlations and soft functional dependencies. In SIGMOD, 2004.
[15] K. Kerr, T. Norris, and R. Stockdale. Data quality information and decision making: a healthcare case study. In ACIS, 2007.
[16] G. Limaye, S. Sarawagi, and S. Chakrabarti. Annotating and searching web tables using entities, types and relationships. PVLDB, 2010.
[17] A. Poggi, D. Lembo, D. Calvanese, G. De Giacomo, M. Lenzerini, and R. Rosati. Linking data to ontologies. J. Data Semantics, 10, 2008.
[18] V. Raman and J. M. Hellerstein. Potter's Wheel: An interactive data cleaning system. In VLDB, 2001.
[19] B. Trushkowsky, T. Kraska, M. J. Franklin, and P. Sarkar. Crowdsourced enumeration queries. In ICDE, 2013.
[20] P. Venetis, A. Y. Halevy, J. Madhavan, M. Pasca, W. Shen, F. Wu, G. Miao, and C. Wu. Recovering semantics of tables on the web. PVLDB, 2011.
[21] M. White. Enterprise information portals. The Electronic Library, 18(5), 2000.
[22] M. Yakout, L. Berti-Equille, and A. K. Elmagarmid. Don't be SCAREd: use SCalable Automatic REpairing with maximal likelihood and bounded changes. In SIGMOD, 2013.
[23] M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. PVLDB, 2011.
[24] C. J. Zhang, L. Chen, H. V. Jagadish, and C. C. Cao. Reducing uncertainty of schema matching via crowdsourcing. PVLDB, 2013.

Claims

1. A computer implemented method for cleaning a database instance using a plurality of holistic patterns, the database instance comprising a plurality of dirty tuples with unknown attribute data types, the method comprising:
generating a plurality of candidate holistic patterns using the database instance and a knowledge base, the knowledge base comprising data-types and data-type relationships;
determining a valid holistic pattern from the plurality of candidate holistic patterns using at least one of: the knowledge base; and a crowd of users which validate the data-types and the data-type relationships;
annotating tuples in the database instance using the valid holistic pattern, wherein the method annotates the tuples with annotations indicating at least one of: knowledge base validated; jointly validated, wherein the crowd of users at least partially validates the holistic pattern; or erroneous; and
repairing the erroneous annotated tuples to generate a clean database instance.
2. The method of claim 1 further comprising: matching data-types in the knowledge base to unknown attribute data-types in the database instance using the attribute values.

3. The method of claim 2 further comprising: assigning a binary relationship value between the matched attribute data-types.
4. The method of any preceding claim wherein the method further comprises updating the knowledge base using at least one of: the valid holistic pattern; and the crowd of users.
5. The method of claim 1 wherein generating candidate holistic patterns comprises determining support values of a candidate type and a candidate relationship and ranking the candidate holistic patterns using the support values.
6. The method of claim 5 wherein the plurality of holistic patterns further comprise determining coherence values using the support values.
7. The method of claim 6 wherein the holistic pattern module further comprises determining a score of the candidate holistic pattern.
8. The method of claim 7 wherein the score of the candidate holistic pattern is determined offline.
9. The method of any preceding claim further comprising dynamically generating questions to ask the crowd of users using the plurality of candidate holistic patterns.
10. The method of claim 9 further comprising aggregating answers to the questions and determining the valid holistic pattern.

11. The method of any preceding claim further comprising:
generating a plurality of instance graphs of the erroneous tuples using the knowledge base and the valid holistic pattern; and
determining a repair cost of the erroneous tuple based on each instance graph and valid holistic pattern combination.
12. The method of claim 11 wherein the method comprises ranking candidate repair tuples computed from an inverted repair list based on repair cost.

13. A tangible machine readable storage medium comprising instructions that, when executed, cause a machine to perform the method of any one of claims 1 to 12.
14. A system for cleaning a database instance using a plurality of holistic patterns, the database instance comprising a plurality of dirty tuples with unknown attribute data types, the system comprising:
a holistic pattern module configured to generate a plurality of candidate holistic patterns using the database instance and a knowledge base, the knowledge base comprising the data-types and data-type relationships;
a validation module configured to determine a valid holistic pattern from the plurality of candidate holistic patterns using at least one of: the knowledge base; and a crowd of users which validate the data-types and the data-type relationships;
an annotation module configured to annotate tuples in the database instance using the valid holistic pattern, wherein the annotated tuples indicate at least one of: knowledge base validated; jointly validated, wherein the crowd of users at least partially validates the holistic pattern; or erroneous, and
a repair module operable to generate a clean database instance using the annotated erroneous tuples.
15. The system of claim 14 further comprising: the holistic pattern module operable to match data-types in the knowledge base to unknown attribute data-types in the database instance using the attribute values.
16. The system of claim 15 further comprising: the holistic pattern module operable to assign a binary relationship value between the matched attribute data-types.
17. The system of any one of claims 14 to 16 wherein the system is further operable to update the knowledge base using at least one of: the valid holistic pattern; and the crowd of users.
18. The system of claim 14 wherein the holistic pattern module is operable to generate candidate holistic patterns having a support value for: candidate data type; and candidate relationship, and ranking the candidate holistic patterns using the support values.
19. The system of claim 18 wherein the plurality of holistic patterns further comprise determining coherence values using the support values.
20. The system of claim 19 wherein the holistic pattern module further comprises determining a score of the candidate holistic pattern.
21 . The system of claim 20 wherein the score of the candidate holistic pattern is determined offline.
22. The system of any one of claims 14 to 21 further comprising the validation module operable to dynamically generate questions to ask the crowd of users using at least one of: the plurality of candidate holistic patterns; and the data instance.

23. The system of claim 22 further comprising aggregating answers to the questions and determining the valid holistic pattern.

24. The system of any one of claims 14 to 22 further comprising:
generating a plurality of instance graphs of the erroneous tuples using the knowledge base and the valid holistic pattern; and
determining a repair cost of the erroneous tuple based on each instance graph and valid holistic pattern combination.
25. The system of claim 24 wherein the system comprises ranking candidate repair tuples in an inverted repair list based on repair cost.
PCT/GB2014/051670 2014-05-30 2014-05-30 Data cleaning methods and systems WO2015181511A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/GB2014/051670 WO2015181511A1 (en) 2014-05-30 2014-05-30 Data cleaning methods and systems


Publications (1)

Publication Number Publication Date
WO2015181511A1 true WO2015181511A1 (en) 2015-12-03

Family

ID=50942701


Country Status (1)

Country Link
WO (1) WO2015181511A1 (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120323866A1 (en) * 2011-02-28 2012-12-20 International Machines Corporation Efficient development of a rule-based system using crowd-sourcing
GB2502768A (en) * 2012-04-12 2013-12-11 Qatar Foundation Correcting database errors



Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107203593A (en) * 2017-04-26 2017-09-26 深圳市华傲数据技术有限公司 A kind of data cleaning method and system
US10585933B2 (en) 2017-08-16 2020-03-10 International Business Machines Corporation System and method for classification of low relevance records in a database using instance-based classifiers and machine learning
US11227003B2 (en) 2017-08-16 2022-01-18 International Business Machines Corporation System and method for classification of low relevance records in a database using instance-based classifiers and machine learning
US11200452B2 (en) 2018-01-30 2021-12-14 International Business Machines Corporation Automatically curating ground truth data while avoiding duplication and contradiction
CN110543634A (en) * 2019-09-02 2019-12-06 北京邮电大学 corpus data set processing method and device, electronic equipment and storage medium
CN110543634B (en) * 2019-09-02 2021-03-02 北京邮电大学 Corpus data set processing method and device, electronic equipment and storage medium


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 14730180; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 14730180; Country of ref document: EP; Kind code of ref document: A1)