WO2020008180A1 - Method and system for integrating data sets - Google Patents


Publication number
WO2020008180A1
Authority
WO
WIPO (PCT)
Prior art keywords
mappings
attributes
mapping
target
source
Application number
PCT/GB2019/051866
Other languages
French (fr)
Inventor
Martin Koehler
Norman PATON
Lacramioara MAZILU
Alvaro FERNANDES
Original Assignee
Data Value Factory Limited
Application filed by Data Value Factory Limited filed Critical Data Value Factory Limited
Publication of WO2020008180A1 publication Critical patent/WO2020008180A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/25 Integrating or interfacing systems involving database management systems
    • G06F 16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/21 Design, administration or maintenance of databases
    • G06F 16/211 Schema design and management

Definitions

  • the invention relates to a method and system for integrating data sets, for example by populating a target database with information from a plurality of sources.
  • Algorithms exist for generating mappings for populating a target database schema from source schemas. Such algorithms may be informed by different information, such as schema matches, integrity constraints or instance data.
  • the Clio project as described in US7010539 and as reviewed, for example, in "Clio: Schema mapping creation and data exchange" by Fagin et al., published in Conceptual Modeling: Foundations and Applications, volume 5600 of LNCS, pages 198-236. This project has given rise to comprehensive results on subjects such as data exchange, merging of mappings, debugging of mappings and mapping verification.
  • mapping generation can benefit from precise and exhaustive descriptions of relationships within the source schemas, as well as human-curated matches between the source and the target.
  • the goal is to provide tool support that helps experts curate matches and develop mappings between (potentially complex) source and target schemas for example as described in “Information integration in the enterprise” by Bernstein et al published in CACM, 51 (9): 72-79, 2008.
  • Such diverse relationships give rise to a large space of candidate mappings, which are likely of variable utility. Where there can be arbitrary numbers of source schemas, there tends to be less reliable or comprehensive information with which to inform mapping generation, and the focus has tended to shift toward managing the resulting uncertainty that arises from many alternative mappings.
  • User feedback may be used to assess whether mappings are producing appropriate values. For example, given an initial set of mappings between individual sources and a target that could have been produced by Clio, "Incrementally improving dataspaces based on user feedback" by Belhajjame et al., published in Inf. Syst., 38(5):656-687, 2013, explores how these mappings can be combined, for example by union, intersection or difference, to produce new mappings that may have higher precision or recall than their predecessors with respect to user feedback on the correctness of result tuples.
  • the meta-data about the target schema may include tuples annotated to the effect that they are true positives, false positives or false negatives, and a search seeks to identify sets of mappings that together maximise precision subject to some recall threshold (or vice versa).
  • the user provides additional information about the meta-data for the target schema in the form of example values, and subsequently feedback on results of generated mappings. As feedback is obtained on the correctness or otherwise of result tuples, this is used to refine the scores of candidate mappings.
  • mappings are not assumed to be correct, but rather are scored by users based on the suitability of their results.
  • mappings for a probabilistic mediated schema.
  • there is no pre-existing target schema and the technical focus is on aligning source attributes with counterparts in a mediated schema, and not so much on the generation of mappings that may combine data from different sources.
  • mappings tend not to be expressive.
  • The approach described in that work focuses the integration effort at the instance level, through entity resolution and fusion, with ongoing human input, for example in the form of training data.
  • a data integration method for populating a target database with data from a plurality of source databases, with each source relation in a source database having a plurality of source attributes, the method comprising: inputting a target schema for the target database, wherein the target schema comprises a plurality of target attributes; generating a set of mappings which populates the target database with data from at least a subset of the plurality of source relations by defining a plurality of sub-solutions, wherein for each sub-solution a different number of the source relations are merged and each sub-solution comprises a set of intermediate mappings to map data from at least a subset of the source relations to the target schema using source attributes of each source relation which match target attributes in the target schema; and iteratively solving each of the plurality of sub-solutions to generate, at each iteration, a set of intermediate mappings which maps more source relations to the target schema than a set of intermediate mappings generated in the previous iteration, by merging pairs of mappings from each set of intermediate mappings generated in the previous iterations.
  • In mapping selection, the preferred mappings are selected based on various features such as size of output, fitness of result, etc. Examples of such techniques can be found in the work of Edward Abel et al. in "User driven multi-criteria source selection", published in Information Sciences, Volumes 430-431, 2018, pages 179-199, and Khalid Belhajjame et al., "Feedback-based annotation, selection and refinement of schema mappings for dataspaces", 13th International Conference on Extending Database Technology, ACM, 2010: 573-584.
  • the next step may comprise using the base sub-solution to build a first set of intermediate mappings, e.g. by selectively merging pairs of mappings from the base set of intermediate mappings.
  • the next step may then comprise using both the base set of intermediate mappings and the first set of intermediate mappings to build a second set of intermediate mappings, for example by selectively merging pairs of mappings from the base set with the first set.
  • the final set of mappings which is generated by the iterative process may thus comprise a selected set of intermediate mappings which were generated in previous steps and may comprise each of the mappings generated in each iteration.
  • the final set of mappings may thus be a set of best-effort mappings which might include an overall solution which merges all input source relations but more typically comprises a subset of the source relations in light of overlap between the source relations.
  • the method may comprise identifying a plurality of candidate keys within the plurality of sources, wherein each candidate key is a single or a plurality of source attributes.
  • the method may further comprise calculating inclusion dependencies for the sources, wherein an inclusion dependency is a measure of the overlap between the values of two sets of source attributes.
  • the candidate keys and/or the inclusion dependencies may be inputs to the method.
  • the candidate keys and the inclusion dependencies may be considered to be profiling data and/or metadata for the sources.
  • Each candidate key may be defined as a set of source attributes which has unique values within that source relation, i.e. a candidate key for a relation in the form of a table with rows and columns may be a column (or a combination thereof) that has unique values in the relation.
  • An inclusion dependency may be partial, i.e., there is some overlap between the values of a set of source attributes and the values of another set of attributes, or total, i.e., the values of one set are fully contained by the values of another set of attributes.
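The two kinds of profiling metadata described above (single-column candidate keys and partial inclusion dependencies) can be illustrated with a minimal Python sketch; this is not the patent's implementation, and the helper names and table contents are hypothetical, loosely following the Figure 1b example:

```python
def candidate_keys(table):
    """Single-column candidate keys: columns whose values are all unique."""
    return [col for col, values in table.items()
            if len(set(values)) == len(values)]

def inclusion_dependency(dependent, referenced):
    """Fraction of the dependent column's distinct values found in the
    referenced column: 1.0 is a total IND, anything in (0, 1) is partial."""
    dep, ref = set(dependent), set(referenced)
    return len(dep & ref) / len(dep) if dep else 0.0

# Hypothetical instance data in the spirit of Figure 1b.
MA = {"postcode": ["M1 5BD", "M1 5EB", "M4 4BF", "M14 5SZ"],
      "price": [200, 200, 250, 300]}
UKD = {"postcode": ["M1 5BD", "M1 5EB", "CB1 2AB", "CB2 8PH"]}

print(candidate_keys(MA))                                     # ['postcode']
print(inclusion_dependency(UKD["postcode"], MA["postcode"]))  # 0.5
```

The 0.5 result mirrors the worked example later in the text, where half of the distinct postcodes in UKD also occur in MA.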
  • Selectively merging pairs of mappings may comprise determining whether the pair of mappings are to be merged using at least one of the source attributes, the candidate keys and the inclusion dependencies. For example, if there is no possibility of inferring a relationship between the source attributes within the pair of mappings, no pairwise merging is possible. In other words, not all intermediate mappings within a set may be paired with intermediate mappings in another set. However, when it is determined that the pair of mappings are to be merged, pairwise merging may comprise selecting a merging which is one of a union, a join or a full outer join of the sources within the pair of mappings. Union, join, and full outer join are standard relational operators that can be applied between two relations to merge them, and they are defined as in "Database Systems: The Complete Book, 2nd Edition" by H. Garcia-Molina et al., Pearson, 2014.
  • Selectively merging pairs of mappings may comprise determining whether a first mapping within the pair of mappings is subsumed by a second mapping in the pair of mappings, i.e. by applying a subsumption method.
  • by subsumed it is meant that the values of one mapping are completely contained within the values of the other mapping.
  • the method may further comprise discarding the first mapping to prevent merging when it is determined to be subsumed by the second mapping. Again, where appropriate, this reduces the number of mappings which are to be merged.
  • the subsumption method may comprise identifying a first set of attributes which are the attributes in the first mapping which match the target attributes and identifying a second set of attributes which are the attributes in the second mapping which match the same target attributes.
  • the method may further comprise selecting the inclusion dependency between the first and second set of attributes having the highest value; and comparing the selected inclusion dependency to a threshold value, e.g., one. This comparison can be used to determine whether or not a first mapping is subsumed with respect to the target by a second mapping or vice versa. When the inclusion dependency equals the threshold value, this means that one mapping is subsumed by the other mapping.
  • the selected inclusion dependency may be between candidate keys of the first and second mappings.
  • the method may further comprise determining whether the first set of attributes match the same target attributes as the second set of attributes, and when there is a match, selectively merging using a union. Alternatively, when they do not match exactly the same target attributes, selectively merging using a full outer join when the selected inclusion dependency between the candidate keys of the two mappings is less than the threshold value, and selectively merging using a join when the selected inclusion dependency equals the threshold value, i.e. when a foreign key relationship can be inferred between the two mappings. In other words, the nature of the selective merging may be dependent on whether or not the first and second sets of attributes match the same or different target attributes.
  • the method may comprise identifying a first set of candidate keys which are the candidate keys in the first mapping which match a subset of target attributes, identifying a second set of candidate keys which are the candidate keys in the second mapping which match a subset of target attributes, determining whether the first set of candidate keys match the same target attributes as the second set of candidate keys, and if there is a match, selectively merging using a full outer join. If there is no match, it may be determined that there is no possible or useful pairwise merging in such a case.
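The merge rules set out above can be condensed into a small decision function. This is an illustrative sketch, not the patent's implementation; it assumes the matched target attribute sets and the best candidate-key overlap have already been computed, and the function name is hypothetical:

```python
def choose_merge(tm1, tm2, key_overlap):
    """tm1, tm2: sets of target attributes matched by each mapping.
    key_overlap: highest inclusion-dependency overlap between the two
    mappings' candidate keys (0.0 if none was found)."""
    if tm1 == tm2:
        return "union"            # same target attributes: union the tuples
    if key_overlap == 1.0:
        return "join"             # total IND: foreign-key relationship inferred
    if key_overlap > 0.0:
        return "full outer join"  # partial IND: correlate what overlaps, keep all
    return None                   # no useful way to relate the mappings

print(choose_merge({"postcode", "price"}, {"postcode", "price"}, 0.0))  # union
print(choose_merge({"postcode"}, {"crime rank"}, 0.5))  # full outer join
```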
  • Generating each set of intermediate mappings may further comprise calculating a fitness value for each intermediate mapping.
  • the fitness value may be calculated using an equation of the form fitness(m) = (matches(m) / arity(t)) × ((T(m) × matches(m) − nulls(m)) / (T(m) × matches(m))), where m is the intermediate mapping, t is the target relation, arity(t) is the total number of attributes in target t, matches(m) is the number of matched attributes that mapping m has w.r.t. target t, nulls(m) is the estimated number of nulls that mapping m will generate on the values of the matched attributes, and T(m) is the estimated size of mapping m, i.e., the estimated number of tuples that will result after running mapping m.
  • the fitness value may be calculated using an estimate of the number of nulls that the intermediate mapping will generate for the target attributes in the target schema which are matched to source attributes in the sources within the intermediate mapping.
  • the method may further comprise comparing calculated fitness values for all intermediate mappings based on the same sources and retaining only the intermediate mapping having the highest fitness value. This reduces the number of intermediate mappings which need to be stored at each iteration.
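A fitness function of the general form described above (rewarding coverage of the target's attributes and penalising expected nulls) might be sketched as follows; the precise combination of terms is an assumption, since the original equation is not reproduced in this text:

```python
def fitness(matches, arity, nulls, tuples):
    """matches: matched target attributes; arity: total attributes in the
    target; nulls: estimated nulls over the matched attributes; tuples:
    estimated result size T(m). Assumed form: coverage of the target scaled
    by the fraction of matched-attribute cells that are non-null."""
    cells = tuples * matches           # cells in the matched attributes
    coverage = matches / arity
    non_null = (cells - nulls) / cells if cells else 0.0
    return coverage * non_null

print(fitness(matches=3, arity=5, nulls=0, tuples=4))  # 0.6
print(fitness(matches=3, arity=5, nulls=6, tuples=4))  # 0.3
```

Under this form, a mapping matching all target attributes with no nulls scores 1.0, and extra nulls strictly lower the score, which is all the retention step below requires.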
  • Figure 1a is a flowchart showing the overall steps of the method;
  • Figure 1b is an example showing a target schema and some input schemas;
  • Figure 2 is a flowchart showing the detail of the step of generating mappings in the method of Figure 1a;
  • Figures 3a and 3b are flowcharts showing the detail of the step of merging mappings in the method of Figure 2;
  • Figure 4 is a flowchart showing the detail of the step of choosing an operator in the method of Figure 3a;
  • Figure 5a is a flowchart showing the detail of a step in the method of Figure 4;
  • Figures 5b and 5c are flowcharts showing the detail of another step in the method of Figure 4.
  • Figure 6 is a flowchart showing the detail of a step which is common to the methods of Figures 5a and 5b; and
  • Figure 7 is a flowchart showing the detail of the step of determining whether a mapping is fittest in the method of Figure 3b.
  • Figure 8 is a schematic block diagram of a system for implementing the described method.
  • Figure 1a illustrates the overall method for integrating data sets, for example by populating a target database with information from a plurality of sources.
  • the target schema t is read and, in step S102, the input schemas with their relations are read.
  • Each input relation has one or more attributes. It will be appreciated that although these steps are shown sequentially, some of the steps could be carried out in different orders.
  • the target schema 20 consists of a single relation labelled "MA_CA_statistics" (which is illustrated as a table in this example) and contains a plurality of attributes: postcode, price, street, income rank and crime rank, which are then to be populated with data.
  • a relation may be considered to be a set of tuples (d1, d2, ..., dn), where each element di is a member of Di, a data domain.
  • the input schema includes: Manchester Agency (MA) 30 which has the following attributes: city, street, postcode and price; Cambridge Agency (CA) 32 which has the following attributes: postcode, city, street, price and county; UK Deprivation (UKD) 34 which has the following attributes: crime rank, crimedecile, county and postcode; and UK quality of life statistics (UKQ) 36 which has the following attributes: county, income rank, economic activity rank and health rank.
  • MA Manchester Agency
  • CA Cambridge Agency
  • UKQ UK quality of life statistics
  • the next step S104 is to create an empty set of sub-solutions for mapping the data in the source schemas into the target schema.
  • Mapping the data is a complex or compound problem because it involves finding mappings involving multiple input relations with attributes that match the same target relation.
  • Each member of the set of sub-solutions represents a method for dividing the complex problem of mapping the data into a collection of simpler subproblems for which it is easier to find a solution.
  • the sub-problems involve pairs of subsets of the initial input relations. The results from each pair of sub-problems are then merged to provide a solution to the complex problem.
  • Steps S106 to S110 show some pre-processing steps which generate profiling data for each source before mapping generation begins.
  • In step S106, the matches between the attributes of the source and target schemas are identified and read. These matches can be thought of as metadata.
  • the candidate keys are identified and read.
  • a candidate key for a relation is a column (or a combination thereof) that has unique values in the relation. For example, in Figure 1b, the candidate keys are identified using the [ck] attribute annotation.
  • the inclusion dependencies are identified and read. The candidate keys and inclusion dependencies can be thought of as metadata on the source.
  • V(R) is the number of distinct values on R.
  • the partial inclusion dependencies: postcode in UKD and county in UKQ are shown.
  • the overlap is the fraction of the values in the dependent attribute that are found in the referenced attribute: the direction of the arrow is from the dependent attribute to the referenced attribute.
  • the overlap from UKD to the source MA with the same attribute is 0.5 (i.e., half of the distinct values within the column overlap: M1 5BD and M1 5EB).
  • the overlap is the same in the opposite direction because each column has four different values.
  • the overlap between the postcode columns in UKD and CA differs because CA has two repeated entries.
  • the profile data may also include statistics, such as the cardinality of each relation, the number of distinct values for each column, and the number of nulls for each column, as required for a fitness function that chooses between candidate mappings.
  • Access is provided to profiling data over the input sources, which could be produced using a tool such as Metanome ("Data profiling with Metanome" by Papenbrock et al., published in Proc. VLDB Endow., 8(12):1860-1863, August 2015) or SINDY ("Scaling out the discovery of inclusion dependencies" by Kruse et al., published in Datenbanksysteme für Business, Technologie und Web (BTW), 2015).
  • the problem can be defined as: given a collection of source schemas S, each source schema with many relations (i.e., tables 30 to 36 in Figure 1b), a target relation T (i.e., table 20 in Figure 1b), metadata on the sources obtained through data profiling (including candidate keys and (partial) inclusion dependencies), and metadata in the form of matches between the source relations and the target, generate a set of candidate mappings to populate the target relation with instance data from the source schemas.
  • In step S112, a set of initial mappings is created in which, for each source relation, a 1-to-1 mapping is created, i.e., each 1-to-1 mapping maps only one source relation to the target relation, without merging it with other source relations.
  • This step is considered to be the creation of the base sub-solutions for the mapping problem, i.e., if the input relations cannot be merged, then the algorithm will be able to output the base solution.
  • Each element in initMaps is a sub-solution for each of the source relations that have at least one match with the target relation.
  • Each sub-solution is a solution to a sub-problem which is trying to find a mapping for fewer relations than in the initial input. Each sub-problem thus represents an iteration in the mapping generation process.
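The creation of the base 1-to-1 sub-solutions (step S112) can be sketched as follows; the mapping representation and names are illustrative assumptions, not the patent's data structures:

```python
def init_maps(matches):
    """One 1-to-1 mapping per source relation with at least one match.
    matches: {source relation name: set of matched target attributes}."""
    return [{"sources": frozenset([name]), "targets": tgt}
            for name, tgt in matches.items() if tgt]

maps = init_maps({"MA": {"postcode", "price", "street"},
                  "UKQ": {"income rank"},
                  "Noise": set()})   # no matches with the target: excluded
print(len(maps))  # 2
```

Only relations with at least one match contribute a base sub-solution, mirroring the statement that each element of initMaps corresponds to a source relation with at least one match to the target.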
  • the method determines the number of the last iteration, i.e. finaliterationidx, which is equal to the number of input relations.
  • the determined finaliterationidx is an input to the recursive algorithm which is used to generate the mappings - genMaps - step S118.
  • the output of this algorithm consists of all the subsolutions which are generated and retained in all iterations of the algorithm.
  • these sub-solution mappings can be outputted (step S120) and the method ends.
  • a selection of these intermediate mappings may also be output.
  • the final set of mappings comprises at least some of the mappings from the sets of intermediate mappings generated at each iteration.
  • the table below shows how the inputs in Figure 1b could be combined to create two complete target rows. It will be appreciated that this is not a complete output of a mapping and is merely provided for illustration.
  • the first three data entries may be from the MA table and the other two data entries from the UKD and UKQ tables respectively.
  • the first three data entries may be from the CA table and the other two data entries from the UKD and UKQ tables respectively.
  • the candidate key of the postcode may be used to merge the MA/CA and UKD tables and the candidate key of county to merge the UKD and UKQ tables, as explained in more detail below.
  • Figure 2 shows the detail of the recursive algorithm named“genMaps” which is the top-level algorithm that uses dynamic programming to divide the complex problem into a series of subproblems.
  • Dynamic programming is an approach to solving a complex problem by dividing it into a series of subproblems.
  • the algorithm will try to break down the complex problem, i.e., finding a mapping for all initial sources (finaliterationidx), into smaller sub-problems, i.e., finding mappings for fewer initial relations by recursively calling genMaps with a parameter idx < finaliterationidx.
  • When genMaps for iteration 1 (genMaps(1)) is called, it means that the base solution has been reached, and the initial mappings that were previously set as the sub-solution for iteration 1 will be returned. By trying to build the mapping for the final iteration, intermediate sub-solution mappings need to be created and used in subsequent iterations.
  • Given N initial source relations, in each iteration idx, 1 ≤ idx ≤ N, the algorithm searches for the best way to merge any idx source relations.
  • the mappings that result from combining each subset of source relations characterize new relations that are referred to as intermediate mappings, the collection of which comprises the solution at iteration idx.
  • the flowchart can also be expressed as an algorithm as follows:
  • In step S200, the input to the algorithm is idx (the iteration index) and the output collection of mappings is termed "resultMaps".
  • the set of mappings is empty (step S202).
  • mappings which merge all or subsets of the initial source relations can be obtained from the collection of sub-solutions.
  • the schema of the output mappings is the same as the schema of the target relation, though some attributes may contain only null values.
  • m2_2 ← merge(UKQ, UKD)
  • m2_3 ← merge(CA, UKD)
  • the algorithm will try to merge (m2_1, m2_2), (m2_2, m2_3) and (m2_1, m2_3). It can be observed that by merging m2_1 with m2_2, a mapping that incorporates all initial sources will be obtained.
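The recursive, memoised structure of genMaps can be sketched as follows. This is a simplified illustration under stated assumptions: mappings are reduced to sets of source relation names, and the toy merge function stands in for the mergeMaps algorithm described below:

```python
def gen_maps(idx, sub_solutions, merge_maps):
    """Build the sub-solution for iteration idx by merging pairs of
    sub-solutions from earlier iterations i and idx - i."""
    if idx in sub_solutions:            # base case: iteration 1 is pre-seeded
        return sub_solutions[idx]
    result = []
    for i in range(1, idx // 2 + 1):
        batch1 = gen_maps(i, sub_solutions, merge_maps)
        batch2 = gen_maps(idx - i, sub_solutions, merge_maps)
        result.extend(merge_maps(batch1, batch2))
    sub_solutions[idx] = result         # memoise (dynamic programming)
    return result

# Toy merge: mappings are just sets of source relations, mergeable when disjoint.
def toy_merge(batch1, batch2):
    return [m1 | m2 for m1 in batch1 for m2 in batch2 if not (m1 & m2)]

base = {1: [frozenset("A"), frozenset("B"), frozenset("C")]}
print(frozenset("ABC") in gen_maps(3, base, toy_merge))  # True
```

Because each iteration's result is stored in sub_solutions, every sub-problem is solved only once, which is the dynamic-programming property the text describes.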
  • Figures 3a and 3b show the detail of the algorithm named "mergeMaps" which implements the steps that take place within an iteration of the main loop of genMaps shown in Figure 2.
  • the detailed algorithm can also be expressed as follows:
  • the inputs to the mergeMaps algorithm are batch1 and batch2 and the output is iMaps (step S300).
  • the initial step is that iMaps is empty (step S302).
  • i is set to 0 to perform the iterative algorithm as explained below.
  • A determination is made as to whether i is smaller than the size of batch1 (step S306); if it is not, iMaps is output (step S309). If it is, j is set to 0 (step S308) and a determination as to whether j is smaller than the size of batch2 is made (step S310).
  • Step S311 represents the creation of the pairs of mappings from the two batches by pairing each mapping i from batch1 with all mappings j from batch2.
  • MERGEMAPPINGS uses another algorithm, chooseOperator (line 5 and step S312), to choose an operator with which to combine map_i and map_j, from batch1 and batch2, respectively, if they can usefully be merged (in Figure 3a, these maps are labelled batch1[i] and batch2[j]).
  • map_i and map_j represent pairs of mappings from batches of mappings from two previous iterations.
  • the chooseOperator algorithm is described in more detail below. If the mappings cannot be usefully combined, i.e., if at step S314 the output is null, j is incremented by 1 and steps S310, S312 and S314 are repeated.
  • If the mappings can be usefully combined, the new mapping is generated (line 7 and step S318) and the metadata for the resulting intermediate mapping is computed (line 8 and step S320, beginning on Figure 3b).
  • the intermediate mapping is retained if it has the highest fitness of any mapping involving the same source relations (lines 9-10 and steps S322 to S326).
  • the isFittest function selects the mapping with the highest fitness and is explained in more detail below. It can be defined in different ways using table statistics and profiling data, for example by preferring candidate mappings with fewer nulls.
  • In step S324, if the new mapping satisfies the isFittest function, the old mappings that stem from the same initial relations as iMap are removed (step S326) and the new iMap is added to the set of retained iMaps (step S328). However, if the new mapping does not satisfy the conditions in the isFittest function, the mapping is not retained, j is incremented by 1, and steps S310, S312 and S314 are repeated. Both loops terminate when all pairs of mappings are created and merged.
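The mergeMaps loop (pairing mappings from two batches, choosing an operator, and retaining only the fittest mapping per set of source relations) can be sketched as follows; the mapping representation and the toy chooseOperator and fitness functions passed in are illustrative assumptions:

```python
def merge_maps(batch1, batch2, choose_operator, fitness):
    """Pair every mapping in batch1 with every mapping in batch2; keep, for
    each set of source relations, only the fittest merged mapping."""
    imaps = {}                  # keyed by the set of source relations involved
    for m1 in batch1:
        for m2 in batch2:
            op = choose_operator(m1, m2)
            if op is None:                       # no useful merge: skip the pair
                continue
            new = {"sources": m1["sources"] | m2["sources"], "op": op}
            key = frozenset(new["sources"])
            if key not in imaps or fitness(new) > fitness(imaps[key]):
                imaps[key] = new                 # isFittest: replace weaker rival
    return list(imaps.values())

b1 = [{"sources": {"MA"}, "op": None}]
b2 = [{"sources": {"UKD"}, "op": None}, {"sources": {"MA"}, "op": None}]
out = merge_maps(
    b1, b2,
    choose_operator=lambda a, b: "join" if a["sources"] != b["sources"] else None,
    fitness=lambda m: len(m["sources"]))
print(len(out), sorted(out[0]["sources"]))  # 1 ['MA', 'UKD']
```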
  • Figure 4 shows the detail of the algorithm named "chooseOperator" which is used in the mergeMaps algorithm shown in Figure 3a to combine two intermediate mappings taking into account how these relate to the given target table.
  • the detailed algorithm can be expressed as follows:
  • function chooseOperator(map1, map2)
  • the inputs to the algorithm are iMap1 and iMap2 (i.e., 2 mappings) and the output is the op (operator) as set out in step S400.
  • the operator is set to NULL value.
  • the sets of matched target attributes are retrieved for each mapping w.r.t. the target (steps S404 and S406, lines 3 to 4 above).
  • the input in step S404 is thus the matches of the mapping iMap1 with the attributes of the target, and similarly the input in step S406 is the matches of the mapping iMap2 with the attributes of the target.
  • the outputs are labelled tm1 (map1_ma) and tm2 (map2_ma) respectively, representing the two sets of matched target attributes.
  • the two outputs are then compared (step S408) and if they do not match, i.e., if the two mappings match different attributes within the target schema, they are considered candidates for joining by the algorithm GenOperatorDiff (Step S410 and line 10). Alternatively, if the outputs match and thus the two mappings match the same attributes within the target schema, they are considered candidates for unioning by the algorithm GenOperatorSame (Step S412 and line 7).
  • the matches for the MA relations are thus postcode, price and street, and CA also has the same matches so these are candidates for GenOperatorSame.
  • UKD has different matches, namely postcode and crime_rank to those of MA, and thus MA and UKD are candidates for joining by the algorithm GenOperatorDiff.
  • chooseOperator will either be an operator that represents the way to merge two input mappings or NULL if no useful way could be found to merge them with respect to the target relation. If the algorithm merges the two mappings, then the resulting mapping that is generated will be the result of union, join or full outer join. Another possible action is to discard subsumed mappings; a relation A is subsumed by a relation B if A ⊆ B, so the subsumed relation need not be considered in subsequent merges.
  • Figure 5a shows the detail of the algorithm named“genOperatorSame” which is used in the chooseOperator algorithm shown in Figure 4 to decide which operator to apply between two source relations when they match identical target attributes.
  • the detailed algorithm can also be expressed as follows:
  • the inputs to the algorithm are iMap1 and iMap2 (i.e., 2 mappings) and the output is the op (operator) as set out in step S500.
  • the operator is set to NULL value.
  • An algorithm entitled“isSubsumed” (line 2, step S504, the algorithm is described in more detail in Figure 6) tests whether the values for the matched attributes from the first relation are fully contained in the attributes in the second relation, i.e., are subsumed.
  • the inputs are also iMap1 and iMap2 and the output is the subsumed mapping out of the two given as input (subsumedMap).
  • the output subsumedMap will not be null and no merge will take place because one relation is subsumed by the other relation and thus the subsumed relation can be safely discarded (lines 4 and 5 and step S510).
  • the chosen operator is the union of the two projected mappings on the attributes that are needed in the target and that both mappings match. In other words, the union of iMapl (also expressed as map1) and iMap2 (also expressed as map2) is returned.
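The genOperatorSame logic (discard a subsumed mapping, otherwise union) can be sketched as follows; the max_ind helper is an assumed stand-in for the isSubsumed test based on inclusion-dependency overlaps between the two mappings' matched attributes:

```python
def gen_operator_same(map1, map2, max_ind):
    """For mappings that match the same target attributes.
    max_ind(a, b): highest inclusion-dependency overlap from a's matched
    attributes into b's; 1.0 means a's values are fully contained in b's."""
    if max_ind(map1, map2) == 1.0:
        return None, map1            # map1 subsumed by map2: discard it, no merge
    if max_ind(map2, map1) == 1.0:
        return None, map2            # map2 subsumed by map1
    return "union", None             # otherwise union the two mappings

op, discarded = gen_operator_same("CA", "MA", max_ind=lambda a, b: 0.5)
print(op, discarded)  # union None
```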
  • Figures 5b and 5c show the detail of the algorithm named "genOperatorDiff" which is used in the chooseOperator algorithm shown in Figure 4 to decide which operator to apply between mappings that do not match the same target attributes, i.e., the sets of matched attributes can either be disjoint or have common attributes.
  • the detailed algorithm can also be expressed as follows:
  • map1_ind ← getMaxInd(map1_keys, map2.attributes)
  • map2_ind ← getMaxInd(map2_keys, map1.attributes)
  • the inputs and output of the genOperatorDiff algorithm are the same as those to the genOperatorSame algorithm of Figure 5a.
  • steps S552, S554, S556 and S560 correspond to steps S502, S504, S506 and S510 of Figure 5a.
  • the subsumed relation is discarded and the method returns null as no operator was found that yields useful information. If the subsumption conditions are not met, the algorithm tries to merge the input relations.
  • the candidate keys, i.e., iMap1.ckeys and iMap2.ckeys (also expressed as map1_keys and map2_keys), are retrieved from both mappings.
  • the algorithm chooses the IND with the highest overlap (step S562 and line 10), i.e., the maximum is chosen (getMaxIND).
  • the algorithm checks whether an inclusion dependency has been found (line 11 and step S564).
  • the level of overlap is checked, i.e., compared to 1 (step S566, line 12). If the overlap is 1 , the chosen operator is a join because a foreign key relationship has been inferred between the two input mappings on their candidate key attributes (step S568 and line 13). Alternatively, if the overlap is between 0 and 1 (but not equal to 0 or 1), the chosen operator is a full outer join because a foreign key relationship could not be inferred, but some tuples can be correlated due to the overlap between the candidate keys.
  • the algorithm merges the two mappings by applying a full outer join, thus joining the tuples that can be joined while keeping all the data (step S570 and line 15).
  • the join condition will compare the candidate key attributes that are involved in the chosen inclusion dependency, IND.
  • in step S572, the inclusion dependencies (IND) for all candidate keys in the first source and the attributes in the second source are identified and the IND with the highest overlap is chosen, i.e., the maximum is chosen (getMaxIND with an output ind1).
  • in step S574, the inclusion dependencies (IND) for all candidate keys in the second source and the attributes in the first source are identified and the IND with the highest overlap is chosen, i.e., the maximum is chosen (getMaxIND with an output ind2).
  • the outputs from steps S572 and S574 are compared to choose the overall maximum at step S576 (line 19).
  • in step S578, there may be no inclusion dependencies between key and non-key attributes and thus the step of checking whether an inclusion dependency has been found is repeated (line 20). If there is an inclusion dependency, steps S566 to S570 of Figure 5c are repeated to check the overlap. If the overlap is 1, a join is used as before (step S568 and line 22). Alternatively, as illustrated in the example, the value of the partial inclusion dependency is 0.5 and thus the result of the merge will be a mapping: OuterJoin(MA, UKD, <MA.postcode, UKD.postcode>) (step S570 and line 24).
  • the algorithm retrieves the candidate keys from both relations that match the target attributes.
  • the candidate keys from the first source are retrieved (iMap1.ckeys or map1_keys) as the input and the ones which match the target attributes are output (tckeys1 or map1_mk) as set out in step S580 and line 26.
  • the candidate keys from the second source are retrieved (iMap2.ckeys or map2_keys) as the input and the ones which match the target attributes are output (tckeys2 or rel2_mk) as set out in step S582 and line 27.
  • These output keys are compared in step S584 (line 29) and if they match, the two relations are merged by applying the full outer join operation as shown in step S586 where the join is on the attributes that meet the above mentioned condition (line 30). Otherwise, no operator which can join the sources is output and the operator is output as null.
  • FIG. 6 shows the detail of the algorithm “isSubsumed” which is used as described above.
  • the inputs are iMap1 and iMap2 and the output is the subsumed mapping of the two given as input, i.e., subsumedMap will be equal to iMap1 or iMap2.
  • subsumedMap is set to NULL (step S602).
  • in steps S604 and S606, the attributes from each of the first and second mappings which match the target schema are identified as sa1 and sa2 respectively. It will be appreciated that although shown as two separate steps, these steps could be carried out simultaneously or in a different order.
  • the inclusion dependencies (IND) from the matched attributes for the first mapping to the matched attributes for the second mapping are identified and the IND with the highest overlap is chosen, i.e., the maximum is chosen (getMaxIND with an output ind).
  • in step S610, the level of overlap is next checked, i.e., compared to 1. If the overlap is 1, the subsumedMap is set as the first intermediate mapping (step S612). Alternatively, if the overlap is not 1, the inclusion dependencies (IND) from the matched attributes for the second mapping to the matched attributes for the first mapping are identified and the IND with the highest overlap is chosen (step S614). The level of overlap is checked again (step S616) and if the overlap is 1, the subsumedMap is set as the second intermediate mapping (step S618). Alternatively, if the overlap is not 1, the subsumedMap remains NULL and it is returned because neither mapping is subsumed by the other.
  • FIG. 7 shows the detail of the algorithm “IsFittest” which is used as described above.
  • Each intermediate mapping will contain values that are of different quality with respect to the chosen target relation, so, in order to prune the search space, the “IsFittest” algorithm eliminates the intermediate mappings that have poor quality in comparison with other intermediate mappings which stem from the same initial source relations.
  • the algorithm will always keep the best way of merging a subset of source relations by keeping only the intermediate mapping with the highest fitness result.
  • the fitness result will be computed after the new intermediate mapping is created.
  • the inputs are iMap, oldMaps and iterationIdx and the output is a Boolean as to whether or not the mapping “isFittest”. Initially, the output is set to FALSE (step S702).
  • in step S704, the function getMapsWithInitRels is applied; in other words, all the intermediate mappings which stem from the same initial source relations are retrieved.
  • the inputs are thus iMap and subsolutions[iterationIdx] and the output is memoizedMaps.
  • this output is added to the set of existing intermediate maps, i.e., oldMaps (step S706).
  • a fitness function for each generated mapping is then calculated for example using the function below:
  • mapping m is the intermediate mapping
  • t is the target relation
  • arity(t) is the total number of attributes in target t
  • matches(m) is the number of matched attributes that mapping m has w.r.t. target t
  • nulls(m) is the estimated number of nulls that mapping m will generate on the values of the matched attributes
  • T(m) is the estimated size of mapping m, i.e., the estimated number of tuples that will result after running mapping m.
  • the fitness function has as input a generated mapping and the output will represent the ratio of non-NULL cells, i.e., meaningful values, that will be considered to populate the target table t. It is noted that the calculated term (arity(t)-matches(m))*T(m) will add the necessary NULLs to populate the target t if mapping m does not match all target attributes.
  • mapping m is expected to produce an estimate of 62.5% non-NULL values for target relation t. This low value is mostly due to missing one match to the target which means that the result of mapping m needs a padding of 10 NULLs for the unpopulated target column.
  • once the fitness results have been calculated, they are compared at step S708. If the iMap fitness is greater than the oldMaps fitness, the output is TRUE (step S710) and the new map is retained. Alternatively, if the iMap fitness is less than the fitness of previously retained mappings, the output is FALSE (step S712) and the new intermediate mapping is not retained.
  • FIG. 8 is a schematic illustration of such a hardware system which can be used to implement the data integration method described above.
  • the system may comprise a computing device 10, which may be any suitable device, e.g. a user device or a server, connected to a plurality of databases 20a, 20b to 20n, each of which contains at least one data source 22a, 22b to 22m.
  • the number of databases and sources shown in the Figure is merely for illustration.
  • the plurality of data sources may be all stored on a single database, in one or more schemas.
  • some or all of the databases being integrated could be stored on the computing device itself.
  • the computing device 10 comprises the standard components such as a processor 12, user interface 16 and memory 14.
  • a mapping module 40 is also illustrated which stores and/or implements the algorithms for the data integration method outlined above.
  • the memory 14 may be used to store the intermediate data generated by implementing the method, e.g. any intermediate mappings, fitness scores, inclusion dependencies, metadata and so on.
  • Terms such as ‘component’, ‘module’ or ‘unit’ used herein may include, but are not limited to, a hardware device, such as circuitry in the form of discrete or integrated components, a Field Programmable Gate Array (FPGA) or Application Specific Integrated Circuit (ASIC), which performs certain tasks or provides the associated functionality.
  • the described elements may be configured to reside on a tangible, persistent, addressable storage medium and may be configured to execute on one or more processors.
  • These functional elements may in some embodiments include, by way of example, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.


Abstract

A method and a system for integrating data sets are disclosed. The present techniques relate to populating a target database with information from a plurality of sources. The problem of mapping the data has been divided into a collection of simpler sub-problems which involve merging pairs of subsets of the input sources. Iteratively solving the problem comprises solving a base sub-solution in which a minimum number of source relations are used, e.g. one, to create a base set of intermediate mappings. The base sub-solution may be used to build a first set of intermediate mappings. The base set of intermediate mappings and the first set of intermediate mappings may be used to build a second set of intermediate mappings. The final set of mappings which is generated by the iterative process may thus comprise a selected set of intermediate mappings which were generated in previous steps and may comprise each of the mappings generated in each iteration.

Description

Method and System for Integrating Data Sets
Technical field
[0001] The invention relates to a method and system for integrating data sets, for example by populating a target database with information from a plurality of sources.
Background
[0002] There exist known algorithms for generating mappings for populating a target database schema from source schemas. Such algorithms may be informed by different information, such as schema matches, integrity constraints or instance data. One such example is the Clio project as described in US7010539 and as reviewed for example in “Clio: Schema mapping creation and data exchange” by Fagin et al published in Conceptual Modeling: Foundations and Applications, volume 5600 of LNCS, pages 198-236. This project has given rise to comprehensive results on subjects such as data exchange, merging of mappings, debugging of mappings and mapping verification. Examples are discussed in “Foundations of Data Exchange” by Arenas et al published by Cambridge University Press in 2014; “Mapmerge: correlating independent schema mappings” by Alexe et al published in VLDB J., 21(2):191-211, 2012; “Debugging schema mappings with routes” by Chiticariu et al published in Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12-15, 2006, pages 79-90, 2006; and “Schema mapping verification: the spicy way” by Bonifati et al published in EDBT 2008, 11th International Conference on Extending Database Technology, Nantes, France, March 25-29, 2008, Proceedings, pages 85-96, 2008.
[0003] This line of research has tended to focus on supporting experts who are developing mappings between a single source schema and target schema, where the source and the target are well designed database schemas (for example, with declared keys and foreign keys). In practice, this tends to mean that mapping generation can benefit from precise and exhaustive descriptions of relationships within the source schemas, as well as human-curated matches between the source and the target. In such a setting, the goal is to provide tool support that helps experts curate matches and develop mappings between (potentially complex) source and target schemas for example as described in “Information integration in the enterprise” by Bernstein et al published in CACM, 51(9):72-79, 2008.
[0004] More recently, the development of techniques for web data extraction, the publication of extensive open data sets, and the adoption of data lakes means that organisations typically have access to numerous sources in a domain of interest, and a requirement to integrate the key data on a topic at manageable cost. In such a setting, the relevant data may come from many tables, from many independently developed data sources, and thus the relationships between these sources and the ways they can be used together are unlikely to be declared in the sources. Instead, relationships such as inclusion dependencies between sources may have to be inferred from data profiling for example as described in “Profiling relational data: a survey” by Abedjan et al as published in VLDB J., 24(4):557-581, 2015, and may be partial. Such diverse relationships give rise to a large space of candidate mappings, which are likely of variable utility. Where there can be arbitrary numbers of source schemas, there tends to be less reliable or comprehensive information with which to inform mapping generation, and the focus has tended to shift toward managing the resulting uncertainty that arises from many alternative mappings.
[0005] Some proposals use feedback on results to identify which of the generated mappings are producing appropriate values. For example, given an initial set of mappings between individual sources and a target that could have been produced by Clio, “Incrementally improving dataspaces based on user feedback” by Belhajjame et al published in Inf. Syst., 38(5):656-687, 2013 explores how these mappings can be combined, for example by union, intersection or difference, to produce new mappings that may have higher precision or recall than their predecessors with respect to user feedback on the correctness of result tuples. As such, in this proposal the meta-data about the target schema may include tuples annotated to the effect that they are true positives, false positives or false negatives, and a search seeks to identify sets of mappings that together maximise precision subject to some recall threshold (or vice versa). In “Learning to create data-integrating queries” by Talukdar et al published in PVLDB, 1(1):785-796, 2008, the user provides additional information about the meta-data for the target schema in the form of example values, and subsequently feedback on results of generated mappings. As feedback is obtained on the correctness or otherwise of result tuples, this is used to refine the scores of candidate mappings. As such, as in Belhajjame et al., mappings are not assumed to be correct, but rather are scored by users based on the suitability of their results.
[0006] Some other proposals have sought to fully automate mapping generation at scale, for example motivated by web scale integration problems. For example, “Bootstrapping pay-as-you-go data integration systems” by Sarma et al published in Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, pages 861-874, 2008 describes an approach to generating mappings for a probabilistic mediated schema. Here there is no pre-existing target schema, and the technical focus is on aligning source attributes with counterparts in a mediated schema, and not so much on the generation of mappings that may combine data from different sources. Also seeking to operate at web scale, “Schema clustering and retrieval for multi-domain pay-as-you-go data integration systems” by Mahmoud and Aboulnaga published in Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010, pages 411-422, 2010 clusters single table sources, and then maps keyword queries to the domains represented by the clusters. In such approaches, there is an attempt to provide some measure of integration with little additional information about the user’s requirements (e.g. the target schema may not be provided) and with little additional information about the sources (e.g. the meta-data about the source schema may be minimal). In such a setting there is little evidence with which to consider how to combine data from different sources, and thus mappings tend not to be expressive. [0007] There is some other work that seeks to combine data from multiple structured data sources.
For example, “Data curation at scale: The data tamer system” by Stonebraker et al published in CIDR 2013, Sixth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 6-9, 2013, Online Proceedings, 2013 targets the integration of significant numbers of enterprise data sets and is also described in WO2015/148304. In relation to mapping, the approach described in that work focuses the integration effort at the instance-level, through entity resolution and fusion, with ongoing human input, for example in the form of training data. The goals of the Data Civilizer project seem similar, although with a greater emphasis on discovery and cleaning as described in “A demo of the data civilizer system” by Fernandez et al published in Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pages 1639-1642, 2017. There is also research being carried out within the VADA project as described in “The VADA architecture for cost-effective data wrangling” by Konstantinou et al published in Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pages 1599-1602, 2017. This research has initially been using the paper by Bonifati above to generate mappings between small groups of sources and the target, which are then subject to mapping selection.
[0008] The applicant has recognised the need for an improved method and system which is able to explore the space of candidate mappings, making informed decisions as to which tables should be combined and how, based on the available matches and profiling data.
Summary
[0009] According to the present invention there is provided an apparatus and method as set forth in the appended claims. Other features of the invention will be apparent from the dependent claims, and the description which follows.
[0010] We describe a data integration method for populating a target database with data from a plurality of source databases with each source relation in a source database having a plurality of source attributes, the method comprising: inputting a target schema for the target database, wherein the target schema comprises a plurality of target attributes; generating a set of mappings which populates the target database with data from at least a subset of the plurality of source relations by defining a plurality of sub-solutions wherein for each sub-solution a different number of the source relations are merged and each sub-solution comprises a set of intermediate mappings to map data from at least a subset of the source relations to the target schema using source attributes of each source relation which match target attributes in the target schema; iteratively solving each of the plurality of sub-solutions to generate, at each iteration, a set of intermediate mappings which maps more source relations to the target schema than a set of intermediate mappings generated in the previous iteration by merging pairs of mappings from each set of intermediate mappings generated in the previous iteration, and outputting as the set of mappings at least some of the mappings within the sets of intermediate mappings which are generated in previous iterations. [0011] The output mappings can then be processed by a subsequent step such as mapping selection. Through mapping selection, the preferred mappings are selected based on various features such as size of output, fitness of result, etc. Examples of such techniques can be found in the work of Edward Abel et al. in “User driven multi-criteria source selection” published in Information Sciences, Volumes 430-431, 2018, Pages 179-199, and Khalid Belhajjame et al, “Feedback-based annotation, selection and refinement of schema mappings for dataspaces”, 13th International Conference on Extending Database Technology, ACM, 2010: 573-584.
[0012] In other words, the complex problem of mapping the data has been divided into a collection of simpler sub-problems which involve merging pairs of subsets of the input sources. Iteratively solving the complex problem comprises solving a base sub-solution in which a minimum number of source relations are used, e.g. one, to create a base set of intermediate mappings. The next step may comprise using the base sub-solution to build a first set of intermediate mappings, e.g. by selectively merging pairs of mappings from the base set of intermediate mappings. The next step may then comprise using both the base set of intermediate mappings and the first set of intermediate mappings to build a second set of intermediate mappings, for example by selectively merging pairs of mappings from the base set with the first set. The final set of mappings which is generated by the iterative process may thus comprise a selected set of intermediate mappings which were generated in previous steps and may comprise each of the mappings generated in each iteration. The final set of mappings may thus be a set of best-effort mappings which might include an overall solution which merges all input source relations but more typically comprises a subset of the source relations in light of overlap between the source relations.
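The divide-and-merge scheme described in this paragraph can be sketched as follows. This is a simplified illustration under our own assumptions: a mapping is reduced to the set of source relations it covers, and merge_maps is a stand-in for the real operator-selection logic (here it simply requires the two mappings to cover disjoint sources).

```python
# Illustrative sketch only: a "mapping" is the frozenset of source relations
# it covers; merge_maps stands in for the real operator-selection logic.

def merge_maps(map1, map2):
    """Toy merge: succeed only when the mappings cover disjoint source relations."""
    if map1 & map2:
        return None  # overlapping source relations: no useful pairwise merge
    return map1 | map2

def generate_mappings(sources):
    # subsolutions[k] holds the intermediate mappings that merge k source relations
    subsolutions = {1: {frozenset([s]) for s in sources}}  # base sub-solution
    for k in range(2, len(sources) + 1):
        level = set()
        # merge pairs of mappings drawn from the smaller sub-solutions i and k - i
        for i in range(1, k // 2 + 1):
            for m1 in subsolutions[i]:
                for m2 in subsolutions[k - i]:
                    merged = merge_maps(m1, m2)
                    if merged is not None:
                        level.add(merged)
        subsolutions[k] = level
    # output the intermediate mappings generated across all iterations
    return [m for maps in subsolutions.values() for m in maps]

maps = generate_mappings(["MA", "CA", "UKD", "UKQ"])
```

With the four example sources every non-empty subset survives here, because the toy merge never fails; the described method instead prunes via subsumption and fitness.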
[0013] The method may comprise identifying a plurality of candidate keys within the plurality of sources, wherein each candidate key is a single or a plurality of source attributes. The method may further comprise calculating inclusion dependencies for the sources, wherein an inclusion dependency is a measure of the overlap between the values of two sets of source attributes. The candidate keys and/or the inclusion dependencies may be inputs to the method. The candidate keys and the inclusion dependencies may be considered to be profiling data and/or metadata for the sources. Each candidate key may be defined as a set of source attributes which has unique values within that source relation, i.e. a candidate key for a relation in the form of a table with rows and columns may be a column (or a combination thereof) that has unique values in the relation. An inclusion dependency may be partial, i.e., there is some overlap between the values of a set of source attributes and the values of another set of attributes, or total, i.e., the values of one set are fully contained by the values of another set of attributes.
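By way of illustration, candidate keys can be detected by checking value uniqueness. The sketch below is our own (the function name is not from the specification) and finds the minimal column combinations whose value tuples are unique within a relation.

```python
# Hypothetical profiling sketch: a candidate key is a column, or minimal
# combination of columns, whose value tuples are unique in the relation.
from itertools import combinations

def candidate_keys(relation):
    """relation: dict mapping column name -> list of values (all equal length)."""
    cols = list(relation)
    n_rows = len(next(iter(relation.values())))
    keys = []
    for size in range(1, len(cols) + 1):
        for combo in combinations(cols, size):
            if any(set(key) <= set(combo) for key in keys):
                continue  # a subset is already a key, so combo is not minimal
            rows = list(zip(*(relation[c] for c in combo)))
            if len(set(rows)) == n_rows:  # all value tuples are unique
                keys.append(combo)
    return keys

ma = {"street": ["Oak Rd", "Elm St", "Oak Rd"],
      "postcode": ["M1 1AA", "M2 2BB", "M3 3CC"],
      "city": ["Manchester"] * 3}
assert candidate_keys(ma) == [("postcode",)]  # only postcode is unique
```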
[0014] Selectively merging pairs of mappings may comprise determining whether the pair of mappings are to be merged using at least one of the source attributes, the candidate keys and the inclusion dependencies. For example, if there is no possibility of inferring a relationship between the source attributes within the pair of mappings, no pairwise merging is possible. In other words, not all intermediate mappings within a set may be paired with intermediate mappings in another set. However, when it is determined that the pair of mappings are to be merged, pairwise merging may comprise selecting a merging which is one of a union, a join or a full outer join of the sources within the pair of mappings. Union, join, and full outer join are standard relational operators that can be applied between two relations to merge them, and they are defined as in “Database Systems: The Complete Book, 2nd Edition” by H. Garcia-Molina et al, Pearson, 2014.
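For illustration, the full outer join operator over relations held as lists of dictionaries can be sketched as follows; this is a minimal stand-in for the standard operator, not the claimed merge procedure itself.

```python
# Minimal full outer join over relations held as lists of dicts (a sketch of
# the standard relational operator).

def full_outer_join(rel1, rel2, key1, key2):
    result, matched2 = [], set()
    for t1 in rel1:
        joined = False
        for i, t2 in enumerate(rel2):
            if t1[key1] == t2[key2]:
                result.append({**t1, **t2})  # correlated tuples are combined
                matched2.add(i)
                joined = True
        if not joined:
            result.append(dict(t1))  # keep t1; missing attributes stay NULL
    for i, t2 in enumerate(rel2):
        if i not in matched2:
            result.append(dict(t2))  # keep unmatched tuples from rel2 too
    return result

ma = [{"postcode": "M1 1AA", "price": 250000}]
ukd = [{"postcode": "M1 1AA", "crime rank": 12},
       {"postcode": "CB1 1DD", "crime rank": 3}]
rows = full_outer_join(ma, ukd, "postcode", "postcode")
```

This mirrors the OuterJoin(MA, UKD) example used later in the description: the correlated postcode joins, while the unmatched Cambridge tuple is kept with no price.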
[0015] Selectively merging pairs of mappings may comprise determining whether a first mapping within the pair of mappings is subsumed by a second mapping in the pair of mappings, i.e. by applying a subsumption method. By subsumed, it is meant that the values of one mapping are completely contained within the values of the other mapping. The method may further comprise discarding the first mapping to prevent merging when it is determined to be subsumed by the second mapping. Again, where appropriate, this reduces the number of mappings which are to be merged.
[0016] The subsumption method may comprise identifying a first set of attributes which are the attributes in the first mapping which match the target attributes and identifying a second set of attributes which are the attributes in the second mapping which match the same target attributes. The method may further comprise selecting the inclusion dependency between the first and second set of attributes having the highest value; and comparing the selected inclusion dependency to a threshold value, e.g., one. This comparison can be used to determine whether or not a first mapping is subsumed with respect to the target by a second mapping or vice versa. When the inclusion dependency equals the threshold value, this means that one mapping is subsumed by the other mapping. The selected inclusion dependency may be between candidate keys of the first and second mappings.
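The subsumption test can be sketched as below, under the assumption that each mapping's values on the matched attributes are available as lists of tuples; the function names are ours, not from the specification.

```python
# Sketch of the subsumption test: a mapping is subsumed when its values on
# the matched attributes are fully contained (inclusion dependency overlap
# of 1) in the other mapping's values.

def overlap(vals_a, vals_b):
    a, b = set(vals_a), set(vals_b)
    return len(a & b) / len(a) if a else 0.0

def subsumed_map(map1_vals, map2_vals):
    """Return 1 or 2 for the subsumed mapping, or None if neither is subsumed."""
    if overlap(map1_vals, map2_vals) == 1.0:
        return 1  # every value of the first mapping appears in the second
    if overlap(map2_vals, map1_vals) == 1.0:
        return 2  # every value of the second mapping appears in the first
    return None

assert subsumed_map([("M1", "A")], [("M1", "A"), ("M2", "B")]) == 1
assert subsumed_map([("M1",), ("M3",)], [("M1",), ("M2",)]) is None
```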
[0017] The method may further comprise determining whether the first set of attributes match the same target attributes as the second set of attributes, and when there is a match, selectively merging using a union. Alternatively, when they do not match exactly the same target attributes, selectively merging using a full outer join when the selected inclusion dependency between the candidate keys of the two mappings is less than the threshold value, and selectively merging using a join when the selected inclusion dependency equals the threshold value, i.e. when a foreign key relationship can be inferred between the two mappings. In other words, the nature of the selective merging may be dependent on whether or not the first and second sets of attributes match the same or different target attributes.
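Taken together, these paragraphs suggest a decision rule along the following lines. This is a hedged sketch with our own names: attrs1/attrs2 are the target attributes each mapping matches, and key_overlap is the best inclusion-dependency overlap between their candidate keys.

```python
# Simplified sketch of the operator choice described above.

def choose_operator(attrs1, attrs2, key_overlap):
    if set(attrs1) == set(attrs2):
        return "union"            # same target attributes: union the mappings
    if key_overlap == 1.0:
        return "join"             # a foreign-key relationship can be inferred
    if key_overlap > 0.0:
        return "full outer join"  # partial overlap: correlate what we can
    return None                   # no operator yields useful information

assert choose_operator(["postcode", "price"], ["postcode", "price"], 0.0) == "union"
assert choose_operator(["postcode"], ["crime rank"], 1.0) == "join"
assert choose_operator(["postcode"], ["crime rank"], 0.5) == "full outer join"
assert choose_operator(["postcode"], ["crime rank"], 0.0) is None
```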
[0018] There may be occasions when it is not possible to identify any inclusion dependencies and in this case, the method may comprise identifying a first set of candidate keys which are the candidate keys in the first mapping which match a subset of target attributes, identifying a second set of candidate keys which are the candidate keys in the second mapping which match a subset of target attributes, determining whether the first set of candidate keys match the same target attributes as the second set of candidate keys, and if there is a match, selectively merging using a full outer join. If there is no match, it may be determined that there is no possible or useful pairwise merging in such a case. [0019] Generating each set of intermediate mappings may further comprise calculating a fitness value for each intermediate mapping. For example, the fitness value may be calculated using the equation:
fitness(m, t) = 1 - (nulls(m) + (arity(t) - matches(m)) * T(m)) / (arity(t) * T(m))
where m is the intermediate mapping, t is the target relation, arity(t) is the total number of attributes in target t, matches(m) is the number of matched attributes that mapping m has w.r.t. target t, nulls(m) is the estimated number of nulls that mapping m will generate on the values of the matched attributes and T(m) is the estimated size of mapping m, i.e., the estimated number of tuples that will result after running mapping m.
[0020] In other words, the fitness value may be calculated using an estimate of the number of nulls that the intermediate mapping will generate for the target attributes in the target schema which are matched to source attributes in the sources within the intermediate mapping. The method may further comprise comparing calculated fitness values for all intermediate mappings based on the same sources and retaining only the intermediate mapping having the highest fitness value. This reduces the number of intermediate mappings which need to be stored at each iteration.
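Assuming the fitness function is the non-NULL-cell ratio described above, it can be written as a small function. The concrete numbers in the assertion are our own illustration: a target of arity 4 and a mapping matching 3 of its attributes, yielding 10 tuples with 5 nulls among the matched values, gives the 62.5% ratio mentioned in the worked example.

```python
# The fitness value as described above: the ratio of non-NULL cells in the
# populated target. The example numbers are our own illustration.

def fitness(arity_t, matches_m, nulls_m, size_m):
    total_cells = arity_t * size_m
    padding_nulls = (arity_t - matches_m) * size_m  # NULLs for unmatched columns
    return 1 - (nulls_m + padding_nulls) / total_cells

# target with 4 attributes; mapping matches 3, yields 10 tuples with 5 nulls
assert fitness(arity_t=4, matches_m=3, nulls_m=5, size_m=10) == 0.625
```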
[0021] We also describe a non-transitory carrier carrying a computer program which when implemented on a computer causes the computer to implement the method described above.
[0022] We also describe a system comprising a target database, a plurality of source databases and a computing device which is configured to implement the method described above to map data from the plurality of source databases to the target database.
Brief Description of Drawings
[0023] For a better understanding of the invention, and to show how embodiments of the same may be carried into effect, reference will now be made, by way of example only, to the accompanying diagrammatic drawings in which:
[0024] Figure 1a is a flowchart showing the overall steps of the method;
[0025] Figure 1b is an example showing a target schema and some input schemas;
[0026] Figure 2 is a flowchart showing the detail of the step of generating mappings in the method of Figure 1a;
[0027] Figures 3a and 3b are flowcharts showing the detail of the step of merging mappings in the method of Figure 2;
[0028] Figure 4 is a flowchart showing the detail of the step of choosing an operator in the method of Figure 3a;
[0029] Figure 5a is a flowchart showing the detail of a step in the method of Figure 4;
[0030] Figures 5b and 5c are flowcharts showing the detail of another step in the method of Figure 4;
[0031] Figure 6 is a flowchart showing the detail of a step which is common to the method of Figures 5a and 5b; and [0032] Figure 7 is a flowchart showing the detail of the step of determining whether a mapping is fittest in the method of Figure 3b.
[0033] Figure 8 is a schematic block diagram of a system for implementing the described method.
Detailed Description of Drawings
[0034] Figure 1a illustrates the overall method for integrating data sets, for example by populating a target database with information from a plurality of sources. In the first step S100, the target schema t is read and, in step S102, the input schemas with their relations are read. Each input relation has one or more attributes. It will be appreciated that although these steps are shown sequentially, some of the steps could be carried out in different orders.
[0035] For example, referring to Figure 1b, suppose a user is looking for properties in Manchester or Cambridge and would like to know some information about the postcode, street and price together with their associated income and crime ranks. The target schema 20 consists of a single relation labelled “MA_CA_statistics” (which is illustrated as a table in this example) and contains a plurality of attributes: postcode, price, street, income rank and crime rank which are then to be populated with data. Thus as shown a relation may be considered to be a set of tuples (d1, d2, ..., dn), where each element di is a member of Di, a data domain. There is also a plurality of input schemas 30, 32, 34, 36 in the form of relations which for simplicity have attributes with similar names for matching source and target attributes (again the inputs are illustrated as tables in this example). The input schema includes: Manchester Agency (MA) 30 which has the following attributes: city, street, postcode and price; Cambridge Agency (CA) 32 which has the following attributes: postcode, city, street, price and county; UK Deprivation (UKD) 34 which has the following attributes: crime rank, crime decile, county and postcode; and UK quality of life statistics (UKQ) 36 which has the following attributes: county, income rank, economic activity rank and health rank.
[0036] Returning to Figure 1a, the next step S104 is to create an empty set of sub-solutions for mapping the data in the source schemas into the target schema. Mapping the data is a complex or compound problem because it involves finding mappings involving multiple input relations with attributes that match the same target relation. Each member of the set of sub-solutions represents a method for dividing the complex problem of mapping the data into a collection of simpler sub-problems for which it is easier to find a solution. As explained in more detail below, the sub-problems involve pairs of subsets of the initial input relations. The results from each pair of sub-problems are then merged to provide a solution to the complex problem.
[0037] Steps S106 to S110 show some pre-processing steps which generate profiling data for each source before mapping generation begins. In step S106, the matches between the attributes of the source and target schemas are identified and read. These matches can be thought of as metadata. In step S108, the candidate keys are identified and read. A candidate key for a relation is a column (or a combination thereof) that has unique values in the relation. For example, in Figure 1b, the candidate keys are identified using the [ck] attribute annotation. In step S110, the inclusion dependencies are identified and read. The candidate keys and inclusion dependencies can be thought of as metadata on the source.
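The candidate-key check described above can be sketched in Python (an illustrative sketch only; the example rows are invented and not taken from Figure 1b):

```python
# A column, or a combination of columns, is a candidate key when its
# values are unique within the relation.

def is_candidate_key(rows, col_indexes):
    """True if the projection on col_indexes contains no duplicates."""
    projected = [tuple(row[i] for i in col_indexes) for row in rows]
    return len(projected) == len(set(projected))

# Invented example rows in the shape of the MA relation
# (city, street, postcode, price):
ma_rows = [
    ("Manchester", "Oxford Rd", "M1 5BD", 1200),
    ("Manchester", "Cambridge St", "M1 5EB", 900),
]
print(is_candidate_key(ma_rows, [2]))  # True: postcode values are unique
print(is_candidate_key(ma_rows, [0]))  # False: the city value repeats
```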
[0038] The inclusion dependencies (INDs) are determined as follows. Given two projections R and S with identical arity over relations R' and S', respectively, we define R ⊆θ S, where θ is the overlap of tuples between R and S and θ can have the following values:
1) if R ∩ S = ∅, then R ⊆θ1 S and S ⊆θ2 R, where θ1 = θ2 = 0, and, based on profiling evidence, we say that R and S are disjoint and there is no inclusion dependency from R to S (or, indeed, from S to R);
2) if R ∩ S = R, then R ⊆θ S, where θ = 1, and we say that, based on profiling evidence, R ⊆1 S and there is a (total) inclusion dependency from R to S;
3) if R ∩ S ≠ R and R ∩ S ≠ S, then R ⊆θ S, where
θ = V(R ∩ S) / V(R),
where V(R) is the number of distinct values on R; in this case 0 < θ < 1 and the inclusion dependency is said to be partial.
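The overlap computation defined above can be sketched in Python. Two of the postcode values (M1 5BD, M1 5EB) come from Figure 1b; the remaining values are invented for illustration:

```python
# theta(R, S) = V(R ∩ S) / V(R), where V(.) counts distinct values:
# the fraction of R's distinct values that also appear in S.

def overlap(r_values, s_values):
    r, s = set(r_values), set(s_values)
    if not r:
        return 0.0
    return len(r & s) / len(r)

# Half of UKD's distinct postcodes also appear in MA, so the overlap
# from UKD to MA is 0.5, as in the example of paragraph [0039].
ukd_postcodes = ["M1 5BD", "M1 5EB", "CB2 1UW", "OX1 2AB"]
ma_postcodes = ["M1 5BD", "M1 5EB", "M2 3CD", "M3 4EF"]
print(overlap(ukd_postcodes, ma_postcodes))  # 0.5
```

Because both columns here have four distinct values, the overlap is the same in the opposite direction, matching the observation in paragraph [0039].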
[0039] For example, referring to Figure 1b, the partial inclusion dependencies postcode in UKD and county in UKQ are shown. The overlap is the fraction of the values in the dependent attribute that are found in the referenced attribute: the direction of the arrow is from the dependent attribute to the referenced attribute. Thus, for the candidate key postcode, the overlap from UKD to the source MA with the same attribute is 0.5 (i.e., half of the distinct values within the column overlap: M1 5BD and M1 5EB). It is noted that the overlap is the same in the opposite direction because each column has four different values. By contrast, the overlap between the postcode columns in UKD and CA differs because CA has two repeated entries. Thus, the overlap from UKD to the source CA for postcode is 0.25 (i.e., one value within the column overlaps: CB2 1UW) but the overlap from CA to the source UKD for postcode is 0.334. Other overlaps are illustrated in the Figure and summarised below:
[Table summarising the overlap values between the source attributes; image not reproduced in this text.]
[0040] The profile data may also include statistics, such as the cardinality of each relation, the number of distinct values for each column, and the number of nulls for each column, as required for a fitness function that chooses between candidate mappings. Access to the profiling data over the input sources is assumed; it could be produced using a tool such as Metanome ("Data profiling with Metanome" by Papenbrock et al., published in Proc. VLDB Endow., 8(12):1860-1863, August 2015) or SINDY ("Scaling out the discovery of inclusion dependencies" by Kruse et al., published in Datenbanksysteme für Business, Technologie und Web (BTW), 16. Fachtagung des GI-Fachbereichs "Datenbanken und Informationssysteme", 4.-6.3.2015 in Hamburg, Germany, Proceedings, volume 241 of LNI, pages 445-454, GI 2015).
[0041] Returning to Figure 1a, we now have the necessary information to begin creating the mappings. The problem can be defined as: given a collection of source schemas S, each source schema with many relations (i.e., tables 30 to 36 in Figure 1b), a target relation T (i.e., table 20 in Figure 1b), metadata on the sources obtained through data profiling, including candidate keys and (partial) inclusion dependencies, and metadata in the form of matches between the source relations and the target, generate a set of candidate mappings to populate the target relation with instance data from the source schemas. In step S112, a set of initial mappings is created where for each source relation a 1-to-1 mapping is created, i.e., each 1-to-1 mapping maps only one source relation to the target relation, without merging it with other source relations. This step is considered to be the creation of the base sub-solutions for the mapping problem, i.e., if the input relations cannot be merged, then the algorithm will be able to output the base solution. The output is "initMaps". This output is then added to the set of sub-solutions in iteration 1 as subsolutions[1] = initMaps (step S114). Each element in initMaps is a sub-solution for each of the source relations that have at least one match with the target relation.
[0042] Each sub-solution is a solution to a sub-problem which is trying to find a mapping for fewer relations than in the initial input. Each sub-problem thus represents an iteration in the mapping generation process. At step S116, the method determines the number of the last iteration, i.e. finaliterationidx, which is equal to the number of input relations. The determined finaliterationidx is an input to the recursive algorithm which is used to generate the mappings - genMaps - step S118. The output of this algorithm consists of all the subsolutions which are generated and retained in all iterations of the algorithm. In a final step, these sub-solution mappings can be outputted (step S120) and the method ends. However, it will be appreciated that a selection of these intermediate mappings may also be output. In other words, the final set of mappings comprises at least some of the mappings from the sets of intermediate mappings generated at each iteration.
[0043] Merely as an example, the table below shows how the inputs in Figure 1b could be combined to create two complete target rows. It will be appreciated that this is not a complete output of a mapping and is merely provided for illustration. For the first complete row, the first three data entries may be from the MA table and the other two data entries from the UKD and UKQ tables respectively. For the second complete row, the first three data entries may be from the CA table and the other two data entries from the UKD and UKQ tables respectively. In both examples, the candidate key of the postcode may be used to merge the MA/CA and UKD tables and the candidate key of county to merge the UKD and UKQ tables, as explained in more detail below.
[Example table showing two complete target rows; image not reproduced in this text.]
[0044] Figure 2 shows the detail of the recursive algorithm named "genMaps" which is the top-level algorithm that uses dynamic programming to divide the complex problem into a series of sub-problems. Dynamic programming is an approach to solving a complex problem by dividing it into a series of sub-problems. In the recursive process, the algorithm will try to break down the complex problem, i.e., finding a mapping for all initial sources (finaliterationidx), into smaller sub-problems, i.e., finding mappings for fewer initial relations by recursively calling genMaps with a parameter idx < finaliterationidx. When genMaps for iteration 1 (genMaps(1)) is called, it means that it has reached the base solution, and the initial mappings that were previously set as the sub-solution for iteration 1 will be returned. By trying to build the mapping for the final iteration, intermediate sub-solution mappings need to be created and used in subsequent iterations. As explained in more detail below, given N initial source relations, in each iteration idx, 1 < idx < N, the algorithm searches for the best way to merge any idx source relations. The mappings that result from combining each subset of source relations characterize new relations that are referred to as intermediate mappings, the collection of which comprises the solution at iteration idx. The flowchart can also be expressed as an algorithm as follows:
1: function genMaps(idx)
2:   if subsolution[idx] exists then
3:     return subsolution[idx]
4:   else
5:     iteration_maps ← ∅
6:     for i ← 1 to idx/2 do
7:       batch1 ← genMaps(i)
8:       batch2 ← genMaps(idx - i)
9:       new_maps ← MERGEMAPPINGS(batch1, batch2)
10:      iteration_maps.add(new_maps)
11:    subsolution[idx] ← iteration_maps
12:    return subsolution[idx]
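The dynamic-programming structure of this listing can be sketched in Python. This is a simplified illustration only: merge_mappings here merely labels each pairing, whereas the real algorithm applies chooseOperator and retains only the fittest merges, as described below.

```python
# Memoization table of sub-solutions, indexed by iteration number.
subsolution = {}

def merge_mappings(batch1, batch2):
    # Stand-in merge: pair every mapping from one batch with every
    # mapping from the other.
    return [f"merge({m1},{m2})" for m1 in batch1 for m2 in batch2]

def gen_maps(idx):
    if idx in subsolution:            # memoized sub-solution (lines 2-3)
        return subsolution[idx]
    iteration_maps = []
    for i in range(1, idx // 2 + 1):  # split idx into i and idx - i
        batch1 = gen_maps(i)
        batch2 = gen_maps(idx - i)
        iteration_maps.extend(merge_mappings(batch1, batch2))
    subsolution[idx] = iteration_maps # memoize for later calls (line 11)
    return iteration_maps

# Base sub-solutions: the 1-to-1 mappings for three source relations.
subsolution[1] = ["MA", "CA", "UKD"]
result = gen_maps(3)
print(len(result))  # 27: 3 iteration-1 maps paired with 9 iteration-2 maps
```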
[0045] Thus, as shown in step S200, the input to the algorithm is idx (the iteration index) and the output collection of mappings is termed "resultMaps". Initially, as shown, the set of mappings is empty (step S202). At step S204, there is a determination as to whether a subsolution for this iteration (idx) already exists. If the subsolution already exists, it is output as the collection of mappings for this iteration, i.e., resultMaps = subsolution[idx] (step S206). If the subsolution does not exist, i is set to 1 to perform an internal recursive algorithm as explained below. After initializing i with 1, a determination as to whether i < idx/2 is made. If this is not true, the solution for this iteration is memoized as the subsolution for iteration idx, i.e., subsolution[idx] = resultMaps (line 11 and step S212), for reuse in subsequent calls to the algorithm (e.g., lines 2 to 3 and steps S204 and S206).
[0046] If i < idx/2, two sets of intermediate mappings, batch 1 and batch 2, are generated at steps S214 and S216, respectively. Batch 1 is the collection of intermediate mappings for the input i and batch 2 is the collection of intermediate mappings for the input idx - i. As shown by the dotted line, the algorithm recursively calls the same method, genMaps, with different input parameters. The next step is to merge the sub-solutions from iterations i and idx - i, i.e., to merge mappings from batch 1 and batch 2 (step S218 and line 9), where each mapping from one batch is merged with all the mappings from the other batch. The resulting merged mappings are output as iMaps, i.e., the intermediate mappings for this iteration, and at step S220 these are added to the collection of mappings, resultMaps. i is then incremented by 1 as shown at step S222 and the steps S210 to S222 are iteratively repeated until the condition in step S210 is not satisfied. After iteration N, a set of mappings which merge all or subsets of the initial source relations can be obtained from the collection of sub-solutions. The schema of the output mappings is the same as the schema of the target relation, though some attributes may contain only null values.
[0047] For example, in Figure 1b, where N = 4, in iteration idx = 4 (the last one), the algorithm will try to merge the batch of mappings from iteration 3 with the batch of mappings from iteration 1, and then pair-wise merge the mappings from iteration 2. For example, assume that in iteration 2, the following mappings were found (NB these are not the complete list but just examples for illustrative purposes):
m2_1 ← merge(MA, CA)
m2_2 ← merge(UKQ, UKD)
m2_3 ← merge(CA, UKD)
Then in iteration 4, the algorithm will try to merge (m2_1, m2_2), (m2_2, m2_3) and (m2_1, m2_3). It can be observed that by merging m2_1 with m2_2, a mapping that incorporates all initial sources will be obtained.
[0048] Figures 3a and 3b show the detail of the algorithm named "mergeMaps" which implements the steps that take place within an iteration of the main loop of genMaps shown in Figure 2. The detailed algorithm can also be expressed as follows:
1: function mergeMaps(batch1, batch2)
2:   new_maps ← ∅
3:   for each map i in batch1 do
4:     for each map j in batch2 do
5:       operator ← chooseOperator(map i, map j)
6:       if operator not null then
7:         interm_map ← genIntermMapping(operator)
8:         md ← computeMetadata(interm_map)
9:         if isFittest(interm_map, new_maps) then
10:          new_maps.add(interm_map)
11:  return new_maps
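The pairwise merging loop in this listing can be sketched in Python. The plug-in functions below are toy stand-ins supplied only for illustration; in the real algorithm they correspond to chooseOperator, the mapping generation of line 7, and the isFittest pruning described below.

```python
# Every mapping from one batch is paired with every mapping from the
# other; a pair is merged only when an operator is found, and the
# merged mapping is kept only when it passes the fitness test.

def merge_maps(batch1, batch2, choose_operator, gen_mapping, is_fittest):
    new_maps = []
    for m1 in batch1:
        for m2 in batch2:
            op = choose_operator(m1, m2)
            if op is None:
                continue          # no useful way to merge this pair
            interm = gen_mapping(op, m1, m2)
            if is_fittest(interm, new_maps):
                new_maps.append(interm)
    return new_maps

# Toy stand-ins: always choose a join, label the result, keep everything.
merged = merge_maps(
    ["MA", "CA"], ["UKD"],
    choose_operator=lambda a, b: "join",
    gen_mapping=lambda op, a, b: f"{op}({a},{b})",
    is_fittest=lambda m, seen: True,
)
print(merged)  # ['join(MA,UKD)', 'join(CA,UKD)']
```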
[0049] Referring to Figures 3a and 3b, the inputs to the mergeMaps algorithm are batch1 and batch2 and the output is iMaps (step S300). The initial step is that iMaps is empty (step S302). At step S304, i is set to 0 to perform the iterative algorithm as explained below. After initializing i with 0, a determination as to whether i is smaller than the size of batch1 is made. If this is not true, iMaps is output (step S309). If this is true, j is set to 0 (step S308) and a determination as to whether j is smaller than the size of batch2 is made (step S310). If this is not true, i is incremented by 1 (step S311) and steps S306, S308 and S310 are repeated. This represents the creation of the pairs of mappings from the two batches by pairing each mapping i from batch1 with all mappings j from batch2.
[0050] If j is smaller than the size of batch2 at step S310, MERGEMAPPINGS uses another algorithm, chooseOperator (line 5 and step S312), to choose an operator with which to combine map i and map j, from batch1 and batch2, respectively, if they can usefully be merged (in Figure 3a, these maps are labelled batch1[i] and batch2[j]). map i and map j represent pairs of mappings from batches of mappings from two previous iterations. The chooseOperator algorithm is described in more detail below. If the mappings cannot be usefully combined, i.e., if at step S314 the output is null, j is incremented by 1 and steps S310, S312 and S314 are repeated.
[0051] If the mappings can be usefully combined, the new mapping is generated (line 7 and step S318) and the metadata for the resulting intermediate mapping is computed (line 8 and step S320, beginning on Figure 3b). The intermediate mapping is retained if it has the highest fitness of any mapping involving the same source relations (lines 9 to 10 and steps S322 to S326). The isFittest function selects the mapping with the highest fitness and is explained in more detail below. It can be defined in different ways using table statistics and profiling data, for example by preferring candidate mappings with fewer nulls. As shown at step S324, if the new mapping satisfies the isFittest function, the old mappings that stem from the same initial relations as iMap are removed (step S326) and the new iMap is added to the set of retained iMaps (step S328). However, if the new mapping does not satisfy the conditions in the isFittest function, the mapping is not retained, j is incremented by 1, and steps S310, S312 and S314 are repeated. Both loops terminate when all pairs of mappings have been created and merged.
[0052] Figure 4 shows the detail of the algorithm named “chooseOperator” which is used in the mergeMaps algorithm shown in Figure 3a to combine two intermediate mappings taking into account how these relate to the given target table. The detailed algorithm can be expressed as follows:
1: function chooseOperator(map1, map2)
2:   // t_rel, pd (profile data) are global variables
3:   map1_ma ← matchesAttr(map1, t_rel)
4:   map2_ma ← matchesAttr(map2, t_rel)
5:   operator ← null
6:   if sameMatches(map1_ma, map2_ma) then
7:     operator ← genOperatorSame(map1, map2)
8:   else
9:     if diffMatches(map1_ma, map2_ma) then
10:      operator ← genOperatorDiff(map1, map2)
11:  return operator
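The decision made by this listing can be sketched in Python. Note that in the full algorithm the actual operator (or null) is then determined by genOperatorSame or genOperatorDiff; this sketch only classifies the pair. Attribute names follow Figure 1b.

```python
# Mappings that match the same target attributes are candidates for a
# union; mappings that match different target attributes are
# candidates for a join.

def matched_attrs(source_attrs, target_attrs):
    return set(source_attrs) & set(target_attrs)

def choose_operator_kind(map1_attrs, map2_attrs, target_attrs):
    ma1 = matched_attrs(map1_attrs, target_attrs)
    ma2 = matched_attrs(map2_attrs, target_attrs)
    return "union-candidate" if ma1 == ma2 else "join-candidate"

target = {"postcode", "price", "street", "income_rank", "crime_rank"}
ma = {"city", "street", "postcode", "price"}
ca = {"postcode", "city", "street", "price", "county"}
ukd = {"crime_rank", "crimedecile", "county", "postcode"}

print(choose_operator_kind(ma, ca, target))   # union-candidate
print(choose_operator_kind(ma, ukd, target))  # join-candidate
```

MA and CA both match postcode, price and street, so they are union candidates; UKD matches postcode and crime_rank, a different set, so MA and UKD are join candidates, as in paragraph [0054].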
[0053] As set out in Figure 4, the inputs to the algorithm are iMap1 and iMap2 (i.e., two mappings) and the output is the op (operator) as set out in step S400. As shown in step S402, initially the operator is set to the NULL value. The sets of matched target attributes are retrieved for each mapping w.r.t. the target (steps S404 and S406, lines 3 to 4 above). The input in step S404 is thus the matches of the mapping iMap1 with the attributes of the target and similarly for step S406 the matches of the mapping iMap2 with the attributes of the target. The outputs are labelled tm1 (map1_ma) and tm2 (map2_ma) respectively, representing the two sets of matched target attributes.
[0054] The two outputs are then compared (step S408) and if they do not match, i.e., if the two mappings match different attributes within the target schema, they are considered candidates for joining by the algorithm GenOperatorDiff (step S410 and line 10). Alternatively, if the outputs match and thus the two mappings match the same attributes within the target schema, they are considered candidates for unioning by the algorithm GenOperatorSame (step S412 and line 7). Returning to the example shown in Figure 1b, the matches for the MA relation are thus postcode, price and street, and CA also has the same matches so these are candidates for GenOperatorSame. However, UKD has different matches, namely postcode and crime_rank, to those of MA, and thus MA and UKD are candidates for joining by the algorithm GenOperatorDiff.
[0055] The output of chooseOperator will either be an operator that represents the way to merge two input mappings or NULL if no useful way could be found to merge them with respect to the target relation. If the algorithm merges the two mappings, then the resulting mapping that is generated will be the result of a union, join or full outer join. Another possible action is to discard subsumed mappings; a relation A is subsumed by a relation B if A ⊆ B, so the subsumed relation need not be considered in subsequent merges.
[0056] Figure 5a shows the detail of the algorithm named "genOperatorSame" which is used in the chooseOperator algorithm shown in Figure 4 to decide which operator to apply between two source relations when they match identical target attributes. The detailed algorithm can also be expressed as follows:
1: function genOperatorSame(map1, map2)
2:   subsumedMap ← isSubsumed(map1, map2)
3:   if subsumedMap not NULL then
4:     discard(subsumedMap)
5:     operator ← null
6:   else
7:     operator ← Union(map1, map2)
8:   return operator
[0057] As set out in Figure 5a, the inputs to the algorithm are iMap1 and iMap2 (i.e., two mappings) and the output is the op (operator) as set out in step S500. As shown in step S502, initially the operator is set to the NULL value. An algorithm entitled "isSubsumed" (line 2, step S504; the algorithm is described in more detail in Figure 6) tests whether the values for the matched attributes from the first relation are fully contained in the attributes in the second relation, i.e., are subsumed. As shown in step S504, the inputs are also iMap1 and iMap2 and the output is the subsumed mapping out of the two given as input (subsumedMap). [0058] In a first case (lines 3 to 5), the output subsumedMap will not be null and no merge will take place because one relation is subsumed by the other relation and thus the subsumed relation can be safely discarded (lines 4 to 5 and step S510). In a second case (line 7 and step S508), the chosen operator is the union of the two projected mappings on the attributes that are needed in the target and that both mappings match. In other words, the union of iMap1 (also expressed as map1) and iMap2 (also expressed as map2) is returned.
[0059] Returning to the example shown in Figure 1b, one of the calls to "genOperatorSame" will be made in iteration 2 to merge the CA and MA relations as they match the same target attributes, e.g., postcode, price and street. It can be observed from Figure 1b that one does not subsume the other; in fact, their tuples are disjoint, and the lack of (partial) inclusion dependencies between the two relations reflects this. The result of the merge will be a mapping: m2_1 ← union(MA, CA).
[0060] Figures 5b and 5c show the detail of the algorithm named "genOperatorDiff" which is used in the chooseOperator algorithm shown in Figure 4 to decide which operator to apply between mappings that do not match the same target attributes, i.e., the sets of matched attributes can either be disjoint or have common attributes. The detailed algorithm can also be expressed as follows:
1: function genOperatorDiff(map1, map2)
2:   // t_rel, pd (profile data) are global variables
3:   operator ← null
4:   subsumedMap ← isSubsumed(map1, map2)
5:   if subsumedMap not NULL then
6:     discard(subsumedMap)
7:     return operator
8:   map1_keys ← getKeys(pd, map1)
9:   map2_keys ← getKeys(pd, map2)
10:  ind ← getMaxInd(map1_keys, map2_keys)
11:  if ind exists then
12:    if ind.overlap = 1.0 then
13:      operator ← Join(map1, map2, ind.attributes)
14:    else
15:      operator ← OuterJoin(map1, map2, ind.attributes)
16:  else
17:    map1_ind ← getMaxInd(map1_keys, map2.attributes)
18:    map2_ind ← getMaxInd(map2_keys, map1.attributes)
19:    ind ← maxCoef(map1_ind, map2_ind)
20:    if ind exists then
21:      if ind.overlap = 1.0 then
22:        operator ← Join(map1, map2, ind.attributes)
23:      else
24:        operator ← OuterJoin(map1, map2, ind.attributes)
25:    else
26:      map1_mk ← getMatchedKeys(map1, t_rel)
27:      map2_mk ← getMatchedKeys(map2, t_rel)
28:
29:      if matchSameTargetAttributes(map1_mk, map2_mk) then
30:        operator ← OuterJoin(map1, map2, <map1_mk, map2_mk>)
31:  return operator
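The overlap-driven operator choice at the heart of this listing can be sketched in Python (a simplified illustration of lines 11 to 15 only, not of the full key-selection logic):

```python
# A total inclusion dependency (overlap = 1.0) is treated as an
# inferred foreign-key relationship and yields a join; a partial
# overlap (0 < overlap < 1) yields a full outer join, which joins the
# correlatable tuples while keeping all the data.

def pick_join_operator(ind_overlap):
    if ind_overlap is None or ind_overlap == 0.0:
        return None              # no usable inclusion dependency
    if ind_overlap == 1.0:
        return "Join"            # inferred foreign-key relationship
    return "OuterJoin"           # partial overlap: full outer join

# Figure 1b: the MA.postcode / UKD.postcode overlap is 0.5, so the
# merge of MA and UKD uses a full outer join.
print(pick_join_operator(0.5))  # OuterJoin
print(pick_join_operator(1.0))  # Join
```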
[0061] As shown in Figure 5b, the inputs and output of the genOperatorDiff algorithm are the same as those of the genOperatorSame algorithm of Figure 5a. Furthermore, steps S552, S554, S556 and S560 correspond to steps S502, S504, S506 and S510 of Figure 5a. In other words, as also shown in lines 4 to 7 above, if one relation subsumes the values of the other relation on the attributes that match the target, the subsumed relation is discarded and the method returns null as no operator was found that yields useful information. If the subsumption conditions are not met, the algorithm tries to merge the input relations.
[0062] The candidate keys (i.e., iMap1.ckeys and iMap2.ckeys, also expressed as map1_keys and map2_keys) from both input relations are retrieved from the profile data (lines 8 to 9). Furthermore, any (partial) inclusion dependencies (termed IND) between a pair of keys from the two sets of candidate keys are also retrieved from the profile data. The algorithm then chooses the IND with the highest overlap (step S562 and line 10), i.e., the maximum is chosen (getMaxInd). However, these inclusion dependencies may not exist and thus the algorithm checks whether an inclusion dependency has been found (line 11 and step S564).
[0063] Continuing to Figure 5c, if there is an output IND, the level of overlap is checked, i.e., compared to 1 (step S566, line 12). If the overlap is 1 , the chosen operator is a join because a foreign key relationship has been inferred between the two input mappings on their candidate key attributes (step S568 and line 13). Alternatively, if the overlap is between 0 and 1 (but not equal to 0 or 1), the chosen operator is a full outer join because a foreign key relationship could not be inferred, but some tuples can be correlated due to the overlap between the candidate keys. Thus, the algorithm merges the two mappings by applying a full outer join, thus joining the tuples that can be joined while keeping all the data (step S570 and line 15). In both cases, the join condition will compare the candidate key attributes that are involved in the chosen inclusion dependency, IND.
[0064] Returning to Figure 1b to illustrate this process, one of the calls of genOperatorDiff will be in iteration 2 when merging MA and UKD as these match different target attributes, i.e., they both match postcode, but MA also matches street and price while UKD matches crime_rank. The algorithm will find that there is a candidate key on UKD.postcode and that this attribute has a partial inclusion dependency with MA.postcode, which is not a candidate key. Accordingly, there are no inclusion dependencies between two candidate keys and the algorithm tries to infer a foreign key relationship between a candidate key from one relation and a subset of attributes of the other relation which may not represent a candidate key (steps S572 and S574 of Figure 5b and lines 17 and 18). Thus, in step S572 (line 17), the inclusion dependencies (INDs) for all candidate keys in the first source and the attributes in the second source are identified and the IND with the highest overlap is chosen, i.e., the maximum is chosen (getMaxInd with an output ind1). Similarly, in step S574 (line 18), the inclusion dependencies (INDs) for all candidate keys in the second source and the attributes in the first source are identified and the IND with the highest overlap is chosen, i.e., the maximum is chosen (getMaxInd with an output ind2). The outputs from steps S572 and S574 are compared to choose the overall maximum at step S576 (line 19).
[0065] There may be no inclusion dependencies between key and non-key attributes and thus the step of checking whether an inclusion dependency has been found is repeated as step S578 (line 20). If there is an inclusion dependency, steps S566 to S570 of Figure 5c are repeated to check the overlap. If the overlap is 1, a join is used as before (step S568 and line 22). Alternatively, as illustrated in the example, the value of the partial inclusion dependency is 0.5 and thus the result of the merge will be a mapping: OuterJoin(MA, UKD, <MA.postcode, UKD.postcode>) (step S570 and line 24).
[0066] There may be no inclusion dependencies involving candidate keys; for example, in Figure 1b, there are no candidate keys in MA that share an inclusion dependency with attributes in UKQ or vice versa. Thus, for MA and UKQ it is not possible to infer a foreign key relationship as described above. Accordingly, the algorithm retrieves the candidate keys from both relations that match the target attributes. In more detail, as shown in Figure 5c, the candidate keys from the first source are retrieved (iMap1.ckeys or map1_keys) as the input and the ones which match the target attributes are output (tckeys1 or map1_mk) as set out in step S580 and line 26. Similarly, the candidate keys from the second source are retrieved (iMap2.ckeys or map2_keys) as the input and the ones which match the target attributes are output (tckeys2 or map2_mk) as set out in step S582 and line 27. These output keys are compared in step S584 (line 29) and if they match, the two relations are merged by applying the full outer join operation as shown in step S586, where the join is on the attributes that meet the above-mentioned condition (line 30). Otherwise, no operator which can join the sources is found and the operator is output as null. In the example, there are no attributes in MA and UKQ which are candidate keys and that match the same target attributes, so the comparison between tckeys1 (map1_mk) and tckeys2 (map2_mk) will not return any matches and thus the operator will be null.
[0067] Figure 6 shows the detail of the algorithm "isSubsumed" which is used as described above. As shown in step S600, the inputs are iMap1 and iMap2 and the output is the subsumed mapping of the two given as input, i.e., subsumedMap will be equal to iMap1 or iMap2. Initially, subsumedMap is set to NULL (step S602). In steps S604 and S606, the attributes from each of the first and second mappings which match the target schema are identified as sa1 and sa2 respectively. It will be appreciated that although shown as two separate steps, these steps could be carried out simultaneously or in a different order. In step S608, the inclusion dependencies (INDs) from the matched attributes of the first mapping to the matched attributes of the second mapping are identified and the IND with the highest overlap is chosen, i.e., the maximum is chosen (getMaxInd with an output ind).
[0068] The level of overlap is next checked, i.e., compared to 1 (step S610). If the overlap is 1, the subsumedMap is set as the first intermediate mapping (step S612). Alternatively, if the overlap is not 1, the inclusion dependencies (INDs) from the matched attributes of the second mapping to the matched attributes of the first mapping are identified and the IND with the highest overlap is chosen (step S614). The level of overlap is checked again (step S616) and if the overlap is 1, the subsumedMap is set as the second intermediate mapping (step S618). Alternatively, if the overlap is not 1, the subsumedMap remains NULL and is returned because neither mapping is subsumed by the other.
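The subsumption test described above can be sketched in Python (an illustrative sketch; the tuples are invented, and the real algorithm works from inclusion-dependency overlaps in the profile data rather than from the raw tuples):

```python
# One mapping is subsumed by another when its tuples, projected on the
# matched attributes, are fully contained in the other's, i.e. the
# inclusion dependency between them has overlap 1.

def is_subsumed(map1_tuples, map2_tuples):
    s1, s2 = set(map1_tuples), set(map2_tuples)
    if s1 <= s2:
        return "map1"   # map1 contributes nothing beyond map2
    if s2 <= s1:
        return "map2"
    return None         # neither mapping is subsumed

small = {("M1 5BD", 1200)}
large = {("M1 5BD", 1200), ("CB2 1UW", 1500)}
print(is_subsumed(small, large))            # map1
print(is_subsumed(large, small))            # map2
print(is_subsumed({("a",)}, {("b",)}))      # None
```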
[0069] Figure 7 shows the detail of the algorithm "IsFittest" which is used as described above. Each intermediate mapping will contain values that are of different quality with respect to the chosen target relation, so, in order to prune the search space, the "IsFittest" algorithm eliminates the intermediate mappings that have poor quality in comparison with other intermediate mappings which stem from the same initial source relations. The algorithm will always keep the best way of merging a subset of source relations by keeping only the intermediate mapping with the highest fitness result. The fitness result will be computed after the new intermediate mapping is created. As shown in step S700, the inputs are iMap, oldMaps and iterationIdx and the output is a Boolean output as to whether or not the mapping "isFittest". Initially, the output is set to FALSE (step S702).
[0070] In step S704, there is a function getMapsWithInitRels; in other words, all the intermediate mappings which stem from the same initial source relations are retrieved. The inputs are thus iMap and subsolutions[iterationIdx] and the output is memoizedMaps. In step S706, this output is added to the set of existing intermediate maps, i.e., oldMaps. A fitness function for each generated mapping is then calculated, for example using the function below:
fitness(m, t) = (matches(m) * T(m) - nulls(m)) / (arity(t) * T(m))
where m is the intermediate mapping, t is the target relation, arity(t) is the total number of attributes in target t, matches(m) is the number of matched attributes that mapping m has w.r.t. target t, nulls(m) is the estimated number of nulls that mapping m will generate on the values of the matched attributes and T(m) is the estimated size of mapping m, i.e., the estimated number of tuples that will result after running mapping m.
[0071] In other words, the fitness function has as input a generated mapping and the output will represent the ratio of non-NULL cells, i.e., meaningful values, that will be considered to populate the target table t. It is noted that the calculated term (arity(t)-matches(m))*T(m) will add the necessary NULLs to populate the target t if mapping m does not match all target attributes.
[0072] As an illustrative example, consider a target schema with a relation t where t has 4 attributes so arity(t) = 4. An example mapping m matches three target attributes (out of 4) so matches(m) = 3 and the total estimated number of nulls that mapping m will generate on the matched attributes is five so nulls(m) = 5. Furthermore, the estimated number of tuples that will result after running mapping m is ten so T(m) = 10, thus the fitness result is:
fitness(m, t) = (3 * 10 - 5) / (4 * 10) = 25 / 40 = 0.625
[0073] This means that mapping m is expected to produce an estimate of 62.5% non-NULL values for target relation t. This low value is mostly due to missing one match to the target which means that the result of mapping m needs a padding of 10 NULLs for the unpopulated target column.
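The fitness computation can be sketched in Python, consistent with the worked example above (arity 4, 3 matched attributes, 5 estimated nulls and 10 tuples giving 62.5% non-NULL values):

```python
# Estimated fraction of non-NULL cells that a mapping contributes to
# the target relation.

def fitness(arity_t, matches_m, nulls_m, size_m):
    total_cells = arity_t * size_m            # all target cells produced
    non_null = matches_m * size_m - nulls_m   # matched cells minus nulls
    return non_null / total_cells

print(fitness(arity_t=4, matches_m=3, nulls_m=5, size_m=10))  # 0.625
```

Note that the unmatched target attribute accounts for 10 of the NULL cells, which is why missing a single match depresses the score so sharply.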
[0074] Once the fitness results have been calculated, they are compared at step S708. If the iMap fitness is greater than the oldMaps fitness, the output is TRUE (step S710) and the new map is retained. Alternatively, if the iMap fitness is less than the fitness of previously retained mappings, the output is FALSE (step S712) and the new intermediate mapping is not retained.
[0075] At least some of the example embodiments described herein may be constructed, partially or wholly, using dedicated hardware. Figure 8 is a schematic illustration of such a hardware system which can be used to implement the data integration method described above. The system may comprise a computing device 10, which may be any suitable device, e.g., a user device or a server, connected to a plurality of databases 20a, 20b, to 20n, each of which contains at least one data source 22a, 22b to 22m. The number of databases and sources shown in the Figure is merely for illustration. Moreover, the plurality of data sources may all be stored on a single database, in one or more schemas. There is also a database 30 which can be used to store the combined data from the different sources in an output form, which is labelled target 32. Again, although this is shown as a separate database, it need not be separate from the database(s) with the sources. Furthermore, some or all of the databases being integrated could be stored on the computing device itself.
[0076] The computing device 10 comprises standard components such as a processor 12, a user interface 16 and memory 14. A mapping module 40 is also illustrated, which stores and/or implements the algorithms for the data integration method outlined above. The memory 14 may be used to store the intermediate data generated by implementing the method, e.g. any intermediate mappings, fitness scores, inclusion dependencies, metadata and so on.
[0077] Terms such as ‘component’, ‘module’ or ‘unit’ used herein may include, but are not limited to, a hardware device, such as circuitry in the form of discrete or integrated components, a Field Programmable Gate Array (FPGA) or Application Specific Integrated Circuit (ASIC), which performs certain tasks or provides the associated functionality. In some embodiments, the described elements may be configured to reside on a tangible, persistent, addressable storage medium and may be configured to execute on one or more processors. These functional elements may in some embodiments include, by way of example, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. Although the example embodiments have been described with reference to the components, modules and units discussed herein, such functional elements may be combined into fewer elements or separated into additional elements.
[0078] Various combinations of optional features have been described herein, and it will be appreciated that described features may be combined in any suitable combination. In particular, the features of any one example embodiment may be combined with features of any other embodiment, as appropriate, except where such combinations are mutually exclusive. Throughout this specification, the term “comprising” or “comprises” means including the component(s) specified but not to the exclusion of the presence of others.
[0079] Attention is directed to all papers and documents which are filed concurrently with or previous to this specification in connection with this application and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference.
[0080] All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive.
[0081] Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.
[0082] The invention is not restricted to the details of the foregoing embodiment(s). The invention extends to any novel one, or any novel combination, of the features disclosed in this specification (including any accompanying claims, abstract and drawings), or to any novel one, or any novel combination, of the steps of any method or process so disclosed.
[0083] Although a few preferred embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that various changes and modifications might be made without departing from the scope of the invention, as defined in the appended claims.

Claims

1. A data integration method for populating a target database with data from a plurality of source databases comprising source relations having a plurality of source attributes, the method comprising: inputting a target schema for the target database, wherein the target schema comprises a plurality of target attributes;
generating a set of mappings which populates the target database with data from at least a subset of the plurality of source relations by
defining a plurality of sub-solutions wherein for each sub-solution a number of the source relations are merged and each sub-solution comprises a set of intermediate mappings to map data from at least a subset of the source relations to the target schema using source attributes of each source relation which match target attributes in the target schema;
iteratively solving each of the plurality of sub-solutions to generate, at each iteration, a set of intermediate mappings which maps more source relations to the target schema than a set of intermediate mappings generated in the previous iteration by selectively merging pairs of mappings from each set of intermediate mappings generated in the previous iteration, and
outputting as the set of mappings at least some of the mappings from the sets of intermediate mappings generated in previous iterations.
2. The method of claim 1 wherein iteratively solving comprises solving a base sub-solution in which no source relations are merged to create a base set of intermediate mappings which is used in each iteration.
3. The method of claim 1 or claim 2, comprising
identifying a plurality of candidate keys within the plurality of sources, wherein each candidate key is a set of source attributes; and
calculating inclusion dependencies for the sources, wherein an inclusion dependency is a measure of the overlap between two sets of source attributes;
wherein selectively merging pairs of mappings comprises determining whether the pair of mappings are to be merged using at least one of the source attributes of the source relations, the candidate keys and the inclusion dependencies.
4. The method of claim 3, wherein when it is determined that the pair of mappings are to be merged, selectively merging pairs of mappings comprises selecting a merging which is one of a union, a join and a full outer join of the sources within the pair of mappings.
5. The method of claim 3 or 4, wherein selectively merging pairs of mappings comprises determining whether a first mapping within the pair of mappings is subsumed by a second mapping in the pair of mappings.
6. The method of claim 5, comprising discarding the first mapping to prevent merging when it is determined to be subsumed by the second mapping.
7. The method of any one of claims 3 to 6, comprising:
identifying a first set of attributes which are the attributes in the first mapping which match the target attributes,
identifying a second set of attributes which are the attributes in the second mapping which match the same target attributes,
selecting the inclusion dependency between the first and second set of attributes having the highest value; and
comparing the selected inclusion dependency to a threshold value.
8. The method of claim 7, comprising
determining whether the first set of attributes match the second set of attributes, and when a match is determined, selectively merging using a union when the selected inclusion dependency is less than the threshold value.
9. The method of claim 7, comprising
determining whether the first set of attributes match the second set of attributes, and if the sets of attributes are not the same, selectively merging using a full outer join when the selected inclusion dependency is less than the threshold value and selectively merging using a join when the selected inclusion dependency equals the threshold value.
10. The method of any one of claims 3 to 5, comprising:
identifying a first set of candidate keys which are the candidate keys in the first mapping which match the target attributes,
identifying a second set of candidate keys which are the candidate keys in the second mapping which match the target attributes,
determining whether the first set of candidate keys match the same target attributes as the second set of candidate keys, and
if there is a match, selectively merging using a full outer join.
11. The method of any one of the preceding claims, wherein iteratively solving further comprises calculating a fitness value for each intermediate mapping.
12. The method of claim 11, further comprising comparing calculated fitness values for all intermediate mappings stemming from the same source relations and retaining only the intermediate mapping having the highest fitness value.
13. The method of claim 11 or claim 12, wherein the fitness value is calculated using an estimate of the number of nulls that the intermediate mapping will generate for the target attributes in the target schema which are matched to source attributes in the sources within the intermediate mapping.
14. A non-transitory carrier carrying a computer program which when implemented on a computer causes the computer to implement the method of any one of claims 1 to 13.
15. A system comprising a target database, a plurality of source databases comprising source relations and a computing device which is configured to implement the method of any one of claims 1 to 14 to map data from the plurality of source databases to the target database.
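By way of a non-authoritative sketch, the merge-operator selection described in claims 7 to 10 might be implemented along the following lines. The function name, the use of Python sets for attribute sets, and the treatment of the threshold value are assumptions made for illustration only; they are not part of the claims.

```python
def select_merge(first_attrs, second_attrs, inclusion_dep, threshold,
                 first_key_targets, second_key_targets):
    # first_attrs / second_attrs: target attributes matched by each mapping
    # inclusion_dep: the highest inclusion dependency between the two
    #   attribute sets, compared to a threshold value (claim 7)
    # first_key_targets / second_key_targets: target attributes matched by
    #   each mapping's candidate keys (claim 10)

    # Claim 10: candidate keys of both mappings match the same target
    # attributes -> full outer join.
    if first_key_targets and first_key_targets == second_key_targets:
        return "full outer join"

    if first_attrs == second_attrs:
        # Claim 8: identical matched attributes and an inclusion
        # dependency below the threshold -> union.
        if inclusion_dep < threshold:
            return "union"
    else:
        # Claim 9: different matched attributes.
        if inclusion_dep < threshold:
            return "full outer join"
        if inclusion_dep == threshold:
            return "join"

    return None  # no merge operator selected for this pair


# Example: two mappings covering the same target attributes with little
# value overlap between them are combined by union.
print(select_merge({"name", "city"}, {"name", "city"}, 0.1, 1.0,
                   frozenset(), frozenset()))  # union
```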
PCT/GB2019/051866 2018-07-02 2019-07-01 Method and system for integrating data sets WO2020008180A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GBGB1810847.2A GB201810847D0 (en) 2018-07-02 2018-07-02 Method and system for integrating data sets
GB1810847.2 2018-07-02

Publications (1)

Publication Number Publication Date
WO2020008180A1 true WO2020008180A1 (en) 2020-01-09

Family

ID=63143483

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2019/051866 WO2020008180A1 (en) 2018-07-02 2019-07-01 Method and system for integrating data sets

Country Status (2)

Country Link
GB (1) GB201810847D0 (en)
WO (1) WO2020008180A1 (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680082A (en) * 2020-04-30 2020-09-18 四川弘智远大科技有限公司 Government financial data acquisition system and data acquisition method based on data integration
CN111680082B (en) * 2020-04-30 2023-08-18 四川弘智远大科技有限公司 Government financial data acquisition system and method based on data integration

Also Published As

Publication number Publication date
GB201810847D0 (en) 2018-08-15


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19745206

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19745206

Country of ref document: EP

Kind code of ref document: A1