WO2020008180A1 - Method and system for integrating data sets - Google Patents


Publication number
WO2020008180A1
Authority
WO
WIPO (PCT)
Prior art keywords
mappings
attributes
mapping
target
source
Application number
PCT/GB2019/051866
Other languages
French (fr)
Inventor
Martin Koehler
Norman PATON
Lacramioara MAZILU
Alvaro FERNANDES
Original Assignee
Data Value Factory Limited
Application filed by Data Value Factory Limited filed Critical Data Value Factory Limited
Publication of WO2020008180A1 publication Critical patent/WO2020008180A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/25 Integrating or interfacing systems involving database management systems
    • G06F 16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/21 Design, administration or maintenance of databases
    • G06F 16/211 Schema design and management

Definitions

  • the invention relates to a method and system for integrating data sets, for example by populating a target database with information from a plurality of sources.
  • Algorithms exist for generating mappings for populating a target database schema from source schemas. Such algorithms may be informed by different information, such as schema matches, integrity constraints or instance data.
  • the Clio project as described in US7010539 and as reviewed, for example, in "Clio: Schema mapping creation and data exchange" by Fagin et al., published in Conceptual Modeling: Foundations and Applications, volume 5600 of LNCS, pages 198-236. This project has given rise to comprehensive results on subjects such as data exchange, merging of mappings, debugging of mappings and mapping verification.
  • mapping generation can benefit from precise and exhaustive descriptions of relationships within the source schemas, as well as human-curated matches between the source and the target.
  • the goal is to provide tool support that helps experts curate matches and develop mappings between (potentially complex) source and target schemas for example as described in “Information integration in the enterprise” by Bernstein et al published in CACM, 51 (9): 72-79, 2008.
  • Such diverse relationships give rise to a large space of candidate mappings, which are likely of variable utility. Where there can be arbitrary numbers of source schemas, there tends to be less reliable or comprehensive information with which to inform mapping generation, and the focus has tended to shift toward managing the resulting uncertainty that arises from many alternative mappings.
  • User feedback may be used to assess whether mappings are producing appropriate values. For example, given an initial set of mappings between individual sources and a target that could have been produced by Clio, "Incrementally improving dataspaces based on user feedback" by Belhajjame et al., published in Inf. Syst., 38(5):656-687, 2013, explores how these mappings can be combined, for example by union, intersection or difference, to produce new mappings that may have higher precision or recall than their predecessors with respect to user feedback on the correctness of result tuples.
  • the meta-data about the target schema may include tuples annotated to the effect that they are true positives, false positives or false negatives, and a search seeks to identify sets of mappings that together maximise precision subject to some recall threshold (or vice versa).
  • the user provides additional information about the meta-data for the target schema in the form of example values, and subsequently feedback on results of generated mappings. As feedback is obtained on the correctness or otherwise of result tuples, this is used to refine the scores of candidate mappings.
  • mappings are not assumed to be correct, but rather are scored by users based on the suitability of their results.
  • mappings for a probabilistic mediated schema.
  • there is no pre-existing target schema and the technical focus is on aligning source attributes with counterparts in a mediated schema, and not so much on the generation of mappings that may combine data from different sources.
  • mappings tend not to be expressive.
  • The approach described in that work focuses the integration effort at the instance level, through entity resolution and fusion, with ongoing human input, for example in the form of training data.
  • a data integration method for populating a target database with data from a plurality of source databases, with each source relation in a source database having a plurality of source attributes, the method comprising: inputting a target schema for the target database, wherein the target schema comprises a plurality of target attributes; generating a set of mappings which populates the target database with data from at least a subset of the plurality of source relations by defining a plurality of sub-solutions, wherein for each sub-solution a different number of the source relations are merged and each sub-solution comprises a set of intermediate mappings to map data from at least a subset of the source relations to the target schema using source attributes of each source relation which match target attributes in the target schema; and iteratively solving each of the plurality of sub-solutions to generate, at each iteration, a set of intermediate mappings which maps more source relations to the target schema than a set of intermediate mappings generated in the previous iteration, by merging pairs of mappings from each set of intermediate mappings generated in the previous iterations.
  • In mapping selection, the preferred mappings are selected based on various features such as size of output, fitness of result, etc. Examples of such techniques can be found in the work of Edward Abel et al. in "User driven multi-criteria source selection", published in Information Sciences, Volumes 430-431, 2018, pages 179-199, and Khalid Belhajjame et al., "Feedback-based annotation, selection and refinement of schema mappings for dataspaces", 13th International Conference on Extending Database Technology, ACM, 2010: 573-584.
  • the next step may comprise using the base sub-solution to build a first set of intermediate mappings, e.g. by selectively merging pairs of mappings from the base set of intermediate mappings.
  • the next step may then comprise using both the base set of intermediate mappings and the first set of intermediate mappings to build a second set of intermediate mappings, for example by selectively merging pairs of mappings from the base set with the first set.
  • the final set of mappings which is generated by the iterative process may thus comprise a selected set of intermediate mappings which were generated in previous steps and may comprise each of the mappings generated in each iteration.
  • the final set of mappings may thus be a set of best-effort mappings which might include an overall solution which merges all input source relations but more typically comprises a subset of the source relations in light of overlap between the source relations.
  • the method may comprise identifying a plurality of candidate keys within the plurality of sources, wherein each candidate key is a single or a plurality of source attributes.
  • the method may further comprise calculating inclusion dependencies for the sources, wherein an inclusion dependency is a measure of the overlap between the values of two sets of source attributes.
  • the candidate keys and/or the inclusion dependencies may be inputs to the method.
  • the candidate keys and the inclusion dependencies may be considered to be profiling data and/or metadata for the sources.
  • Each candidate key may be defined as a set of source attributes which has unique values within that source relation, i.e. a candidate key for a relation in the form of a table with rows and columns may be a column (or a combination thereof) that has unique values in the relation.
  • An inclusion dependency may be partial, i.e., there is some overlap between the values of a set of source attributes and the values of another set of attributes, or total, i.e., the values of one set are fully contained by the values of another set of attributes.
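The two kinds of profiling metadata described above (single-column candidate keys and partial inclusion dependencies) can be illustrated with a minimal Python sketch; this is not the patent's implementation, and the helper names and table contents are hypothetical, loosely following the Figure 1b example:

```python
def candidate_keys(table):
    """Single-column candidate keys: columns whose values are all unique."""
    return [col for col, values in table.items()
            if len(set(values)) == len(values)]

def inclusion_dependency(dependent, referenced):
    """Fraction of the dependent column's distinct values found in the
    referenced column: 1.0 is a total IND, anything in (0, 1) is partial."""
    dep, ref = set(dependent), set(referenced)
    return len(dep & ref) / len(dep) if dep else 0.0

# Hypothetical instance data in the spirit of Figure 1b.
MA = {"postcode": ["M1 5BD", "M1 5EB", "M4 4BF", "M14 5SZ"],
      "price": [200, 200, 250, 300]}
UKD = {"postcode": ["M1 5BD", "M1 5EB", "CB1 2AB", "CB2 8PH"]}

print(candidate_keys(MA))                                     # ['postcode']
print(inclusion_dependency(UKD["postcode"], MA["postcode"]))  # 0.5
```

The 0.5 result mirrors the worked example later in the text, where half of the distinct postcodes in UKD also occur in MA.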
  • Selectively merging pairs of mappings may comprise determining whether the pair of mappings are to be merged using at least one of the source attributes, the candidate keys and the inclusion dependencies. For example, if there is no possibility of inferring a relationship between the source attributes within the pair of mappings, no pairwise merging is possible. In other words, not all intermediate mappings within a set may be paired with intermediate mappings in another set. However, when it is determined that the pair of mappings are to be merged, pairwise merging may comprise selecting a merging which is one of a union, a join or a full outer join of the sources within the pair of mappings. Union, join, and full outer join are standard relational operators that can be applied between two relations to merge them, and they are defined as in "Database Systems: The Complete Book, 2nd Edition" by H. Garcia-Molina et al., Pearson, 2014.
  • Selectively merging pairs of mappings may comprise determining whether a first mapping within the pair of mappings is subsumed by a second mapping in the pair of mappings, i.e. by applying a subsumption method.
  • by subsumed it is meant that the values of one mapping are completely contained within the values of the other mapping.
  • the method may further comprise discarding the first mapping to prevent merging when it is determined to be subsumed by the second mapping. Again, where appropriate, this reduces the number of mappings which are to be merged.
  • the subsumption method may comprise identifying a first set of attributes which are the attributes in the first mapping which match the target attributes and identifying a second set of attributes which are the attributes in the second mapping which match the same target attributes.
  • the method may further comprise selecting the inclusion dependency between the first and second set of attributes having the highest value; and comparing the selected inclusion dependency to a threshold value, e.g., one. This comparison can be used to determine whether or not a first mapping is subsumed with respect to the target by a second mapping or vice versa. When the inclusion dependency equals the threshold value, this means that one mapping is subsumed by the other mapping.
  • the selected inclusion dependency may be between candidate keys of the first and second mappings.
  • the method may further comprise determining whether the first set of attributes match the same target attributes as the second set of attributes, and when there is a match, selectively merging using a union. Alternatively, when they do not match exactly the same target attributes, selectively merging using a full outer join when the selected inclusion dependency between the candidate keys of the two mappings is less than the threshold value, and selectively merging using a join when the selected inclusion dependency equals the threshold value, i.e. when a foreign key relationship can be inferred between the two mappings. In other words, the nature of the selective merging may be dependent on whether or not the first and second sets of attributes match the same or different target attributes.
  • the method may comprise identifying a first set of candidate keys which are the candidate keys in the first mapping which match a subset of target attributes, identifying a second set of candidate keys which are the candidate keys in the second mapping which match a subset of target attributes, determining whether the first set of candidate keys match the same target attributes as the second set of candidate keys, and if there is a match, selectively merging using a full outer join. If there is no match, it may be determined that there is no possible or useful pairwise merging in such a case.
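The merge rules set out above can be condensed into a small decision function. This is an illustrative sketch, not the patent's implementation; it assumes the matched target attribute sets and the best candidate-key overlap have already been computed, and the function name is hypothetical:

```python
def choose_merge(tm1, tm2, key_overlap):
    """tm1, tm2: sets of target attributes matched by each mapping.
    key_overlap: highest inclusion-dependency overlap between the two
    mappings' candidate keys (0.0 if none was found)."""
    if tm1 == tm2:
        return "union"            # same target attributes: union the tuples
    if key_overlap == 1.0:
        return "join"             # total IND: foreign-key relationship inferred
    if key_overlap > 0.0:
        return "full outer join"  # partial IND: correlate what overlaps, keep all
    return None                   # no useful way to relate the mappings

print(choose_merge({"postcode", "price"}, {"postcode", "price"}, 0.0))  # union
print(choose_merge({"postcode"}, {"crime rank"}, 0.5))  # full outer join
```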
  • Generating each set of intermediate mappings may further comprise calculating a fitness value for each intermediate mapping.
  • the fitness value may be calculated using an equation of the form fitness(m) = (matches(m) / arity(t)) × ((T(m) × matches(m) − nulls(m)) / (T(m) × matches(m))), where m is the intermediate mapping, t is the target relation, arity(t) is the total number of attributes in target t, matches(m) is the number of matched attributes that mapping m has w.r.t. target t, nulls(m) is the estimated number of nulls that mapping m will generate on the values of the matched attributes, and T(m) is the estimated size of mapping m, i.e., the estimated number of tuples that will result after running mapping m.
  • the fitness value may be calculated using an estimate of the number of nulls that the intermediate mapping will generate for the target attributes in the target schema which are matched to source attributes in the sources within the intermediate mapping.
  • the method may further comprise comparing calculated fitness values for all intermediate mappings based on the same sources and retaining only the intermediate mapping having the highest fitness value. This reduces the number of intermediate mappings which need to be stored at each iteration.
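A fitness function of the general form described above (rewarding coverage of the target's attributes and penalising expected nulls) might be sketched as follows; the precise combination of terms is an assumption, since the original equation is not reproduced in this text:

```python
def fitness(matches, arity, nulls, tuples):
    """matches: matched target attributes; arity: total attributes in the
    target; nulls: estimated nulls over the matched attributes; tuples:
    estimated result size T(m). Assumed form: coverage of the target scaled
    by the fraction of matched-attribute cells that are non-null."""
    cells = tuples * matches           # cells in the matched attributes
    coverage = matches / arity
    non_null = (cells - nulls) / cells if cells else 0.0
    return coverage * non_null

print(fitness(matches=3, arity=5, nulls=0, tuples=4))  # 0.6
print(fitness(matches=3, arity=5, nulls=6, tuples=4))  # 0.3
```

Under this form, a mapping matching all target attributes with no nulls scores 1.0, and extra nulls strictly lower the score, which is all the retention step below requires.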
  • Figure 1a is a flowchart showing the overall steps of the method;
  • Figure 1b is an example showing a target schema and some input schemas;
  • Figure 2 is a flowchart showing the detail of the step of generating mappings in the method of Figure 1a;
  • Figures 3a and 3b are flowcharts showing the detail of the step of merging mappings in the method of Figure 2;
  • Figure 4 is a flowchart showing the detail of the step of choosing an operator in the method of Figure 3a;
  • Figure 5a is a flowchart showing the detail of a step in the method of Figure 4;
  • Figures 5b and 5c are flowcharts showing the detail of another step in the method of Figure 4.
  • Figure 6 is a flowchart showing the detail of a step which is common to the methods of Figures 5a and 5b; and
  • Figure 7 is a flowchart showing the detail of the step of determining whether a mapping is fittest in the method of Figure 3b.
  • Figure 8 is a schematic block diagram of a system for implementing the described method.
  • Figure 1a illustrates the overall method for integrating data sets, for example by populating a target database with information from a plurality of sources.
  • the target schema t is read and, in step S102, the input schemas with their relations are read.
  • Each input relation has one or more attributes. It will be appreciated that although these steps are shown sequentially, some of the steps could be carried out in different orders.
  • the target schema 20 consists of a single relation labelled "MA_CA_statistics" (which is illustrated as a table in this example) and contains a plurality of attributes: postcode, price, street, income rank and crime rank, which are then to be populated with data.
  • a relation may be considered to be a set of tuples (d1, d2, ..., dn), where each element di is a member of Di, a data domain.
  • the input schema includes: Manchester Agency (MA) 30 which has the following attributes: city, street, postcode and price; Cambridge Agency (CA) 32 which has the following attributes: postcode, city, street, price and county; UK Deprivation (UKD) 34 which has the following attributes: crime rank, crimedecile, county and postcode; and UK quality of life statistics (UKQ) 36 which has the following attributes: county, income rank, economic activity rank and health rank.
  • MA Manchester Agency
  • CA Cambridge Agency
  • UKQ UK quality of life statistics
  • the next step S104 is to create an empty set of sub-solutions for mapping the data in the source schemas into the target schema.
  • Mapping the data is a complex or compound problem because it involves finding mappings involving multiple input relations with attributes that match the same target relation.
  • Each member of the set of sub-solutions represents a method for dividing the complex problem of mapping the data into a collection of simpler subproblems for which it is easier to find a solution.
  • the sub-problems involve pairs of subsets of the initial input relations. The results from each pair of sub-problems are then merged to provide a solution to the complex problem.
  • Steps S106 to S110 show some pre-processing steps which generate profiling data for each source before mapping generation begins.
  • In step S106, the matches between the attributes of the source and target schemas are identified and read. These matches can be thought of as metadata.
  • the candidate keys are identified and read.
  • a candidate key for a relation is a column (or a combination thereof) that has unique values in the relation. For example, in Figure 1b, the candidate keys are identified using the [ck] attribute annotation.
  • the inclusion dependencies are identified and read. The candidate keys and inclusion dependencies can be thought of as metadata on the source.
  • V(R) is the number of distinct values on R.
  • the partial inclusion dependencies: postcode in UKD and county in UKQ are shown.
  • the overlap is the fraction of the values in the dependent attribute that are found in the referenced attribute: the direction of the arrow is from the dependent attribute to the referenced attribute.
  • the overlap from UKD to the source MA with the same attribute is 0.5 (i.e., half of the distinct values within the column overlap: M1 5BD and M1 5EB).
  • the overlap is the same in the opposite direction because each column has four different values.
  • the overlap between the postcode columns in UKD and CA differs because CA has two repeated entries.
  • the profile data may also include statistics, such as the cardinality of each relation, the number of distinct values for each column, and the number of nulls for each column, as required for a fitness function that chooses between candidate mappings.
  • Access is provided to profiling data over the input sources, which could be produced using a tool such as Metanome ("Data profiling with Metanome" by Papenbrock et al., published in Proc. VLDB Endow., 8(12):1860-1863, August 2015) or SINDY ("Scaling out the discovery of inclusion dependencies" by Kruse et al., published in Datenbanksysteme für Business, Technologie und Web (BTW), 2015).
  • the problem can be defined as: given a collection of source schemas S, each source schema with many relations (i.e., tables 30 to 36 in Figure 1b), a target relation T (i.e., table 20 in Figure 1b), metadata on the sources obtained through data profiling (including candidate keys and (partial) inclusion dependencies), and metadata in the form of matches between the source relations and the target, generate a set of candidate mappings to populate the target relation with instance data from the source schemas.
  • In step S112, a set of initial mappings is created in which, for each source relation, a 1-to-1 mapping is created, i.e., each 1-to-1 mapping maps only one source relation to the target relation, without merging it with other source relations.
  • This step is considered to be the creation of the base sub-solutions for the mapping problem, i.e., if the input relations cannot be merged, then the algorithm will be able to output the base solution.
  • Each element in initMaps is a sub-solution for each of the source relations that have at least one match with the target relation.
  • Each sub-solution is a solution to a sub-problem which is trying to find a mapping for fewer relations than in the initial input. Each sub-problem thus represents an iteration in the mapping generation process.
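The creation of the base 1-to-1 sub-solutions (step S112) can be sketched as follows; the mapping representation and names are illustrative assumptions, not the patent's data structures:

```python
def init_maps(matches):
    """One 1-to-1 mapping per source relation with at least one match.
    matches: {source relation name: set of matched target attributes}."""
    return [{"sources": frozenset([name]), "targets": tgt}
            for name, tgt in matches.items() if tgt]

maps = init_maps({"MA": {"postcode", "price", "street"},
                  "UKQ": {"income rank"},
                  "Noise": set()})   # no matches with the target: excluded
print(len(maps))  # 2
```

Only relations with at least one match contribute a base sub-solution, mirroring the statement that each element of initMaps corresponds to a source relation with at least one match to the target.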
  • the method determines the number of the last iteration, i.e. finaliterationidx, which is equal to the number of input relations.
  • the determined finaliterationidx is an input to the recursive algorithm which is used to generate the mappings - genMaps - step S118.
  • the output of this algorithm consists of all the subsolutions which are generated and retained in all iterations of the algorithm.
  • these sub-solution mappings can be outputted (step S120) and the method ends.
  • a selection of these intermediate mappings may also be output.
  • the final set of mappings comprises at least some of the mappings from the sets of intermediate mappings generated at each iteration.
  • the table below shows how the inputs in Figure 1b could be combined to create two complete target rows. It will be appreciated that this is not a complete output of a mapping and is merely provided for illustration.
  • the first three data entries may be from the MA table and the other two data entries from the UKD and UKQ tables respectively.
  • the first three data entries may be from the CA table and the other two data entries from the UKD and UKQ tables respectively.
  • the candidate key of the postcode may be used to merge the MA/CA and UKD tables and the candidate key of county to merge the UKD and UKQ tables, as explained in more detail below.
  • Figure 2 shows the detail of the recursive algorithm named“genMaps” which is the top-level algorithm that uses dynamic programming to divide the complex problem into a series of subproblems.
  • Dynamic programming is an approach to solving a complex problem by dividing it into a series of subproblems.
  • the algorithm will try to break down the complex problem, i.e., finding a mapping for all initial sources (finaliterationidx), into smaller sub-problems, i.e., finding mappings for fewer initial relations by recursively calling genMaps with a parameter idx < finaliterationidx.
  • When genMaps for iteration 1 (genMaps(1)) is called, it means that the base solution has been reached, and the initial mappings that were previously set as the sub-solution for iteration 1 will be returned. By trying to build the mapping for the final iteration, intermediate sub-solution mappings need to be created and used in subsequent iterations.
  • Given N initial source relations, in each iteration idx, 1 ≤ idx ≤ N, the algorithm searches for the best way to merge any idx source relations.
  • the mappings that result from combining each subset of source relations characterize new relations that are referred to as intermediate mappings, the collection of which comprises the solution at iteration idx.
  • the flowchart can also be expressed as an algorithm as follows:
  • In step S200, the input to the algorithm is idx (the iteration index) and the output collection of mappings is termed "resultMaps".
  • the set of mappings is empty (step S202).
  • mappings which merge all or subsets of the initial source relations can be obtained from the collection of sub-solutions.
  • the schema of the output mappings is the same as the schema of the target relation, though some attributes may contain only null values.
  • m2_2 ← merge(UKQ, UKD)
  • m2_3 ← merge(CA, UKD)
  • the algorithm will try to merge (m2_1, m2_2), (m2_2, m2_3) and (m2_1, m2_3). It can be observed that by merging m2_1 with m2_2, a mapping that incorporates all initial sources will be obtained.
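The recursive, memoised structure of genMaps can be sketched as follows. This is a simplified illustration under stated assumptions: mappings are reduced to sets of source relation names, and the toy merge function stands in for the mergeMaps algorithm described below:

```python
def gen_maps(idx, sub_solutions, merge_maps):
    """Build the sub-solution for iteration idx by merging pairs of
    sub-solutions from earlier iterations i and idx - i."""
    if idx in sub_solutions:            # base case: iteration 1 is pre-seeded
        return sub_solutions[idx]
    result = []
    for i in range(1, idx // 2 + 1):
        batch1 = gen_maps(i, sub_solutions, merge_maps)
        batch2 = gen_maps(idx - i, sub_solutions, merge_maps)
        result.extend(merge_maps(batch1, batch2))
    sub_solutions[idx] = result         # memoise (dynamic programming)
    return result

# Toy merge: mappings are just sets of source relations, mergeable when disjoint.
def toy_merge(batch1, batch2):
    return [m1 | m2 for m1 in batch1 for m2 in batch2 if not (m1 & m2)]

base = {1: [frozenset("A"), frozenset("B"), frozenset("C")]}
print(frozenset("ABC") in gen_maps(3, base, toy_merge))  # True
```

Because each iteration's result is stored in sub_solutions, every sub-problem is solved only once, which is the dynamic-programming property the text describes.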
  • Figures 3a and 3b show the detail of the algorithm named "mergeMaps" which implements the steps that take place within an iteration of the main loop of genMaps shown in Figure 2.
  • the detailed algorithm can also be expressed as follows:
  • the inputs to the mergeMaps algorithm are batch1 and batch2 and the output is iMaps (step S300).
  • the initial step is that iMaps is empty (step S302).
  • i is set to 0 to perform the iterative algorithm as explained below.
  • A determination is made as to whether i is smaller than the size of batch1 (step S306); if it is not, iMaps is output (step S309). If it is, j is set to 0 (step S308) and a determination as to whether j is smaller than the size of batch2 is made (step S310).
  • Step S311 represents the creation of the pairs of mappings from the two batches by pairing each mapping i from batch1 with all mappings j from batch2.
  • MERGEMAPPINGS uses another algorithm, chooseOperator (line 5 and step S312), to choose an operator with which to combine map_i and map_j, from batch1 and batch2, respectively, if they can usefully be merged (in Figure 3a, these maps are labelled batch1[i] and batch2[j]).
  • map_i and map_j represent pairs of mappings from batches of mappings from two previous iterations.
  • the chooseOperator algorithm is described in more detail below. If the mappings cannot be usefully combined, i.e., if at step S314 the output is null, j is incremented by 1 and steps S310, S312 and S314 are repeated.
  • If the mappings can be usefully combined, the new mapping is generated (line 7 and step S318) and the metadata for the resulting intermediate mapping is computed (line 8 and step S320, beginning on Figure 3b).
  • the intermediate mapping is retained if it has the highest fitness of any mapping involving the same source relations (lines 9-10 and steps S322 to S326).
  • the isFittest function selects the mapping with the highest fitness and is explained in more detail below. It can be defined in different ways using table statistics and profiling data, for example by preferring candidate mappings with fewer nulls.
  • In step S324, if the new mapping satisfies the isFittest function, the old mappings that stem from the same initial relations as iMap are removed (step S326) and the new iMap is added to the set of retained iMaps (step S328). However, if the new mapping does not satisfy the conditions in the isFittest function, the mapping is not retained, j is incremented by 1, and steps S310, S312 and S314 are repeated. Both loops terminate when all pairs of mappings are created and merged.
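The mergeMaps loop (pairing mappings from two batches, choosing an operator, and retaining only the fittest mapping per set of source relations) can be sketched as follows; the mapping representation and the toy chooseOperator and fitness functions passed in are illustrative assumptions:

```python
def merge_maps(batch1, batch2, choose_operator, fitness):
    """Pair every mapping in batch1 with every mapping in batch2; keep, for
    each set of source relations, only the fittest merged mapping."""
    imaps = {}                  # keyed by the set of source relations involved
    for m1 in batch1:
        for m2 in batch2:
            op = choose_operator(m1, m2)
            if op is None:                       # no useful merge: skip the pair
                continue
            new = {"sources": m1["sources"] | m2["sources"], "op": op}
            key = frozenset(new["sources"])
            if key not in imaps or fitness(new) > fitness(imaps[key]):
                imaps[key] = new                 # isFittest: replace weaker rival
    return list(imaps.values())

b1 = [{"sources": {"MA"}, "op": None}]
b2 = [{"sources": {"UKD"}, "op": None}, {"sources": {"MA"}, "op": None}]
out = merge_maps(
    b1, b2,
    choose_operator=lambda a, b: "join" if a["sources"] != b["sources"] else None,
    fitness=lambda m: len(m["sources"]))
print(len(out), sorted(out[0]["sources"]))  # 1 ['MA', 'UKD']
```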
  • Figure 4 shows the detail of the algorithm named "chooseOperator" which is used in the mergeMaps algorithm shown in Figure 3a to combine two intermediate mappings taking into account how these relate to the given target table.
  • the detailed algorithm can be expressed as follows:
  • function chooseOperator(map1, map2)
  • the inputs to the algorithm are iMap1 and iMap2 (i.e., 2 mappings) and the output is the op (operator) as set out in step S400.
  • the operator is set to NULL value.
  • the sets of matched target attributes are retrieved for each mapping w.r.t. the target (steps S404 and S406, lines 3 to 4 above).
  • the input in step S404 is thus the matches of the mapping iMap1 with the attributes of the target, and similarly the input in step S406 is the matches of the mapping iMap2 with the attributes of the target.
  • the outputs are labelled tm1 (map1_ma) and tm2 (map2_ma) respectively, representing the two sets of matched target attributes.
  • the two outputs are then compared (step S408) and if they do not match, i.e., if the two mappings match different attributes within the target schema, they are considered candidates for joining by the algorithm GenOperatorDiff (Step S410 and line 10). Alternatively, if the outputs match and thus the two mappings match the same attributes within the target schema, they are considered candidates for unioning by the algorithm GenOperatorSame (Step S412 and line 7).
  • the matches for the MA relations are thus postcode, price and street, and CA also has the same matches so these are candidates for GenOperatorSame.
  • UKD has different matches, namely postcode and crime_rank to those of MA, and thus MA and UKD are candidates for joining by the algorithm GenOperatorDiff.
  • chooseOperator will either be an operator that represents the way to merge two input mappings or NULL if no useful way could be found to merge them with respect to the target relation. If the algorithm merges the two mappings, then the resulting mapping that is generated will be the result of union, join or full outer join. Another possible action is to discard subsumed mappings; a relation A is subsumed by a relation B if A ⊆ B, so the subsumed relation need not be considered in subsequent merges.
  • Figure 5a shows the detail of the algorithm named“genOperatorSame” which is used in the chooseOperator algorithm shown in Figure 4 to decide which operator to apply between two source relations when they match identical target attributes.
  • the detailed algorithm can also be expressed as follows:
  • the inputs to the algorithm are iMap1 and iMap2 (i.e., 2 mappings) and the output is the op (operator) as set out in step S500.
  • the operator is set to NULL value.
  • An algorithm entitled“isSubsumed” (line 2, step S504, the algorithm is described in more detail in Figure 6) tests whether the values for the matched attributes from the first relation are fully contained in the attributes in the second relation, i.e., are subsumed.
  • the inputs are also iMap1 and iMap2 and the output is the subsumed mapping out of the two given as input (subsumedMap).
  • the output subsumedMap will not be null and no merge will take place because one relation is subsumed by the other relation and thus the subsumed relation can be safely discarded (lines 4 and 5 and step S510).
  • the chosen operator is the union of the two projected mappings on the attributes that are needed in the target and that both mappings match. In other words, the union of iMapl (also expressed as map1) and iMap2 (also expressed as map2) is returned.
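The genOperatorSame logic (discard a subsumed mapping, otherwise union) can be sketched as follows; the max_ind helper is an assumed stand-in for the isSubsumed test based on inclusion-dependency overlaps between the two mappings' matched attributes:

```python
def gen_operator_same(map1, map2, max_ind):
    """For mappings that match the same target attributes.
    max_ind(a, b): highest inclusion-dependency overlap from a's matched
    attributes into b's; 1.0 means a's values are fully contained in b's."""
    if max_ind(map1, map2) == 1.0:
        return None, map1            # map1 subsumed by map2: discard it, no merge
    if max_ind(map2, map1) == 1.0:
        return None, map2            # map2 subsumed by map1
    return "union", None             # otherwise union the two mappings

op, discarded = gen_operator_same("CA", "MA", max_ind=lambda a, b: 0.5)
print(op, discarded)  # union None
```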
  • Figures 5b and 5c show the detail of the algorithm named "genOperatorDiff" which is used in the chooseOperator algorithm shown in Figure 4 to decide which operator to apply between mappings that do not match the same target attributes, i.e., the sets of matched attributes can either be disjoint or have common attributes.
  • the detailed algorithm can also be expressed as follows:
  • map1_ind ← getMaxInd(map1_keys, map2.attributes)
  • map2_ind ← getMaxInd(map2_keys, map1.attributes)
  • the inputs and output of the genOperatorDiff algorithm are the same as those to the genOperatorSame algorithm of Figure 5a.
  • steps S552, S554, S556 and S560 correspond to steps S502, S504, S506 and S510 of Figure 5a.
  • the subsumed relation is discarded and the method returns null as no operator was found that yields useful information. If the subsumption conditions are not met, the algorithm tries to merge the input relations.
  • the candidate keys, i.e., iMap1.ckeys and iMap2.ckeys (also expressed as map1_keys and map2_keys), are retrieved from both mappings.
  • the algorithm chooses the IND with the highest overlap (step S562 and line 10), i.e., the maximum is chosen (getMaxIND).
  • the algorithm checks whether an inclusion dependency has been found (line 11 and step S564).
  • the level of overlap is checked, i.e., compared to 1 (step S566, line 12). If the overlap is 1 , the chosen operator is a join because a foreign key relationship has been inferred between the two input mappings on their candidate key attributes (step S568 and line 13). Alternatively, if the overlap is between 0 and 1 (but not equal to 0 or 1), the chosen operator is a full outer join because a foreign key relationship could not be inferred, but some tuples can be correlated due to the overlap between the candidate keys.
  • the algorithm merges the two mappings by applying a full outer join, thus joining the tuples that can be joined while keeping all the data (step S570 and line 15).
  • the join condition will compare the candidate key attributes that are involved in the chosen inclusion dependency, IND.
  • in step S572, the inclusion dependencies (IND) for all candidate keys in the first source and the attributes in the second source are identified and the IND with the highest overlap is chosen, i.e., the maximum is chosen (getMaxIND with an output ind1).
  • in step S574, the inclusion dependencies (IND) for all candidate keys in the second source and the attributes in the first source are identified and the IND with the highest overlap is chosen, i.e., the maximum is chosen (getMaxIND with an output ind2).
  • the outputs from steps S572 and S574 are compared to choose the overall maximum at step S576 (line 19).
  • in step S578, there may be no inclusion dependencies between key and non-key attributes and thus the step of checking whether an inclusion dependency has been found is repeated (line 20). If there is an inclusion dependency, steps S566 to S570 of Figure 5c are repeated to check the overlap. If the overlap is 1, a join is used as before (step S568 and line 22). Alternatively, as illustrated in the example, the value of the partial inclusion dependency is 0.5 and thus the result of the merge will be a mapping: OuterJoin(MA, UKD, <MA.postcode, UKD.postcode>) (step S570 and line 24).
  • the algorithm retrieves the candidate keys from both relations that match the target attributes.
  • the candidate keys from the first source are retrieved (iMap1.ckeys or map1_keys) as the input and the ones which match the target attributes are output (tckeys1 or map1_mk) as set out in step S580 and line 26.
  • the candidate keys from the second source are retrieved (iMap2.ckeys or map2_keys) as the input and the ones which match the target attributes are output (tckeys2 or rel2_mk) as set out in step S582 and line 27.
  • These output keys are compared in step S584 (line 29) and if they match, the two relations are merged by applying the full outer join operation as shown in step S586 where the join is on the attributes that meet the above mentioned condition (line 30). Otherwise, no operator which can join the sources is output and the operator is output as null.
  • FIG. 6 shows the detail of the algorithm “isSubsumed” which is used as described above.
  • the inputs are iMap1 and iMap2 and the output is the subsumed mapping of the two given as input, i.e., subsumedMap will be equal to iMap1 or iMap2.
  • subsumedMap is set to NULL (step S602).
  • in steps S604 and S606, the attributes from each of the first and second mappings which match the target schema are identified as sa1 and sa2 respectively. It will be appreciated that although shown as two separate steps, these steps could be carried out simultaneously or in a different order.
  • the inclusion dependencies (IND) from the matched attributes for the first mapping to the matched attributes for the second mapping are identified and the IND with the highest overlap is chosen, i.e., the maximum is chosen (getMaxIND with an output ind).
  • in step S610, the level of overlap is next checked, i.e., compared to 1. If the overlap is 1, the subsumedMap is set as the first intermediate mapping (step S612). Alternatively, if the overlap is not 1, the inclusion dependencies (IND) from the matched attributes for the second mapping to the matched attributes for the first mapping are identified and the IND with the highest overlap is chosen (step S614). The level of overlap is checked again (step S616) and if the overlap is 1, the subsumedMap is set as the second intermediate mapping (step S618). Alternatively, if the overlap is not 1, the subsumedMap remains NULL and it is returned because neither mapping is subsumed by the other.
  • FIG. 7 shows the detail of the algorithm “IsFittest” which is used as described above.
  • Each intermediate mapping will contain values that are of different quality with respect to the chosen target relation, so, in order to prune the search space, the “IsFittest” algorithm eliminates the intermediate mappings that have poor quality in comparison with other intermediate mappings which stem from the same initial source relations.
  • the algorithm will always keep the best way of merging a subset of source relations by keeping only the intermediate mapping with the highest fitness result.
  • the fitness result will be computed after the new intermediate mapping is created.
  • the inputs are iMap, oldMaps and iterationIdx and the output is a Boolean as to whether or not the mapping “isFittest”. Initially, the output is set to FALSE (step S702).
  • in step S704, the function getMapsWithInitRels is applied; in other words, all the intermediate mappings which stem from the same initial source relations are retrieved.
  • the inputs are thus iMap and subsolutions[iterationIdx] and the output is memoizedMaps.
  • this output is added to the set of existing intermediate maps, i.e., oldMaps (step S706).
  • a fitness function for each generated mapping is then calculated for example using the function below:
  • mapping m is the intermediate mapping
  • t is the target relation
  • arity(t) is the total number of attributes in target t
  • matches(m) is the number of matched attributes that mapping m has w.r.t. target t
  • nulls(m) is the estimated number of nulls that mapping m will generate on the values of the matched attributes
  • T(m) is the estimated size of mapping m, i.e., the estimated number of tuples that will result after running mapping m.
  • the fitness function has as input a generated mapping and the output will represent the ratio of non-NULL cells, i.e., meaningful values, that will be considered to populate the target table t. It is noted that the calculated term (arity(t)-matches(m))*T(m) will add the necessary NULLs to populate the target t if mapping m does not match all target attributes.
  • mapping m is expected to produce an estimate of 62.5% non-NULL values for target relation t. This low value is mostly due to missing one match to the target which means that the result of mapping m needs a padding of 10 NULLs for the unpopulated target column.
  • once the fitness results have been calculated, they are compared at step S708. If the iMap fitness is greater than the oldMaps fitness, the output is TRUE (step S710) and the new map is retained. Alternatively, if the iMap fitness is less than the fitness of previously retained mappings, the output is FALSE (step S712) and the new intermediate mapping is not retained.
  • FIG. 8 is a schematic illustration of such a hardware system which can be used to implement the data integration method described above.
  • the system may comprise a computing device 10, which may be any suitable device, e.g. a user device or a server, connected to a plurality of databases 20a, 20b to 20n, each of which contains at least one data source 22a, 22b to 22m.
  • the number of databases and sources shown in the Figure is merely for illustration.
  • the plurality of data sources may be all stored on a single database, in one or more schemas.
  • some or all of the databases being integrated could be stored on the computing device itself.
  • the computing device 10 comprises the standard components such as a processor 12, user interface 16 and memory 14.
  • a mapping module 40 is also illustrated which stores and/or implements the algorithms for the data integration method outlined above.
  • the memory 14 may be used to store the intermediate data generated by implementing the method, e.g. any intermediate mappings, fitness scores, inclusion dependencies, metadata and so on.
  • Terms such as ‘component’, ‘module’ or ‘unit’ used herein may include, but are not limited to, a hardware device, such as circuitry in the form of discrete or integrated components, a Field Programmable Gate Array (FPGA) or Application Specific Integrated Circuit (ASIC), which performs certain tasks or provides the associated functionality.
  • the described elements may be configured to reside on a tangible, persistent, addressable storage medium and may be configured to execute on one or more processors.
  • These functional elements may in some embodiments include, by way of example, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.


Abstract

A method and a system for integrating data sets are disclosed. The present techniques relate to populating a target database with information from a plurality of sources. The problem of mapping the data has been divided into a collection of simpler sub-problems which involve merging pairs of subsets of the input sources. Iteratively solving the problem comprises solving a base sub-solution in which a minimum number of source relations are used, e.g. one, to create a base set of intermediate mappings. The base sub-solution may be used to build a first set of intermediate mappings. The base set of intermediate mappings and the first set of intermediate mappings may be used to build a second set of intermediate mappings. The final set of mappings which is generated by the iterative process may thus comprise a selected set of intermediate mappings which were generated in previous steps and may comprise each of the mappings generated in each iteration.

Description

Method and System for Integrating Data Sets
Technical field
[0001] The invention relates to a method and system for integrating data sets, for example by populating a target database with information from a plurality of sources.
Background
[0002] There exist known algorithms for generating mappings for populating a target database schema from source schemas. Such algorithms may be informed by different information, such as schema matches, integrity constraints or instance data. One such example is the Clio project as described in US7010539 and as reviewed for example in “Clio: Schema mapping creation and data exchange” by Fagin et al published in Conceptual Modeling: Foundations and Applications, volume 5600 of LNCS, pages 198-236. This project has given rise to comprehensive results on subjects such as data exchange, merging of mappings, debugging of mappings and mapping verification. Examples are discussed in “Foundations of Data Exchange” by Arenas et al published by Cambridge University Press in 2014; “Mapmerge: correlating independent schema mappings” by Alexe et al published in VLDB J., 21(2):191-211, 2012; “Debugging schema mappings with routes” by Chiticariu et al published in Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12-15, 2006, pages 79-90, 2006; and “Schema mapping verification: the spicy way” by Bonifati et al published in EDBT 2008, 11th International Conference on Extending Database Technology, Nantes, France, March 25-29, 2008, Proceedings, pages 85-96, 2008.
[0003] This line of research has tended to focus on supporting experts who are developing mappings between a single source schema and target schema, where the source and the target are well designed database schemas (for example, with declared keys and foreign keys). In practice, this tends to mean that mapping generation can benefit from precise and exhaustive descriptions of relationships within the source schemas, as well as human-curated matches between the source and the target. In such a setting, the goal is to provide tool support that helps experts curate matches and develop mappings between (potentially complex) source and target schemas for example as described in “Information integration in the enterprise” by Bernstein et al published in CACM, 51(9):72-79, 2008.
[0004] More recently, the development of techniques for web data extraction, the publication of extensive open data sets, and the adoption of data lakes means that organisations typically have access to numerous sources in a domain of interest, and a requirement to integrate the key data on a topic at manageable cost. In such a setting, the relevant data may come from many tables, from many independently developed data sources, and thus the relationships between these sources and the ways they can be used together are unlikely to be declared in the sources. Instead, relationships such as inclusion dependencies between sources may have to be inferred from data profiling for example as described in “Profiling relational data: a survey” by Abedjan et al as published in VLDB J., 24(4):557-581, 2015, and may be partial. Such diverse relationships give rise to a large space of candidate mappings, which are likely of variable utility. Where there can be arbitrary numbers of source schemas, there tends to be less reliable or comprehensive information with which to inform mapping generation, and the focus has tended to shift toward managing the resulting uncertainty that arises from many alternative mappings.
[0005] Some proposals use feedback on results to identify which of the generated mappings are producing appropriate values. For example, given an initial set of mappings between individual sources and a target that could have been produced by Clio, “Incrementally improving dataspaces based on user feedback” by Belhajjame et al published in Inf. Syst., 38(5):656-687, 2013 explores how these mappings can be combined, for example by union, intersection or difference, to produce new mappings that may have higher precision or recall than their predecessors with respect to user feedback on the correctness of result tuples. As such, in this proposal the meta-data about the target schema may include tuples annotated to the effect that they are true positives, false positives or false negatives, and a search seeks to identify sets of mappings that together maximise precision subject to some recall threshold (or vice versa). In “Learning to create data-integrating queries” by Talukdar et al published in PVLDB, 1(1):785-796, 2008, the user provides additional information about the meta-data for the target schema in the form of example values, and subsequently feedback on results of generated mappings. As feedback is obtained on the correctness or otherwise of result tuples, this is used to refine the scores of candidate mappings. As such, as in Belhajjame et al., mappings are not assumed to be correct, but rather are scored by users based on the suitability of their results.
[0006] Some other proposals have sought to fully automate mapping generation at scale, for example motivated by web scale integration problems. For example, “Bootstrapping pay-as-you-go data integration systems” by Sarma et al published in Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, pages 861-874, 2008 describes an approach to generating mappings for a probabilistic mediated schema. Here there is no pre-existing target schema, and the technical focus is on aligning source attributes with counterparts in a mediated schema, and not so much on the generation of mappings that may combine data from different sources. Also seeking to operate at web scale, “Schema clustering and retrieval for multi-domain pay-as-you-go data integration systems” by Mahmoud and Aboulnaga published in Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010, pages 411-422, 2010 clusters single table sources, and then maps keyword queries to the domains represented by the clusters. In such approaches, there is an attempt to provide some measure of integration with little additional information about the user’s requirements (e.g. the target schema may not be provided) and with little additional information about the sources (e.g. the meta-data about the source schema may be minimal). In such a setting there is little evidence with which to consider how to combine data from different sources, and thus mappings tend not to be expressive. [0007] There is some other work that seeks to combine data from multiple structured data sources.
For example, “Data curation at scale: The data tamer system” by Stonebraker et al published in CIDR 2013, Sixth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 6-9, 2013, Online Proceedings, 2013 targets the integration of significant numbers of enterprise data sets and is also described in WO2015/148304. In relation to mapping, the approach described in that work focuses the integration effort at the instance-level, through entity resolution and fusion, with ongoing human input, for example in the form of training data. The goals of the Data Civilizer project seem similar, although with a greater emphasis on discovery and cleaning as described in “A demo of the data civilizer system” by Fernandez et al published in Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pages 1639-1642, 2017. There is also research being carried out within the VADA project as described in “The VADA architecture for cost-effective data wrangling” by Konstantinou et al published in Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pages 1599-1602, 2017. This research has initially been using the paper by Bonifati above to generate mappings between small groups of sources and the target, which are then subject to mapping selection.
[0008] The applicant has recognised the need for an improved method and system which is able to explore the space of candidate mappings, making informed decisions as to which tables should be combined and how, based on the available matches and profiling data.
Summary
[0009] According to the present invention there is provided an apparatus and method as set forth in the appended claims. Other features of the invention will be apparent from the dependent claims, and the description which follows.
[0010] We describe a data integration method for populating a target database with data from a plurality of source databases with each source relation in a source database having a plurality of source attributes, the method comprising: inputting a target schema for the target database, wherein the target schema comprises a plurality of target attributes; generating a set of mappings which populates the target database with data from at least a subset of the plurality of source relations by defining a plurality of sub-solutions wherein for each sub-solution a different number of the source relations are merged and each sub-solution comprises a set of intermediate mappings to map data from at least a subset of the source relations to the target schema using source attributes of each source relation which match target attributes in the target schema; iteratively solving each of the plurality of sub-solutions to generate, at each iteration, a set of intermediate mappings which maps more source relations to the target schema than a set of intermediate mappings generated in the previous iteration by merging pairs of mappings from each set of intermediate mappings generated in the previous iteration, and outputting as the set of mappings at least some of the mappings within the sets of intermediate mappings which are generated in previous iterations. [0011] The output mappings can then be processed by a subsequent step such as mapping selection. Through mapping selection, the preferred mappings are selected based on various features such as size of output, fitness of result, etc. Examples of such techniques can be found in the work of Edward Abel et al. in “User driven multi-criteria source selection” published in Information Sciences, Volumes 430-431, 2018, Pages 179-199, and Khalid Belhajjame et al, “Feedback-based annotation, selection and refinement of schema mappings for dataspaces”, 13th International Conference on Extending Database Technology, ACM, 2010: 573-584.
[0012] In other words, the complex problem of mapping the data has been divided into a collection of simpler sub-problems which involve merging pairs of subsets of the input sources. Iteratively solving the complex problem comprises solving a base sub-solution in which a minimum number of source relations are used, e.g. one, to create a base set of intermediate mappings. The next step may comprise using the base sub-solution to build a first set of intermediate mappings, e.g. by selectively merging pairs of mappings from the base set of intermediate mappings. The next step may then comprise using both the base set of intermediate mappings and the first set of intermediate mappings to build a second set of intermediate mappings, for example by selectively merging pairs of mappings from the base set with the first set. The final set of mappings which is generated by the iterative process may thus comprise a selected set of intermediate mappings which were generated in previous steps and may comprise each of the mappings generated in each iteration. The final set of mappings may thus be a set of best-effort mappings which might include an overall solution which merges all input source relations but more typically comprises a subset of the source relations in light of overlap between the source relations.
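The divide-and-merge scheme described in this paragraph can be sketched as follows. This is a simplified illustration under our own assumptions: a mapping is reduced to the set of source relations it covers, and merge_maps is a stand-in for the real operator-selection logic (here it simply requires the two mappings to cover disjoint sources).

```python
# Illustrative sketch only: a "mapping" is the frozenset of source relations
# it covers; merge_maps stands in for the real operator-selection logic.

def merge_maps(map1, map2):
    """Toy merge: succeed only when the mappings cover disjoint source relations."""
    if map1 & map2:
        return None  # overlapping source relations: no useful pairwise merge
    return map1 | map2

def generate_mappings(sources):
    # subsolutions[k] holds the intermediate mappings that merge k source relations
    subsolutions = {1: {frozenset([s]) for s in sources}}  # base sub-solution
    for k in range(2, len(sources) + 1):
        level = set()
        # merge pairs of mappings drawn from the smaller sub-solutions i and k - i
        for i in range(1, k // 2 + 1):
            for m1 in subsolutions[i]:
                for m2 in subsolutions[k - i]:
                    merged = merge_maps(m1, m2)
                    if merged is not None:
                        level.add(merged)
        subsolutions[k] = level
    # output the intermediate mappings generated across all iterations
    return [m for maps in subsolutions.values() for m in maps]

maps = generate_mappings(["MA", "CA", "UKD", "UKQ"])
```

With the four example sources every non-empty subset survives here, because the toy merge never fails; the described method instead prunes via subsumption and fitness.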
[0013] The method may comprise identifying a plurality of candidate keys within the plurality of sources, wherein each candidate key is a single or a plurality of source attributes. The method may further comprise calculating inclusion dependencies for the sources, wherein an inclusion dependency is a measure of the overlap between the values of two sets of source attributes. The candidate keys and/or the inclusion dependencies may be inputs to the method. The candidate keys and the inclusion dependencies may be considered to be profiling data and/or metadata for the sources. Each candidate key may be defined as a set of source attributes which has unique values within that source relation, i.e. a candidate key for a relation in the form of a table with rows and columns may be a column (or a combination thereof) that has unique values in the relation. An inclusion dependency may be partial, i.e., there is some overlap between the values of a set of source attributes and the values of another set of attributes, or total, i.e., the values of one set are fully contained by the values of another set of attributes.
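By way of illustration, candidate keys can be detected by checking value uniqueness. The sketch below is our own (the function name is not from the specification) and finds the minimal column combinations whose value tuples are unique within a relation.

```python
# Hypothetical profiling sketch: a candidate key is a column, or minimal
# combination of columns, whose value tuples are unique in the relation.
from itertools import combinations

def candidate_keys(relation):
    """relation: dict mapping column name -> list of values (all equal length)."""
    cols = list(relation)
    n_rows = len(next(iter(relation.values())))
    keys = []
    for size in range(1, len(cols) + 1):
        for combo in combinations(cols, size):
            if any(set(key) <= set(combo) for key in keys):
                continue  # a subset is already a key, so combo is not minimal
            rows = list(zip(*(relation[c] for c in combo)))
            if len(set(rows)) == n_rows:  # all value tuples are unique
                keys.append(combo)
    return keys

ma = {"street": ["Oak Rd", "Elm St", "Oak Rd"],
      "postcode": ["M1 1AA", "M2 2BB", "M3 3CC"],
      "city": ["Manchester"] * 3}
assert candidate_keys(ma) == [("postcode",)]  # only postcode is unique
```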
[0014] Selectively merging pairs of mappings may comprise determining whether the pair of mappings are to be merged using at least one of the source attributes, the candidate keys and the inclusion dependencies. For example, if there is no possibility of inferring a relationship between the source attributes within the pair of mappings, no pairwise merging is possible. In other words, not all intermediate mappings within a set may be paired with intermediate mappings in another set. However, when it is determined that the pair of mappings are to be merged, pairwise merging may comprise selecting a merging which is one of a union, a join or a full outer join of the sources within the pair of mappings. Union, join, and full outer join are standard relational operators that can be applied between two relations to merge them, and they are defined as in “Database Systems: The Complete Book, 2nd Edition” by H. Garcia-Molina et al, Pearson, 2014.
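For illustration, the full outer join operator over relations held as lists of dictionaries can be sketched as follows; this is a minimal stand-in for the standard operator, not the claimed merge procedure itself.

```python
# Minimal full outer join over relations held as lists of dicts (a sketch of
# the standard relational operator).

def full_outer_join(rel1, rel2, key1, key2):
    result, matched2 = [], set()
    for t1 in rel1:
        joined = False
        for i, t2 in enumerate(rel2):
            if t1[key1] == t2[key2]:
                result.append({**t1, **t2})  # correlated tuples are combined
                matched2.add(i)
                joined = True
        if not joined:
            result.append(dict(t1))  # keep t1; missing attributes stay NULL
    for i, t2 in enumerate(rel2):
        if i not in matched2:
            result.append(dict(t2))  # keep unmatched tuples from rel2 too
    return result

ma = [{"postcode": "M1 1AA", "price": 250000}]
ukd = [{"postcode": "M1 1AA", "crime rank": 12},
       {"postcode": "CB1 1DD", "crime rank": 3}]
rows = full_outer_join(ma, ukd, "postcode", "postcode")
```

This mirrors the OuterJoin(MA, UKD) example used later in the description: the correlated postcode joins, while the unmatched Cambridge tuple is kept with no price.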
[0015] Selectively merging pairs of mappings may comprise determining whether a first mapping within the pair of mappings is subsumed by a second mapping in the pair of mappings, i.e. by applying a subsumption method. By subsumed, it is meant that the values of one mapping are completely contained within the values of the other mapping. The method may further comprise discarding the first mapping to prevent merging when it is determined to be subsumed by the second mapping. Again, where appropriate, this reduces the number of mappings which are to be merged.
[0016] The subsumption method may comprise identifying a first set of attributes which are the attributes in the first mapping which match the target attributes and identifying a second set of attributes which are the attributes in the second mapping which match the same target attributes. The method may further comprise selecting the inclusion dependency between the first and second set of attributes having the highest value; and comparing the selected inclusion dependency to a threshold value, e.g., one. This comparison can be used to determine whether or not a first mapping is subsumed with respect to the target by a second mapping or vice versa. When the inclusion dependency equals the threshold value, this means that one mapping is subsumed by the other mapping. The selected inclusion dependency may be between candidate keys of the first and second mappings.
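The subsumption test can be sketched as below, under the assumption that each mapping's values on the matched attributes are available as lists of tuples; the function names are ours, not from the specification.

```python
# Sketch of the subsumption test: a mapping is subsumed when its values on
# the matched attributes are fully contained (inclusion dependency overlap
# of 1) in the other mapping's values.

def overlap(vals_a, vals_b):
    a, b = set(vals_a), set(vals_b)
    return len(a & b) / len(a) if a else 0.0

def subsumed_map(map1_vals, map2_vals):
    """Return 1 or 2 for the subsumed mapping, or None if neither is subsumed."""
    if overlap(map1_vals, map2_vals) == 1.0:
        return 1  # every value of the first mapping appears in the second
    if overlap(map2_vals, map1_vals) == 1.0:
        return 2  # every value of the second mapping appears in the first
    return None

assert subsumed_map([("M1", "A")], [("M1", "A"), ("M2", "B")]) == 1
assert subsumed_map([("M1",), ("M3",)], [("M1",), ("M2",)]) is None
```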
[0017] The method may further comprise determining whether the first set of attributes match the same target attributes as the second set of attributes, and when there is a match, selectively merging using a union. Alternatively, when they do not match exactly the same target attributes, selectively merging using a full outer join when the selected inclusion dependency between the candidate keys of the two mappings is less than the threshold value, and selectively merging using a join when the selected inclusion dependency equals the threshold value, i.e. when a foreign key relationship can be inferred between the two mappings. In other words, the nature of the selective merging may be dependent on whether or not the first and second sets of attributes match the same or different target attributes.
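Taken together, these paragraphs suggest a decision rule along the following lines. This is a hedged sketch with our own names: attrs1/attrs2 are the target attributes each mapping matches, and key_overlap is the best inclusion-dependency overlap between their candidate keys.

```python
# Simplified sketch of the operator choice described above.

def choose_operator(attrs1, attrs2, key_overlap):
    if set(attrs1) == set(attrs2):
        return "union"            # same target attributes: union the mappings
    if key_overlap == 1.0:
        return "join"             # a foreign-key relationship can be inferred
    if key_overlap > 0.0:
        return "full outer join"  # partial overlap: correlate what we can
    return None                   # no operator yields useful information

assert choose_operator(["postcode", "price"], ["postcode", "price"], 0.0) == "union"
assert choose_operator(["postcode"], ["crime rank"], 1.0) == "join"
assert choose_operator(["postcode"], ["crime rank"], 0.5) == "full outer join"
assert choose_operator(["postcode"], ["crime rank"], 0.0) is None
```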
[0018] There may be occasions when it is not possible to identify any inclusion dependencies and in this case, the method may comprise identifying a first set of candidate keys which are the candidate keys in the first mapping which match a subset of target attributes, identifying a second set of candidate keys which are the candidate keys in the second mapping which match a subset of target attributes, determining whether the first set of candidate keys match the same target attributes as the second set of candidate keys, and if there is a match, selectively merging using a full outer join. If there is no match, it may be determined that there is no possible or useful pairwise merging in such a case. [0019] Generating each set of intermediate mappings may further comprise calculating a fitness value for each intermediate mapping. For example, the fitness value may be calculated using the equation:
fitness(m, t) = 1 - (nulls(m) + (arity(t) - matches(m)) * T(m)) / (arity(t) * T(m))
where m is the intermediate mapping, t is the target relation, arity(t) is the total number of attributes in target t, matches(m) is the number of matched attributes that mapping m has w.r.t. target t, nulls(m) is the estimated number of nulls that mapping m will generate on the values of the matched attributes and T(m) is the estimated size of mapping m, i.e., the estimated number of tuples that will result after running mapping m.
[0020] In other words, the fitness value may be calculated using an estimate of the number of nulls that the intermediate mapping will generate for the target attributes in the target schema which are matched to source attributes in the sources within the intermediate mapping. The method may further comprise comparing calculated fitness values for all intermediate mappings based on the same sources and retaining only the intermediate mapping having the highest fitness value. This reduces the number of intermediate mappings which need to be stored at each iteration.
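Assuming the fitness function is the non-NULL-cell ratio described above, it can be written as a small function. The concrete numbers in the assertion are our own illustration: a target of arity 4 and a mapping matching 3 of its attributes, yielding 10 tuples with 5 nulls among the matched values, gives the 62.5% ratio mentioned in the worked example.

```python
# The fitness value as described above: the ratio of non-NULL cells in the
# populated target. The example numbers are our own illustration.

def fitness(arity_t, matches_m, nulls_m, size_m):
    total_cells = arity_t * size_m
    padding_nulls = (arity_t - matches_m) * size_m  # NULLs for unmatched columns
    return 1 - (nulls_m + padding_nulls) / total_cells

# target with 4 attributes; mapping matches 3, yields 10 tuples with 5 nulls
assert fitness(arity_t=4, matches_m=3, nulls_m=5, size_m=10) == 0.625
```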
[0021] We also describe a non-transitory carrier carrying a computer program which when implemented on a computer causes the computer to implement the method described above.
[0022] We also describe a system comprising a target database, a plurality of source databases and a computing device which is configured to implement the method described above to map data from the plurality of source databases to the target database.
Brief Description of Drawings
[0023] For a better understanding of the invention, and to show how embodiments of the same may be carried into effect, reference will now be made, by way of example only, to the accompanying diagrammatic drawings in which:
[0024] Figure 1a is a flowchart showing the overall steps of the method;
[0025] Figure 1b is an example showing a target schema and some input schemas;
[0026] Figure 2 is a flowchart showing the detail of the step of generating mappings in the method of Figure 1a;
[0027] Figures 3a and 3b are flowcharts showing the detail of the step of merging mappings in the method of Figure 2;
[0028] Figure 4 is a flowchart showing the detail of the step of choosing an operator in the method of Figure 3a;
[0029] Figure 5a is a flowchart showing the detail of a step in the method of Figure 4;
[0030] Figures 5b and 5c are flowcharts showing the detail of another step in the method of Figure 4;
[0031] Figure 6 is a flowchart showing the detail of a step which is common to the method of Figures 5a and 5b; and [0032] Figure 7 is a flowchart showing the detail of the step of determining whether a mapping is fittest in the method of Figure 3b.
[0033] Figure 8 is a schematic block diagram of a system for implementing the described method.
Detailed Description of Drawings
[0034] Figure 1a illustrates the overall method for integrating data sets, for example by populating a target database with information from a plurality of sources. In the first step S100, the target schema t is read and, in step S102, the input schemas with their relations are read. Each input relation has one or more attributes. It will be appreciated that although these steps are shown sequentially, some of the steps could be carried out in different orders.
[0035] For example, referring to Figure 1b, suppose a user is looking for properties in Manchester or Cambridge and would like to know some information about the postcode, street and price together with their associated income and crime ranks. The target schema 20 consists of a single relation labelled “MA_CA_statistics” (which is illustrated as a table in this example) and contains a plurality of attributes: postcode, price, street, income rank and crime rank which are then to be populated with data. Thus as shown a relation may be considered to be a set of tuples (d1, d2, ..., dn), where each element di is a member of Di, a data domain. There is also a plurality of input schemas 30, 32, 34, 36 in the form of relations which for simplicity have attributes with similar names for matching source and target attributes (again the inputs are illustrated as tables in this example). The input schema includes: Manchester Agency (MA) 30 which has the following attributes: city, street, postcode and price; Cambridge Agency (CA) 32 which has the following attributes: postcode, city, street, price and county; UK Deprivation (UKD) 34 which has the following attributes: crime rank, crime decile, county and postcode; and UK quality of life statistics (UKQ) 36 which has the following attributes: county, income rank, economic activity rank and health rank.
[0036] Returning to Figure 1a, the next step S104 is to create an empty set of sub-solutions for mapping the data in the source schemas into the target schema. Mapping the data is a complex or compound problem because it involves finding mappings involving multiple input relations with attributes that match the same target relation. Each member of the set of sub-solutions represents a method for dividing the complex problem of mapping the data into a collection of simpler sub-problems for which it is easier to find a solution. As explained in more detail below, the sub-problems involve pairs of subsets of the initial input relations. The results from each pair of sub-problems are then merged to provide a solution to the complex problem.
[0037] Steps S106 to S110 show some pre-processing steps which generate profiling data for each source before mapping generation begins. In step S106, the matches between the attributes of the source and target schemas are identified and read. These matches can be thought of as metadata. In step S108, the candidate keys are identified and read. A candidate key for a relation is a column (or a combination thereof) that has unique values in the relation. For example, in Figure 1b, the candidate keys are identified using the [ck] attribute annotation. In step S110, the inclusion dependencies are identified and read. The candidate keys and inclusion dependencies can be thought of as metadata on the source.
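The candidate-key check described above can be sketched in Python (an illustrative sketch only; the example rows are invented and not taken from Figure 1b):

```python
# A column, or a combination of columns, is a candidate key when its
# values are unique within the relation.

def is_candidate_key(rows, col_indexes):
    """True if the projection on col_indexes contains no duplicates."""
    projected = [tuple(row[i] for i in col_indexes) for row in rows]
    return len(projected) == len(set(projected))

# Invented example rows in the shape of the MA relation
# (city, street, postcode, price):
ma_rows = [
    ("Manchester", "Oxford Rd", "M1 5BD", 1200),
    ("Manchester", "Cambridge St", "M1 5EB", 900),
]
print(is_candidate_key(ma_rows, [2]))  # True: postcode values are unique
print(is_candidate_key(ma_rows, [0]))  # False: the city value repeats
```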
[0038] The inclusion dependencies (INDs) are determined as follows. Given two projections R and S with identical arity over relations R' and S', respectively, we define R ⊆θ S, where θ is the overlap of tuples between R and S and θ can have the following values:
1) if R ∩ S = ∅, then R ⊆θ1 S and S ⊆θ2 R, where θ1 = θ2 = 0, and, based on profiling evidence, we say that R and S are disjoint and there is no inclusion dependency from R to S (or, indeed, from S to R);
2) if R ∩ S = R, then R ⊆θ S, where θ = 1, and we say that, based on profiling evidence, R ⊆1 S and there is a (total) inclusion dependency from R to S;
3) if R ∩ S ≠ R and R ∩ S ≠ S, then R ⊆θ S, where
θ = V(R ∩ S) / V(R),
where V(R) is the number of distinct values on R; in this case 0 < θ < 1 and the inclusion dependency is said to be partial.
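The overlap computation defined above can be sketched in Python. Two of the postcode values (M1 5BD, M1 5EB) come from Figure 1b; the remaining values are invented for illustration:

```python
# theta(R, S) = V(R ∩ S) / V(R), where V(.) counts distinct values:
# the fraction of R's distinct values that also appear in S.

def overlap(r_values, s_values):
    r, s = set(r_values), set(s_values)
    if not r:
        return 0.0
    return len(r & s) / len(r)

# Half of UKD's distinct postcodes also appear in MA, so the overlap
# from UKD to MA is 0.5, as in the example of paragraph [0039].
ukd_postcodes = ["M1 5BD", "M1 5EB", "CB2 1UW", "OX1 2AB"]
ma_postcodes = ["M1 5BD", "M1 5EB", "M2 3CD", "M3 4EF"]
print(overlap(ukd_postcodes, ma_postcodes))  # 0.5
```

Because both columns here have four distinct values, the overlap is the same in the opposite direction, matching the observation in paragraph [0039].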
[0039] For example, referring to Figure 1b, the partial inclusion dependencies postcode in UKD and county in UKQ are shown. The overlap is the fraction of the values in the dependent attribute that are found in the referenced attribute: the direction of the arrow is from the dependent attribute to the referenced attribute. Thus, for the candidate key postcode, the overlap from UKD to the source MA with the same attribute is 0.5 (i.e., half of the distinct values within the column overlap: M1 5BD and M1 5EB). It is noted that the overlap is the same in the opposite direction because each column has four different values. By contrast, the overlap between the postcode columns in UKD and CA differs because CA has two repeated entries. Thus, the overlap from UKD to the source CA for postcode is 0.25 (i.e., one value within the column overlaps: CB2 1UW) but the overlap from CA to the source UKD for postcode is 0.334. Other overlaps are illustrated in the Figure and summarised below:
[Table summarising the overlap values between the source attributes; image not reproduced in this text.]
[0040] The profile data may also include statistics, such as the cardinality of each relation, the number of distinct values for each column, and the number of nulls for each column, as required for a fitness function that chooses between candidate mappings. Access to the profiling data over the input sources is assumed; it could be produced using a tool such as Metanome ("Data profiling with Metanome" by Papenbrock et al., published in Proc. VLDB Endow., 8(12):1860-1863, August 2015) or SINDY ("Scaling out the discovery of inclusion dependencies" by Kruse et al., published in Datenbanksysteme für Business, Technologie und Web (BTW), 16. Fachtagung des GI-Fachbereichs "Datenbanken und Informationssysteme", 4.-6.3.2015 in Hamburg, Germany, Proceedings, volume 241 of LNI, pages 445-454, GI 2015).
[0041] Returning to Figure 1a, we now have the necessary information to begin creating the mappings. The problem can be defined as: given a collection of source schemas S, each source schema with many relations (i.e., tables 30 to 36 in Figure 1b), a target relation T (i.e., table 20 in Figure 1b), metadata on the sources obtained through data profiling, including candidate keys and (partial) inclusion dependencies, and metadata in the form of matches between the source relations and the target, generate a set of candidate mappings to populate the target relation with instance data from the source schemas. In step S112, a set of initial mappings is created where for each source relation a 1-to-1 mapping is created, i.e., each 1-to-1 mapping maps only one source relation to the target relation, without merging it with other source relations. This step is considered to be the creation of the base sub-solutions for the mapping problem, i.e., if the input relations cannot be merged, then the algorithm will be able to output the base solution. The output is "initMaps". This output is then added to the set of sub-solutions in iteration 1 as subsolutions[1] = initMaps (step S114). Each element in initMaps is a sub-solution for each of the source relations that have at least one match with the target relation.
[0042] Each sub-solution is a solution to a sub-problem which is trying to find a mapping for fewer relations than in the initial input. Each sub-problem thus represents an iteration in the mapping generation process. At step S116, the method determines the number of the last iteration, i.e. finaliterationidx, which is equal to the number of input relations. The determined finaliterationidx is an input to the recursive algorithm which is used to generate the mappings - genMaps - step S118. The output of this algorithm consists of all the subsolutions which are generated and retained in all iterations of the algorithm. In a final step, these sub-solution mappings can be outputted (step S120) and the method ends. However, it will be appreciated that a selection of these intermediate mappings may also be output. In other words, the final set of mappings comprises at least some of the mappings from the sets of intermediate mappings generated at each iteration.
[0043] Merely as an example, the table below shows how the inputs in Figure 1b could be combined to create two complete target rows. It will be appreciated that this is not a complete output of a mapping and is merely provided for illustration. For the first complete row, the first three data entries may be from the MA table and the other two data entries from the UKD and UKQ tables respectively. For the second complete row, the first three data entries may be from the CA table and the other two data entries from the UKD and UKQ tables respectively. In both examples, the candidate key of the postcode may be used to merge the MA/CA and UKD tables and the candidate key of county to merge the UKD and UKQ tables, as explained in more detail below.
[Example table showing two complete target rows; image not reproduced in this text.]
[0044] Figure 2 shows the detail of the recursive algorithm named "genMaps" which is the top-level algorithm that uses dynamic programming to divide the complex problem into a series of sub-problems. Dynamic programming is an approach to solving a complex problem by dividing it into a series of sub-problems. In the recursive process, the algorithm will try to break down the complex problem, i.e., finding a mapping for all initial sources (finaliterationidx), into smaller sub-problems, i.e., finding mappings for fewer initial relations by recursively calling genMaps with a parameter idx < finaliterationidx. When genMaps for iteration 1 (genMaps(1)) is called, it means that it has reached the base solution, and the initial mappings that were previously set as the sub-solution for iteration 1 will be returned. By trying to build the mapping for the final iteration, intermediate sub-solution mappings need to be created and used in subsequent iterations. As explained in more detail below, given N initial source relations, in each iteration idx, 1 < idx < N, the algorithm searches for the best way to merge any idx source relations. The mappings that result from combining each subset of source relations characterize new relations that are referred to as intermediate mappings, the collection of which comprises the solution at iteration idx. The flowchart can also be expressed as an algorithm as follows:
1: function genMaps(idx)
2:   if subsolution[idx] exists then
3:     return subsolution[idx]
4:   else
5:     iteration_maps ← ∅
6:     for i ← 1 to idx/2 do
7:       batch1 ← genMaps(i)
8:       batch2 ← genMaps(idx - i)
9:       new_maps ← MERGEMAPPINGS(batch1, batch2)
10:      iteration_maps.add(new_maps)
11:    subsolution[idx] ← iteration_maps
12:    return subsolution[idx]
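The dynamic-programming structure of this listing can be sketched in Python. This is a simplified illustration only: merge_mappings here merely labels each pairing, whereas the real algorithm applies chooseOperator and retains only the fittest merges, as described below.

```python
# Memoization table of sub-solutions, indexed by iteration number.
subsolution = {}

def merge_mappings(batch1, batch2):
    # Stand-in merge: pair every mapping from one batch with every
    # mapping from the other.
    return [f"merge({m1},{m2})" for m1 in batch1 for m2 in batch2]

def gen_maps(idx):
    if idx in subsolution:            # memoized sub-solution (lines 2-3)
        return subsolution[idx]
    iteration_maps = []
    for i in range(1, idx // 2 + 1):  # split idx into i and idx - i
        batch1 = gen_maps(i)
        batch2 = gen_maps(idx - i)
        iteration_maps.extend(merge_mappings(batch1, batch2))
    subsolution[idx] = iteration_maps # memoize for later calls (line 11)
    return iteration_maps

# Base sub-solutions: the 1-to-1 mappings for three source relations.
subsolution[1] = ["MA", "CA", "UKD"]
result = gen_maps(3)
print(len(result))  # 27: 3 iteration-1 maps paired with 9 iteration-2 maps
```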
[0045] Thus, as shown in step S200, the input to the algorithm is idx (the iteration index) and the output collection of mappings is termed "resultMaps". Initially, as shown, the set of mappings is empty (step S202). At step S204, there is a determination as to whether a subsolution for this iteration (idx) already exists. If the subsolution already exists, it is output as the collection of mappings for this iteration, i.e., resultMaps = subsolution[idx] (step S206). If the subsolution does not exist, i is set to 1 to perform an internal recursive algorithm as explained below. After initializing i with 1, a determination as to whether i < idx/2 is made. If this is not true, the solution for this iteration is memoized as the subsolution for iteration idx, i.e., subsolution[idx] = resultMaps (line 11 and step S212), for reuse in subsequent calls to the algorithm (e.g., lines 2 to 3 and steps S204 and S206).
[0046] If i < idx/2, two sets of intermediate mappings, batch 1 and batch 2, are generated at steps S214 and S216, respectively. Batch 1 is the collection of intermediate mappings for the input i and batch 2 is the collection of intermediate mappings for the input idx - i. As shown by the dotted line, the algorithm recursively calls the same method, genMaps, with different input parameters. The next step is to merge the sub-solutions from iterations i and idx - i, i.e., to merge mappings from batch 1 and batch 2 (step S218 and line 9), where each mapping from one batch is merged with all the mappings from the other batch. The resulting merged mappings are output as iMaps, i.e., the intermediate mappings for this iteration, and at step S220 these are added to the collection of mappings, resultMaps. i is then incremented by 1 as shown at step S222 and the steps S210 to S222 are iteratively repeated until the condition in step S210 is not satisfied. After iteration N, a set of mappings which merge all or subsets of the initial source relations can be obtained from the collection of sub-solutions. The schema of the output mappings is the same as the schema of the target relation, though some attributes may contain only null values.
[0047] For example, in Figure 1b, where N = 4, in iteration idx = 4 (the last one), the algorithm will try to merge the batch of mappings from iteration 3 with the batch of mappings from iteration 1, and then pair-wise merge the mappings from iteration 2. For example, assume that in iteration 2, the following mappings were found (NB these are not the complete list but just examples for illustrative purposes):
m2_1 ← merge(MA, CA)
m2_2 ← merge(UKQ, UKD)
m2_3 ← merge(CA, UKD)
Then in iteration 4, the algorithm will try to merge (m2_1, m2_2), (m2_2, m2_3) and (m2_1, m2_3). It can be observed that by merging m2_1 with m2_2, a mapping that incorporates all initial sources will be obtained.
[0048] Figures 3a and 3b show the detail of the algorithm named "mergeMaps" which implements the steps that take place within an iteration of the main loop of genMaps shown in Figure 2. The detailed algorithm can also be expressed as follows:
1: function mergeMaps(batch1, batch2)
2:   new_maps ← ∅
3:   for each map i in batch1 do
4:     for each map j in batch2 do
5:       operator ← chooseOperator(map i, map j)
6:       if operator not null then
7:         interm_map ← genIntermMapping(operator)
8:         md ← computeMetadata(interm_map)
9:         if isFittest(interm_map, new_maps) then
10:          new_maps.add(interm_map)
11:  return new_maps
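The pairwise merging loop in this listing can be sketched in Python. The plug-in functions below are toy stand-ins supplied only for illustration; in the real algorithm they correspond to chooseOperator, the mapping generation of line 7, and the isFittest pruning described below.

```python
# Every mapping from one batch is paired with every mapping from the
# other; a pair is merged only when an operator is found, and the
# merged mapping is kept only when it passes the fitness test.

def merge_maps(batch1, batch2, choose_operator, gen_mapping, is_fittest):
    new_maps = []
    for m1 in batch1:
        for m2 in batch2:
            op = choose_operator(m1, m2)
            if op is None:
                continue          # no useful way to merge this pair
            interm = gen_mapping(op, m1, m2)
            if is_fittest(interm, new_maps):
                new_maps.append(interm)
    return new_maps

# Toy stand-ins: always choose a join, label the result, keep everything.
merged = merge_maps(
    ["MA", "CA"], ["UKD"],
    choose_operator=lambda a, b: "join",
    gen_mapping=lambda op, a, b: f"{op}({a},{b})",
    is_fittest=lambda m, seen: True,
)
print(merged)  # ['join(MA,UKD)', 'join(CA,UKD)']
```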
[0049] Referring to Figures 3a and 3b, the inputs to the mergeMaps algorithm are batch1 and batch2 and the output is iMaps (step S300). The initial step is that iMaps is empty (step S302). At step S304, i is set to 0 to perform the iterative algorithm as explained below. After initializing i with 0, a determination as to whether i is smaller than the size of batch1 is made. If this is not true, iMaps is output (step S309). If this is true, j is set to 0 (step S308) and a determination as to whether j is smaller than the size of batch2 is made (step S310). If this is not true, i is incremented by 1 (step S311) and steps S306, S308 and S310 are repeated. This represents the creation of the pairs of mappings from the two batches by pairing each mapping i from batch1 with all mappings j from batch2.
[0050] If j is smaller than the size of batch2 at step S310, MERGEMAPPINGS uses another algorithm, chooseOperator (line 5 and step S312), to choose an operator with which to combine map i and map j, from batch1 and batch2, respectively, if they can usefully be merged (in Figure 3a, these maps are labelled batch1[i] and batch2[j]). map i and map j represent pairs of mappings from batches of mappings from two previous iterations. The chooseOperator algorithm is described in more detail below. If the mappings cannot be usefully combined, i.e., if at step S314 the output is null, j is incremented by 1 and steps S310, S312 and S314 are repeated.
[0051] If the mappings can be usefully combined, the new mapping is generated (line 7 and step S318) and the metadata for the resulting intermediate mapping is computed (line 8 and step S320, beginning on Figure 3b). The intermediate mapping is retained if it has the highest fitness of any mapping involving the same source relations (lines 9 to 10 and steps S322 to S326). The isFittest function selects the mapping with the highest fitness and is explained in more detail below. It can be defined in different ways using table statistics and profiling data, for example by preferring candidate mappings with fewer nulls. As shown at step S324, if the new mapping satisfies the isFittest function, the old mappings that stem from the same initial relations as iMap are removed (step S326) and the new iMap is added to the set of retained iMaps (step S328). However, if the new mapping does not satisfy the conditions in the isFittest function, the mapping is not retained, j is incremented by 1, and steps S310, S312 and S314 are repeated. Both loops terminate when all pairs of mappings have been created and merged.
[0052] Figure 4 shows the detail of the algorithm named “chooseOperator” which is used in the mergeMaps algorithm shown in Figure 3a to combine two intermediate mappings taking into account how these relate to the given target table. The detailed algorithm can be expressed as follows:
1: function chooseOperator(map1, map2)
2:   // t_rel, pd (profile data) are global variables
3:   map1_ma ← matchesAttr(map1, t_rel)
4:   map2_ma ← matchesAttr(map2, t_rel)
5:   operator ← null
6:   if sameMatches(map1_ma, map2_ma) then
7:     operator ← genOperatorSame(map1, map2)
8:   else
9:     if diffMatches(map1_ma, map2_ma) then
10:      operator ← genOperatorDiff(map1, map2)
11:  return operator
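The decision made by this listing can be sketched in Python. Note that in the full algorithm the actual operator (or null) is then determined by genOperatorSame or genOperatorDiff; this sketch only classifies the pair. Attribute names follow Figure 1b.

```python
# Mappings that match the same target attributes are candidates for a
# union; mappings that match different target attributes are
# candidates for a join.

def matched_attrs(source_attrs, target_attrs):
    return set(source_attrs) & set(target_attrs)

def choose_operator_kind(map1_attrs, map2_attrs, target_attrs):
    ma1 = matched_attrs(map1_attrs, target_attrs)
    ma2 = matched_attrs(map2_attrs, target_attrs)
    return "union-candidate" if ma1 == ma2 else "join-candidate"

target = {"postcode", "price", "street", "income_rank", "crime_rank"}
ma = {"city", "street", "postcode", "price"}
ca = {"postcode", "city", "street", "price", "county"}
ukd = {"crime_rank", "crimedecile", "county", "postcode"}

print(choose_operator_kind(ma, ca, target))   # union-candidate
print(choose_operator_kind(ma, ukd, target))  # join-candidate
```

MA and CA both match postcode, price and street, so they are union candidates; UKD matches postcode and crime_rank, a different set, so MA and UKD are join candidates, as in paragraph [0054].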
[0053] As set out in Figure 4, the inputs to the algorithm are iMap1 and iMap2 (i.e., two mappings) and the output is the op (operator) as set out in step S400. As shown in step S402, initially the operator is set to the NULL value. The sets of matched target attributes are retrieved for each mapping w.r.t. the target (steps S404 and S406, lines 3 to 4 above). The input in step S404 is thus the matches of the mapping iMap1 with the attributes of the target and similarly for step S406 the matches of the mapping iMap2 with the attributes of the target. The outputs are labelled tm1 (map1_ma) and tm2 (map2_ma) respectively, representing the two sets of matched target attributes.
[0054] The two outputs are then compared (step S408) and if they do not match, i.e., if the two mappings match different attributes within the target schema, they are considered candidates for joining by the algorithm GenOperatorDiff (step S410 and line 10). Alternatively, if the outputs match and thus the two mappings match the same attributes within the target schema, they are considered candidates for unioning by the algorithm GenOperatorSame (step S412 and line 7). Returning to the example shown in Figure 1b, the matches for the MA relation are thus postcode, price and street, and CA also has the same matches so these are candidates for GenOperatorSame. However, UKD has different matches, namely postcode and crime_rank, to those of MA, and thus MA and UKD are candidates for joining by the algorithm GenOperatorDiff.
[0055] The output of chooseOperator will either be an operator that represents the way to merge two input mappings or NULL if no useful way could be found to merge them with respect to the target relation. If the algorithm merges the two mappings, then the resulting mapping that is generated will be the result of a union, join or full outer join. Another possible action is to discard subsumed mappings; a relation A is subsumed by a relation B if A ⊆ B, so the subsumed relation need not be considered in subsequent merges.
[0056] Figure 5a shows the detail of the algorithm named "genOperatorSame" which is used in the chooseOperator algorithm shown in Figure 4 to decide which operator to apply between two source relations when they match identical target attributes. The detailed algorithm can also be expressed as follows:
1: function genOperatorSame(map1, map2)
2:   subsumedMap ← isSubsumed(map1, map2)
3:   if subsumedMap not NULL then
4:     discard(subsumedMap)
5:     operator ← null
6:   else
7:     operator ← Union(map1, map2)
8:   return operator
[0057] As set out in Figure 5a, the inputs to the algorithm are iMap1 and iMap2 (i.e., two mappings) and the output is the op (operator) as set out in step S500. As shown in step S502, initially the operator is set to the NULL value. An algorithm entitled "isSubsumed" (line 2, step S504; the algorithm is described in more detail in Figure 6) tests whether the values for the matched attributes from the first relation are fully contained in the attributes in the second relation, i.e., are subsumed. As shown in step S504, the inputs are also iMap1 and iMap2 and the output is the subsumed mapping out of the two given as input (subsumedMap). [0058] In a first case (lines 3 to 5), the output subsumedMap will not be null and no merge will take place because one relation is subsumed by the other relation and thus the subsumed relation can be safely discarded (lines 4 to 5 and step S510). In a second case (line 7 and step S508), the chosen operator is the union of the two projected mappings on the attributes that are needed in the target and that both mappings match. In other words, the union of iMap1 (also expressed as map1) and iMap2 (also expressed as map2) is returned.
[0059] Returning to the example shown in Figure 1b, one of the calls to "genOperatorSame" will be made in iteration 2 to merge the CA and MA relations as they match the same target attributes, e.g., postcode, price and street. It can be observed from Figure 1b that one does not subsume the other; in fact, their tuples are disjoint, and the lack of (partial) inclusion dependencies between the two relations reflects this. The result of the merge will be a mapping: m2_1 ← union(MA, CA).
[0060] Figures 5b and 5c show the detail of the algorithm named "genOperatorDiff" which is used in the chooseOperator algorithm shown in Figure 4 to decide which operator to apply between mappings that do not match the same target attributes, i.e., the sets of matched attributes can either be disjoint or have common attributes. The detailed algorithm can also be expressed as follows:
1: function genOperatorDiff(map1, map2)
2:   // t_rel, pd (profile data) are global variables
3:   operator ← null
4:   subsumedMap ← isSubsumed(map1, map2)
5:   if subsumedMap not NULL then
6:     discard(subsumedMap)
7:     return operator
8:   map1_keys ← getKeys(pd, map1)
9:   map2_keys ← getKeys(pd, map2)
10:  ind ← getMaxInd(map1_keys, map2_keys)
11:  if ind exists then
12:    if ind.overlap = 1.0 then
13:      operator ← Join(map1, map2, ind.attributes)
14:    else
15:      operator ← OuterJoin(map1, map2, ind.attributes)
16:  else
17:    map1_ind ← getMaxInd(map1_keys, map2.attributes)
18:    map2_ind ← getMaxInd(map2_keys, map1.attributes)
19:    ind ← maxCoef(map1_ind, map2_ind)
20:    if ind exists then
21:      if ind.overlap = 1.0 then
22:        operator ← Join(map1, map2, ind.attributes)
23:      else
24:        operator ← OuterJoin(map1, map2, ind.attributes)
25:    else
26:      map1_mk ← getMatchedKeys(map1, t_rel)
27:      map2_mk ← getMatchedKeys(map2, t_rel)
28:
29:      if matchSameTargetAttributes(map1_mk, map2_mk) then
30:        operator ← OuterJoin(map1, map2, <map1_mk, map2_mk>)
31:  return operator
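The overlap-driven operator choice at the heart of this listing can be sketched in Python (a simplified illustration of lines 11 to 15 only, not of the full key-selection logic):

```python
# A total inclusion dependency (overlap = 1.0) is treated as an
# inferred foreign-key relationship and yields a join; a partial
# overlap (0 < overlap < 1) yields a full outer join, which joins the
# correlatable tuples while keeping all the data.

def pick_join_operator(ind_overlap):
    if ind_overlap is None or ind_overlap == 0.0:
        return None              # no usable inclusion dependency
    if ind_overlap == 1.0:
        return "Join"            # inferred foreign-key relationship
    return "OuterJoin"           # partial overlap: full outer join

# Figure 1b: the MA.postcode / UKD.postcode overlap is 0.5, so the
# merge of MA and UKD uses a full outer join.
print(pick_join_operator(0.5))  # OuterJoin
print(pick_join_operator(1.0))  # Join
```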
[0061] As shown in Figure 5b, the inputs and output of the genOperatorDiff algorithm are the same as those of the genOperatorSame algorithm of Figure 5a. Furthermore, steps S552, S554, S556 and S560 correspond to steps S502, S504, S506 and S510 of Figure 5a. In other words, as also shown in lines 4 to 7 above, if one relation subsumes the values of the other relation on the attributes that match the target, the subsumed relation is discarded and the method returns null as no operator was found that yields useful information. If the subsumption conditions are not met, the algorithm tries to merge the input relations.
[0062] The candidate keys (i.e., iMap1.ckeys and iMap2.ckeys, also expressed as map1_keys and map2_keys) from both input relations are retrieved from the profile data (lines 8 to 9). Furthermore, any (partial) inclusion dependencies (termed IND) between a pair of keys from the two sets of candidate keys are also retrieved from the profile data. The algorithm then chooses the IND with the highest overlap (step S562 and line 10), i.e., the maximum is chosen (getMaxInd). However, these inclusion dependencies may not exist and thus the algorithm checks whether an inclusion dependency has been found (line 11 and step S564).
[0063] Continuing to Figure 5c, if there is an output IND, the level of overlap is checked, i.e., compared to 1 (step S566, line 12). If the overlap is 1 , the chosen operator is a join because a foreign key relationship has been inferred between the two input mappings on their candidate key attributes (step S568 and line 13). Alternatively, if the overlap is between 0 and 1 (but not equal to 0 or 1), the chosen operator is a full outer join because a foreign key relationship could not be inferred, but some tuples can be correlated due to the overlap between the candidate keys. Thus, the algorithm merges the two mappings by applying a full outer join, thus joining the tuples that can be joined while keeping all the data (step S570 and line 15). In both cases, the join condition will compare the candidate key attributes that are involved in the chosen inclusion dependency, IND.
[0064] Returning to Figure 1b to illustrate this process, one of the calls of genOperatorDiff will be in iteration 2 when merging MA and UKD as these match different target attributes, i.e., they both match postcode, but MA also matches street and price while UKD matches crime_rank. The algorithm will find that there is a candidate key on UKD.postcode and that this attribute has a partial inclusion dependency with MA.postcode, which is not a candidate key. Accordingly, there are no inclusion dependencies between two candidate keys and the algorithm tries to infer a foreign key relationship between a candidate key from one relation and a subset of attributes of the other relation which may not represent a candidate key (steps S572 and S574 of Figure 5b and lines 17 and 18). Thus, in step S572 (line 17), the inclusion dependencies (INDs) for all candidate keys in the first source and the attributes in the second source are identified and the IND with the highest overlap is chosen, i.e., the maximum is chosen (getMaxInd with an output ind1). Similarly, in step S574 (line 18), the inclusion dependencies (INDs) for all candidate keys in the second source and the attributes in the first source are identified and the IND with the highest overlap is chosen, i.e., the maximum is chosen (getMaxInd with an output ind2). The outputs from steps S572 and S574 are compared to choose the overall maximum at step S576 (line 19).
[0065] There may be no inclusion dependencies between key and non-key attributes and thus the step of checking whether an inclusion dependency has been found is repeated as step S578 (line 20). If there is an inclusion dependency, steps S566 to S570 of Figure 5c are repeated to check the overlap. If the overlap is 1, a join is used as before (step S568 and line 22). Alternatively, as illustrated in the example, the value of the partial inclusion dependency is 0.5 and thus the result of the merge will be a mapping: OuterJoin(MA, UKD, <MA.postcode, UKD.postcode>) (step S570 and line 24).
[0066] There may be no inclusion dependencies involving candidate keys; for example, in Figure 1b, there are no candidate keys in MA that share an inclusion dependency with attributes in UKQ or vice versa. Thus, for MA and UKQ it is not possible to infer a foreign key relationship as described above. Accordingly, the algorithm retrieves the candidate keys from both relations that match the target attributes. In more detail, as shown in Figure 5c, the candidate keys from the first source are retrieved (iMap1.ckeys or map1_keys) as the input and the ones which match the target attributes are output (tckeys1 or map1_mk) as set out in step S580 and line 26. Similarly, the candidate keys from the second source are retrieved (iMap2.ckeys or map2_keys) as the input and the ones which match the target attributes are output (tckeys2 or map2_mk) as set out in step S582 and line 27. These output keys are compared in step S584 (line 29) and if they match, the two relations are merged by applying the full outer join operation as shown in step S586, where the join is on the attributes that meet the above-mentioned condition (line 30). Otherwise, no operator which can join the sources is found and the operator is output as null. In the example, there are no attributes in MA and UKQ which are candidate keys and that match the same target attributes, so the comparison between tckeys1 (map1_mk) and tckeys2 (map2_mk) will not return any matches and thus the operator will be null.
[0067] Figure 6 shows the detail of the algorithm "isSubsumed" which is used as described above. As shown in step S600, the inputs are iMap1 and iMap2 and the output is the subsumed mapping of the two given as input, i.e., subsumedMap will be equal to iMap1 or iMap2. Initially, subsumedMap is set to NULL (step S602). In steps S604 and S606, the attributes from each of the first and second mappings which match the target schema are identified as sa1 and sa2 respectively. It will be appreciated that although shown as two separate steps, these steps could be carried out simultaneously or in a different order. In step S608, the inclusion dependencies (INDs) from the matched attributes of the first mapping to the matched attributes of the second mapping are identified and the IND with the highest overlap is chosen, i.e., the maximum is chosen (getMaxInd with an output ind).
[0068] The level of overlap is next checked, i.e., compared to 1 (step S610). If the overlap is 1, the subsumedMap is set as the first intermediate mapping (step S612). Alternatively, if the overlap is not 1, the inclusion dependencies (INDs) from the matched attributes of the second mapping to the matched attributes of the first mapping are identified and the IND with the highest overlap is chosen (step S614). The level of overlap is checked again (step S616) and if the overlap is 1, the subsumedMap is set as the second intermediate mapping (step S618). Alternatively, if the overlap is not 1, the subsumedMap remains NULL and is returned because neither mapping is subsumed by the other.
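The subsumption test described above can be sketched in Python (an illustrative sketch; the tuples are invented, and the real algorithm works from inclusion-dependency overlaps in the profile data rather than from the raw tuples):

```python
# One mapping is subsumed by another when its tuples, projected on the
# matched attributes, are fully contained in the other's, i.e. the
# inclusion dependency between them has overlap 1.

def is_subsumed(map1_tuples, map2_tuples):
    s1, s2 = set(map1_tuples), set(map2_tuples)
    if s1 <= s2:
        return "map1"   # map1 contributes nothing beyond map2
    if s2 <= s1:
        return "map2"
    return None         # neither mapping is subsumed

small = {("M1 5BD", 1200)}
large = {("M1 5BD", 1200), ("CB2 1UW", 1500)}
print(is_subsumed(small, large))            # map1
print(is_subsumed(large, small))            # map2
print(is_subsumed({("a",)}, {("b",)}))      # None
```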
[0069] Figure 7 shows the detail of the algorithm "IsFittest" which is used as described above. Each intermediate mapping will contain values that are of different quality with respect to the chosen target relation, so, in order to prune the search space, the "IsFittest" algorithm eliminates the intermediate mappings that have poor quality in comparison with other intermediate mappings which stem from the same initial source relations. The algorithm will always keep the best way of merging a subset of source relations by keeping only the intermediate mapping with the highest fitness result. The fitness result will be computed after the new intermediate mapping is created. As shown in step S700, the inputs are iMap, oldMaps and iterationIdx and the output is a Boolean output as to whether or not the mapping "isFittest". Initially, the output is set to FALSE (step S702).
[0070] In step S704, there is a function getMapsWithInitRels; in other words, all the intermediate mappings which stem from the same initial source relations are retrieved. The inputs are thus iMap and subsolutions[iterationIdx] and the output is memoizedMaps. In step S706, this output is added to the set of existing intermediate maps, i.e., oldMaps. A fitness function for each generated mapping is then calculated, for example using the function below:
fitness(m, t) = (matches(m) * T(m) - nulls(m)) / (arity(t) * T(m))
where m is the intermediate mapping, t is the target relation, arity(t) is the total number of attributes in target t, matches(m) is the number of matched attributes that mapping m has w.r.t. target t, nulls(m) is the estimated number of nulls that mapping m will generate on the values of the matched attributes and T(m) is the estimated size of mapping m, i.e., the estimated number of tuples that will result after running mapping m.
[0071] In other words, the fitness function has as input a generated mapping and the output will represent the ratio of non-NULL cells, i.e., meaningful values, that will be considered to populate the target table t. It is noted that the calculated term (arity(t)-matches(m))*T(m) will add the necessary NULLs to populate the target t if mapping m does not match all target attributes.
[0072] As an illustrative example, consider a target schema with a relation t where t has 4 attributes so arity(t) = 4. An example mapping m matches three target attributes (out of 4) so matches(m) = 3 and the total estimated number of nulls that mapping m will generate on the matched attributes is five so nulls(m) = 5. Furthermore, the estimated number of tuples that will result after running mapping m is ten so T(m) = 10, thus the fitness result is:
fitness(m, t) = (3 * 10 - 5) / (4 * 10) = 25 / 40 = 0.625
[0073] This means that mapping m is expected to produce an estimate of 62.5% non-NULL values for target relation t. This low value is mostly due to missing one match to the target which means that the result of mapping m needs a padding of 10 NULLs for the unpopulated target column.
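The fitness computation can be sketched in Python, consistent with the worked example above (arity 4, 3 matched attributes, 5 estimated nulls and 10 tuples giving 62.5% non-NULL values):

```python
# Estimated fraction of non-NULL cells that a mapping contributes to
# the target relation.

def fitness(arity_t, matches_m, nulls_m, size_m):
    total_cells = arity_t * size_m            # all target cells produced
    non_null = matches_m * size_m - nulls_m   # matched cells minus nulls
    return non_null / total_cells

print(fitness(arity_t=4, matches_m=3, nulls_m=5, size_m=10))  # 0.625
```

Note that the unmatched target attribute accounts for 10 of the NULL cells, which is why missing a single match depresses the score so sharply.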
[0074] Once the fitness results have been calculated, they are compared at step S708. If the iMap fitness is greater than the oldMaps fitness, the output is TRUE (step S710) and the new map is retained. Alternatively, if the iMap fitness is less than the fitness of previously retained mappings, the output is FALSE (step S712) and the new intermediate mapping is not retained.
[0075] At least some of the example embodiments described herein may be constructed, partially or wholly, using dedicated hardware. Figure 8 is a schematic illustration of such a hardware system which can be used to implement the data integration method described above. The system may comprise a computing device 10, which may be any suitable device, e.g., a user device or a server, connected to a plurality of databases 20a, 20b, to 20n, each of which contains at least one data source 22a, 22b to 22m. The number of databases and sources shown in the Figure is merely for illustration. Moreover, the plurality of data sources may all be stored on a single database, in one or more schemas. There is also a database 30 which can be used to store the combined data from the different sources in an output form, which is labelled target 32. Again, although this is shown as a separate database, it need not be separate from the database(s) with the sources. Furthermore, some or all of the databases being integrated could be stored on the computing device itself.
[0076] The computing device 10 comprises standard components such as a processor 12, a user interface 16 and memory 14. A mapping module 40 is also illustrated, which stores and/or implements the algorithms for the data integration method outlined above. The memory 14 may be used to store the intermediate data generated by implementing the method, e.g. any intermediate mappings, fitness scores, inclusion dependencies, metadata and so on.
[0077] Terms such as ‘component’, ‘module’ or ‘unit’ used herein may include, but are not limited to, a hardware device, such as circuitry in the form of discrete or integrated components, a Field Programmable Gate Array (FPGA) or Application Specific Integrated Circuit (ASIC), which performs certain tasks or provides the associated functionality. In some embodiments, the described elements may be configured to reside on a tangible, persistent, addressable storage medium and may be configured to execute on one or more processors. These functional elements may in some embodiments include, by way of example, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. Although the example embodiments have been described with reference to the components, modules and units discussed herein, such functional elements may be combined into fewer elements or separated into additional elements.
[0078] Various combinations of optional features have been described herein, and it will be appreciated that described features may be combined in any suitable combination. In particular, the features of any one example embodiment may be combined with features of any other embodiment, as appropriate, except where such combinations are mutually exclusive. Throughout this specification, the term “comprising” or “comprises” means including the component(s) specified but not to the exclusion of the presence of others.
[0079] Attention is directed to all papers and documents which are filed concurrently with or previous to this specification in connection with this application and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference.
[0080] All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive.
[0081] Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.
[0082] The invention is not restricted to the details of the foregoing embodiment(s). The invention extends to any novel one, or any novel combination, of the features disclosed in this specification (including any accompanying claims, abstract and drawings), or to any novel one, or any novel combination, of the steps of any method or process so disclosed.
[0083] Although a few preferred embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that various changes and modifications might be made without departing from the scope of the invention, as defined in the appended claims.

Claims

1. A data integration method for populating a target database with data from a plurality of source databases comprising source relations having a plurality of source attributes, the method comprising: inputting a target schema for the target database, wherein the target schema comprises a plurality of target attributes;
generating a set of mappings which populates the target database with data from at least a subset of the plurality of source relations by
defining a plurality of sub-solutions wherein for each sub-solution a number of the source relations are merged and each sub-solution comprises a set of intermediate mappings to map data from at least a subset of the source relations to the target schema using source attributes of each source relation which match target attributes in the target schema;
iteratively solving each of the plurality of sub-solutions to generate, at each iteration, a set of intermediate mappings which maps more source relations to the target schema than a set of intermediate mappings generated in the previous iteration by selectively merging pairs of mappings from each set of intermediate mappings generated in the previous iteration, and
outputting as the set of mappings at least some of the mappings from the sets of intermediate mappings generated in previous iterations.
2. The method of claim 1 wherein iteratively solving comprises solving a base sub-solution in which no source relations are merged to create a base set of intermediate mappings which is used in each iteration.
3. The method of claim 1 or claim 2, comprising
identifying a plurality of candidate keys within the plurality of sources, wherein each candidate key is a set of source attributes; and
calculating inclusion dependencies for the sources, wherein an inclusion dependency is a measure of the overlap between two sets of source attributes;
wherein selectively merging pairs of mappings comprises determining whether the pair of mappings are to be merged using at least one of the source attributes of the source relations, the candidate keys and the inclusion dependencies.
4. The method of claim 3, wherein when it is determined that the pair of mappings are to be merged, selectively merging pairs of mappings comprises selecting a merging which is one of a union, a join and a full outer join of the sources within the pair of mappings.
5. The method of claim 3 or 4, wherein selectively merging pairs of mappings comprises determining whether a first mapping within the pair of mappings is subsumed by a second mapping in the pair of mappings.
6. The method of claim 5, comprising discarding the first mapping to prevent merging when it is determined to be subsumed by the second mapping.
7. The method of any one of claims 3 to 6, comprising:
identifying a first set of attributes which are the attributes in the first mapping which match the target attributes,
identifying a second set of attributes which are the attributes in the second mapping which match the same target attributes,
selecting the inclusion dependency between the first and second set of attributes having the highest value; and
comparing the selected inclusion dependency to a threshold value.
8. The method of claim 7, comprising
determining whether the first set of attributes match the second set of attributes, and when a match is determined, selectively merging using a union when the selected inclusion dependency is less than the threshold value.
9. The method of claim 7, comprising
determining whether the first set of attributes match the second set of attributes, and if the sets of attributes are not the same, selectively merging using a full outer join when the selected inclusion dependency is less than the threshold value and selectively merging using a join when the selected inclusion dependency equals the threshold value.
10. The method of any one of claims 3 to 5, comprising:
identifying a first set of candidate keys which are the candidate keys in the first mapping which match the target attributes,
identifying a second set of candidate keys which are the candidate keys in the second mapping which match the target attributes,
determining whether the first set of candidate keys match the same target attributes as the second set of candidate keys, and
if there is a match, selectively merging using a full outer join.
11. The method of any one of the preceding claims, wherein iteratively solving further comprises calculating a fitness value for each intermediate mapping.
12. The method of claim 11, further comprising comparing calculated fitness values for all intermediate mappings stemming from the same source relations and retaining only the intermediate mapping having the highest fitness value.
13. The method of claim 11 or claim 12, wherein the fitness value is calculated using an estimate of the number of nulls that the intermediate mapping will generate for the target attributes in the target schema which are matched to source attributes in the sources within the intermediate mapping.
14. A non-transitory carrier carrying a computer program which when implemented on a computer causes the computer to implement the method of any one of claims 1 to 13.
15. A system comprising a target database, a plurality of source databases comprising source relations and a computing device which is configured to implement the method of any one of claims 1 to 14 to map data from the plurality of source databases to the target database.
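By way of a non-authoritative sketch, the merge-operator selection described in claims 7 to 10 might be implemented along the following lines. The function name, the use of Python sets for attribute sets, and the treatment of the threshold value are assumptions made for illustration only; they are not part of the claims.

```python
def select_merge(first_attrs, second_attrs, inclusion_dep, threshold,
                 first_key_targets, second_key_targets):
    # first_attrs / second_attrs: target attributes matched by each mapping
    # inclusion_dep: the highest inclusion dependency between the two
    #   attribute sets, compared to a threshold value (claim 7)
    # first_key_targets / second_key_targets: target attributes matched by
    #   each mapping's candidate keys (claim 10)

    # Claim 10: candidate keys of both mappings match the same target
    # attributes -> full outer join.
    if first_key_targets and first_key_targets == second_key_targets:
        return "full outer join"

    if first_attrs == second_attrs:
        # Claim 8: identical matched attributes and an inclusion
        # dependency below the threshold -> union.
        if inclusion_dep < threshold:
            return "union"
    else:
        # Claim 9: different matched attributes.
        if inclusion_dep < threshold:
            return "full outer join"
        if inclusion_dep == threshold:
            return "join"

    return None  # no merge operator selected for this pair


# Example: two mappings covering the same target attributes with little
# value overlap between them are combined by union.
print(select_merge({"name", "city"}, {"name", "city"}, 0.1, 1.0,
                   frozenset(), frozenset()))  # union
```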
PCT/GB2019/051866 2018-07-02 2019-07-01 Method and system for integrating data sets WO2020008180A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GBGB1810847.2A GB201810847D0 (en) 2018-07-02 2018-07-02 Method and system for integrating data sets
GB1810847.2 2018-07-02

Publications (1)

Publication Number Publication Date
WO2020008180A1 true WO2020008180A1 (en) 2020-01-09

Family

ID=63143483

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2019/051866 WO2020008180A1 (en) 2018-07-02 2019-07-01 Method and system for integrating data sets

Country Status (2)

Country Link
GB (1) GB201810847D0 (en)
WO (1) WO2020008180A1 (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680082A (en) * 2020-04-30 2020-09-18 四川弘智远大科技有限公司 Government financial data acquisition system and data acquisition method based on data integration
CN111680082B (en) * 2020-04-30 2023-08-18 四川弘智远大科技有限公司 Government financial data acquisition system and method based on data integration

Also Published As

Publication number Publication date
GB201810847D0 (en) 2018-08-15


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19745206

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19745206

Country of ref document: EP

Kind code of ref document: A1