IE20120488A1 - Multi-taxonomy merger algorithm - Google Patents
Multi-taxonomy merger algorithm
- Publication number
- IE20120488A1
- Authority
- IE
- Ireland
- Prior art keywords
- identifier
- taxonomy
- data
- datum
- identifiers
- Prior art date
Landscapes
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
A method of processing data that has been characterised using multiple taxonomies in order to combine and reconcile the data comprises: defining a set of allowable identifier items, where each identifier item is a single string or primitive type or other code, where each identifier item indicates which child node in the taxonomy a datum is identifiable with given the parent node in the taxonomy that the datum is identifiable with where that identifier item applies, where an identifier consists of the set of identifier items for which valid identification information exists for that datum, and where each datum of the source data comprises both an identifier and the datum information, wherein at least some of the source data has incomplete identifiers with one or more identifier items not being included; defining a qualification value for each identifier, the qualification value being equal to the number of identifier items that contain valid information, and an identifier having a complete set of identifier items being categorised as fully qualified; merging the multiple taxonomies for the source data by making each child node in a combined taxonomy have a one higher qualification value than the parent node from which it stems; and determining a probability for a specific fully qualified identification of a datum having incomplete identifiers as given by a suitable statistical or other logical representation of the combined effect of the probability of each descendant child node leading down the taxonomy from the respective incomplete identity node in the taxonomy to the specific fully-qualified node in the taxonomy, when the taxonomies are merged using the defined qualification value ordering; such that all source data is associated probabilistically with fully-qualified nodes of the combined taxonomy.
Description
MULTI TAXONOMY MERGER ALGORITHM
The invention relates to a method of processing data that is in multiple taxonomies in order to combine and reconcile the data.
This application relates to data of the type that can typically be represented by a classification tree known as a taxonomy. Figure 1 shows an example taxonomy that is referenced in more detail below. A taxonomy comprises nodes in the form of holders at junctions of the taxonomy or at end points, which are known as leaf nodes. For example in Figure 1 each box is a node and the leaf nodes are the lowermost boxes. A given datum can be placed within the taxonomy based on its collection of identifier items, which are single strings (or primitive types) indicating which child node in the taxonomy a datum is identifiable with given the parent node in the taxonomy that the datum is identifiable with. Example identifier items in Figure 1 include Small, Ford, Hatchback and so on. The identifiers (i.e. comprised of one or more identifier items) that make up the identity of the data in a given taxonomy are referred to herein as allowable identifiers. If a datum has a full set of identifier items then it can be allocated to a particular leaf node and its identity is said to be fully qualified. A fully qualified datum has the maximum qualification value for the given taxonomy, where the qualification value corresponds to the number of allowable identifier items defined for the datum. If one or more identifier items are unspecified or missing then the datum is only part qualified and the qualification value will be less than the maximum.
Business data can come from multiple sources and this generally means that it is in different taxonomies. It is desirable to be able to combine data from different sources, for example in order to have a larger and hence more statistically significant pool of data. A problem hampering the fusion of data with multiple taxonomies concerns data identity.
Different sources often categorise data differently and some data sources fail to completely elaborate the identity of the data. Often the lowest grade data with respect to its identification is the most valuable, yet prior-art methods fail to extract the full information that such ‘irregular’ data provides.
An example is data that a market analyst holds on cars such as in Figure 1. Suppose the data has identifier items of size, number of doors, manufacturer and model. An illustration of the type of problem is that the analyst's differing sources of data do not specify the full identification of the data. Suppose, for example, some data is identified only by manufacturer. Another report provides a survey of only small cars. In such manner difficulties arise in reconciling the data. The analyst thus has trouble associating the data with multiple taxonomies that in some way intersect. Real world situations of this kind are almost always complicated by variable quality data, where some of the data have identities that are even less qualified than desirable.
A further consequence is that the analyst has a statistical problem in how to model the limitedness of the data identity.
There are many ways in which the problem could be modelled. Prior art approaches can even lead to a less accurate statistic than when only a single taxonomy was used, and therefore have clearly not solved the general problem to an adequate practical level. There is a need for a method of combining data where:
1. Source data belongs to at least two differing taxonomies.
2. Some of the source data has identifiers that are not fully qualified; and
3. It is required to query the resulting combined qualification taxonomy to associate data (probabilistically) with fully-qualified nodes, using both the fully-qualified and partially-qualified data.
The third of the above criteria arises in many industries, for example in the insurance industry, in many areas of business where risk assessment is required and in fact in any other situation where data is held in a taxonomy.
There are a variety of pre-existing ways of analysing information belonging to taxonomies. These use approaches such as:
• Scoring - this is essentially a crude form of classification with a single taxonomy and thus is not relevant to the defined problem, since it cannot provide the required combination of multiple taxonomies. US 6418436 describes a method that uses scoring.
• Segmentation scoring - as described in ‘Recognition using classification and segmentation scoring’, by O. Kimball, M. Ostendorf, and R. Rohlicek, Boston University, MA. This is another kind of single taxonomy.
• Pairwise scoring - this approach also does not permit the combination of data from multiple taxonomies. Pairwise scoring is an approach that has arisen in fraud detection analysis, as described, for example, in ‘On the approximate communal fraud scoring of credit applications’, by C. Phua et al, Clayton School of Information Technology, Monash University.
• Weighting - weighting can be even more subjective than scoring in so far as it is often less quantitative in reasoning argument than scoring. Again, weighting results in a single taxonomy.
• Bagging and resampling - this is commonly associated with optimising classification. An example is described in 'Poly bagging predictors for classification modelling for credit scoring’, by Louzada et al, Expert Systems with Applications, 38, 12717-12720. Bagging and resampling is better than scoring or weighting; however, this approach does not define how a related group of multiple taxonomies is to be held in a unified classification schema.
• Dimension reduction - this approach loses the information about how the data varies along the dimension(s) that are reduced, which can have the complicating disadvantage of introducing errors that can be bounded only more crudely than desired.
See, for an example, 'Classification and dimension reduction in bank credit scoring system’, B. Liu, B. Yuan, W. Lui, published in ISNN '08 Proceedings of the 5th International Symposium on Neural Networks: Advances in Neural Networks, ISBN 978-3-540-87731-8, Springer-Verlag Berlin, Heidelberg © 2008.
• Attribution to the taxonomy within the multi taxonomy that shares the most key words as described in ‘Generating and browsing multiple taxonomies over a document collection’, by S. Spangler et al., Journal of Management Information Systems/Spring 2003, Vol. 19, No. 4, pp 191-212. This is another form of dimension reduction and has the same disadvantages.
• Cohesion and distinctness score - this approach is described in ‘Generating and browsing multiple taxonomies over a document collection’, by S. Spangler et al, Journal of Management Information Systems/Spring 2003, Vol. 19, No. 4, pp 191-212. It generates individual taxonomies that have individual themes. These are either closely related (cohere) or are not (distinct). However, this does not provide an overall coherent schema for all the classifiers and hence does not solve the problem set out above.
• Histopathologic scoring - this is a technique specialised to medicine. It is included to indicate an example of classification methods in the sciences, where the methods principally rely upon properties common only to the respective discipline. These are not relevant to the above-stated problem, which requires a generic tool.
There are data mining tools offered to the insurance market, such as Rosella™ from Rosella Software of Australia as described at www.roselladb.com, but these tools do not allow for combination of data from multiple taxonomies.
In patent literature, US 6999975 discloses a method that combines multiple taxonomies. However, the solution is specific to the field of mail processing since it is in order to deal with incomplete postal addresses.
‘Credit scoring models using soft computing methods: a survey’, A. Lahsasna et al., The International Arab Journal of Information Technology, Vol. 7, No. 2, 2010, surveys the credit scoring market. This survey does not identify any method that addresses the problem set out above.
‘Scoring methods for ordinal multi-dimensional forced-choice items’, A.L.M. De Vries (Maastricht University), L.A. Van der Ark et al. (Tilburg University), University of Girona, 31st March 2008, identifies why the weighting approach fails to be sufficiently accurate, but it does not provide a consistent framework for binding together individual taxonomies to form an efficient unified classification scheme.
In summary, there are numerous known methods but none are considered to effectively address the requirements set out above.
Viewed from a first aspect, the invention provides a method of processing data that has been characterised using multiple taxonomies in order to combine and reconcile the data, the method comprising: receiving source data from at least two differing taxonomies; defining a set of allowable identifier items, where each identifier item is a single string or primitive type or other code, where each identifier item indicates which child node in the taxonomy a datum is identifiable with given the parent node in the taxonomy that the datum is identifiable with where that identifier item applies, where an identifier consists of the set of identifier items for which valid identification information exists for that datum, and where each datum of the source data comprises both an identifier and the datum information, wherein at least some of the source data has incomplete identifiers with one or more identifier items not being included; defining a qualification value for each identifier, the qualification value being equal to the number of identifier items that contain valid information, and an identifier having a complete set of identifier items being categorised as fully qualified; merging the multiple taxonomies for the source data by making each child node in a combined taxonomy have a one higher qualification value than the parent node from which it stems; and determining a probability for a specific fully qualified identification of a datum having incomplete identifiers as given by a suitable statistical or other logical representation of the combined effect of the probability of each descendant child node leading down the taxonomy from the respective incomplete identity node in the taxonomy to the specific fully-qualified node in the taxonomy, when the taxonomies are merged using the defined qualification value ordering; such that all source data is associated probabilistically with fully-qualified nodes of the combined taxonomy.
This method provides a way to unify otherwise incompatible data into a single regularised scheme that can then be queried using a generic set of rules. Accessibility and usability of the data is greatly improved. The use of a qualification parameter provides an effective way of binding the multiple taxonomies together. The method avoids the loss of information that arises in the prior art when data is combined into a single taxonomy by simplification, and since the method is generic, it can be applied to data of any type to improve the process of reconciling multiple taxonomies in any field. The method has benefits with any number of differing taxonomies for the source data, for example there may be source data from several differing taxonomies because these have been compiled from different information sources.
In a preferred embodiment the step of defining a list of allowable identifier items comprises determining all possible identifier items in the source data such that the allowable identifier items consist of an aggregate of the identifier items of all data sources. This ensures that no identification information can be lost when the data is reconciled and a new combined representation of the data is created. It is also possible to omit identifier items if it appears that items may be of little use for the intended purpose of the combined data or for other reasons.
Preferably, the method includes building the combined taxonomy for the reconciled data based on an analysis of the allowable identifier items to determine allowable identifiers. The combined taxonomy permits the reconciled data to be queried easily. In a preferred embodiment the combined taxonomy comprises a root node with qualification value of zero and/or nodes assigned to all fully-qualified identifiers. These nodes are preferably combined with intermediate nodes that are assigned based on an analysis of possible combinations of identifier items in part qualified identifiers.
The combined taxonomy in its simplest form may comprise all available combinations of identifier items in part (or fully) qualified identifiers. However, this will typically include some combinations that are not actually possible and as a result there would be redundant nodes. Hence, it is preferred for impossible identifiers to be eliminated. In order to do this the analysis of possible combinations of identifier items may include identifying all fully qualified identifiers, generating a list of potential part qualified identifiers for each fully qualified identifier, eliminating part qualified identifiers that are not possible, and then assigning nodes to the remaining, possible, part qualified identifiers.
Preferably the method comprises eliminating non-discriminatory nodes of the combined taxonomy, that is to say eliminating nodes that are not leaf nodes and have only one more qualified immediate descendent. Another way of defining a non-discriminatory node is that it is a node that is not a leaf node and is not at a point where the taxonomy branches. With this feature the taxonomy includes only the discriminatory nodes representing possible identifiers and is hence optimised without any reduction in the information that is represented by the taxonomy.
A preferred system for the elimination of the majority of the part qualified identifiers that are not possible is to use a many-to-one implication rule. A general many-to-one implication rule that may be applied is that if a union of one-or-more identifier item columns has a many-to-one relationship to a different union of one-or-more identifier item columns then this provides an implication set, and the set of possible part qualified identifiers is the subset of part qualified identifiers that is not disallowed by one or more of the many-to-one implication rules in the implication set. For example, in some scenarios the identifier items will include a generic identifier item and also a more specific identifier item that is a subset of the generic. With this scenario it is not possible for the generic identifier item to have a null value when the specific identifier item is known. Combinations including a null value for the generic when there is valid information for the specific are not possible and may be eliminated.
Sometimes, this will leave other part-qualified identifiers that are not possible due to bespoke reasons. These other unnecessary nodes may be eliminated.
When these steps are completed the combined taxonomy of the preferred embodiment will have leaf nodes for the fully qualified identifiers, a root node with a qualification value of zero, and intermediate nodes for all part-qualified identifiers.
The step of determining a probability for a specific fully qualified identification of a datum having incomplete identifiers may be carried out using any suitable technique. One possibility is to determine the probabilities and their error distribution from relevant sample data of known frequencies for the branches of the taxonomy. Another possibility is to determine these probabilities from a cross-correlation of the data across the whole taxonomy as described in the applicant's co-pending application entitled “Taxonomic cross-correlation algorithm” and filed at the Irish and United Kingdom patent offices on 17 October 2012. A third possibility is for certain of these probabilities to be determined by an 'expert judgement' of the most likely resolution of the identity of the data. In this third case, the algorithm may require a user interface (or other data input stream) to allow that expert judgement to be input. A fourth possibility is for some other process to determine in real time how the identity of the data will be resolved, such as within game or simulator algorithms.
Viewed from a second aspect, the invention provides a data processing apparatus configured to perform the method of the first aspect and optionally the preferred features thereof as described above.
In a third aspect, the invention provides a data processing apparatus for processing data that has been characterised in multiple taxonomies in order to combine and reconcile the data, the apparatus comprising: a data receiver for receiving source data from at least two differing taxonomies; and a processor arranged to: define a set of allowable identifier items, where each identifier item is a single string or primitive type or other code, where each identifier item indicates which child node in the taxonomy a datum is identifiable with given the parent node in the taxonomy that the datum is identifiable with where that identifier item applies, where an identifier consists of the set of identifier items for which valid identification information exists for that datum, and where each datum of the source data comprises both an identifier and the datum information, wherein at least some of the source data has incomplete identifiers with one or more identifier items not being included; define a qualification value for each identifier, the qualification value being equal to the number of identifier items that contain valid information, and an identifier having a complete set of identifier items being categorised as fully qualified; merge the multiple taxonomies for the source data by making each child node in a combined taxonomy have a one higher
qualification value than the parent node from which it stems; and to determine a probability for a specific fully qualified identification of a datum having incomplete identifiers as given by a suitable statistical or other logical representation of the combined effect of the probability of each descendant child node leading down the taxonomy from the respective incomplete identity node in the taxonomy to the specific fully-qualified node in the taxonomy, when the taxonomies are merged using the defined qualification value ordering; such that all source data is associated probabilistically with fully-qualified nodes of the combined taxonomy.
The processor may optionally be arranged to carry out the method of any or all of the preferred features set out above in connection with the first aspect of the invention.
The invention also provides a computer program product comprising instructions that when executed on a data processing apparatus will configure the data processing apparatus to perform the method of the first aspect of the invention and optionally any of the above described preferable features.
Certain preferred embodiments will now be described by way of example only and with reference to the accompanying drawing in which:
Figure 1 shows an example of a taxonomy relating to the car industry.
The preferred embodiment addresses the problem of combining data from multiple taxonomies. The data would be in the form of a plurality of data entries each with a plurality of identifier items. Since the data comes from different sources and differing taxonomies then the identifier items for data from the different sources will differ. An all inclusive set of identifier items is generated via data merging.
The method of the preferred embodiment comprises the following main steps:
Step 1 - Introduction of a qualification parameter:
A ‘qualification value' is assigned to each datum that is equal to the number of identifier items that the datum has that contain valid information.
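By way of illustration only, the qualification value can be computed by counting the non-null identifier items of a datum. The following Python sketch assumes a simple dictionary representation of an identifier and hypothetical item names; it is not part of the disclosed method.

```python
# Sketch: computing the qualification value of a datum's identifier.
# The identifier is held as a dict mapping identifier-item names to values;
# missing items are represented as None (NULL).  Item names are hypothetical.

ALLOWABLE_IDENTIFIER_ITEMS = ["size", "doors", "manufacturer", "model"]

def qualification_value(identifier: dict) -> int:
    """Number of identifier items that contain valid information."""
    return sum(1 for item in ALLOWABLE_IDENTIFIER_ITEMS
               if identifier.get(item) is not None)

def is_fully_qualified(identifier: dict) -> bool:
    return qualification_value(identifier) == len(ALLOWABLE_IDENTIFIER_ITEMS)

# A datum known only to be a small Ford has qualification value 2.
datum_id = {"size": "Small", "manufacturer": "Ford", "doors": None, "model": None}
assert qualification_value(datum_id) == 2 and not is_fully_qualified(datum_id)
```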
Step 2 - A qualification taxonomy is built:
All the available data is used to build a taxonomy ordered by increased qualification.
An example taxonomy is shown in Figure 1. The following rules determine how the qualification taxonomy is built.
1. Nodes are assigned to all identifiers having qualification=1 (for example, see node 'Small'). That is, such a node has an identifier where only one of its identifier items is specified.
2. All fully-qualified identifiers are assigned nodes, which would be leaf nodes.
3. Intermediate nodes are firstly generated according to all possible permutations. The intermediate nodes represent discriminatory possible identifiers.
4. Non-discriminatory intermediate nodes should then be eliminated. In general partially-qualified nodes are assigned only if more than two arrows join them (thus an arrow may jump a level; for example, the arrow that connects 'Estate' to 'Large Ford Estate’).
5. Intermediate nodes that do not satisfy the aforementioned many-to-one implication rule can then also be eliminated.
6. Other invalid intermediate nodes can then be eliminated. For example this may be done based on input from an experienced analyst or suitable expert.
Optional rules 4-6 provide compactness, since not every possible permutation is a valid node.
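As a hedged illustration of the qualification ordering established by these rules, the sketch below links every retained node to its immediate less-qualified ancestors, allowing an arrow to jump a level where an intermediate node has been eliminated. The frozenset representation and the node contents are assumptions made for the example, not the patent's implementation.

```python
# Sketch: linking qualification-taxonomy nodes so that every edge runs from a
# node to a descendant with a strictly higher qualification value.  Nodes are
# frozensets of (identifier_item, value) pairs; the root is the empty set.

def build_edges(nodes):
    """nodes: the identifiers retained after rules 1-6.
    Returns a dict mapping each node to its immediate descendants."""
    nodes = set(nodes) | {frozenset()}              # ensure the qualification-0 root
    edges = {n: [] for n in nodes}
    for child in nodes:
        ancestors = [n for n in nodes if n < child]  # strict subsets only
        for parent in ancestors:
            # 'parent' is an immediate ancestor if no other retained node sits
            # strictly between it and 'child'; this is how an arrow may jump a
            # qualification level when intermediate nodes have been eliminated.
            if not any(parent < other < child for other in ancestors):
                edges[parent].append(child)
    return edges

# Abbreviated example in the spirit of Figure 1 (item names are assumptions):
small = frozenset({("size", "Small")})
small_ford = frozenset({("size", "Small"), ("manufacturer", "Ford")})
small_ford_hatch = small_ford | {("body", "Hatchback")}
print(build_edges({small, small_ford, small_ford_hatch}))
```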
Step 3 - Filling the taxonomy with data:
In the first instance, each datum is put in the node in the taxonomy that has the same identifier as the datum. For example, a datum that is identified as a 'Small Ford Hatchback' is put in that node, rather than say in the ‘Ka’ node.
Step 4 - Assignment of probabilities to resolve identity:
A suitable method of assignment of probabilities is made to completely resolve the identities of individual data. For example, this might be based on frequencies as follows. If say 40% of Ford hatchbacks are Ka's, then one could infer p(‘Ka'|'Ford hatchback’) = 0.4. More complex representations of probability than this example are anticipated.
Step 5 - Probabilistic taxonomy query:
Step 5 yields information by performing queries on the taxonomy as follows. Only fully-qualified taxonomy queries are made. For example, a query on 'Ka' is fully qualified whereas a query on ‘Large hatchback’ is illegal. Continuing the example, when the query demands data on 'Ka', the algorithm searches both the Ka node for data, as well as for data in all other nodes that could fully qualify to a Ka. In the case of the other nodes, for example, a very simple algorithm would assign a valid probability to a datum in the 'Small Ford Hatchback’ node being a Ka. In practice we use a much more complex probability measure. Nonetheless, the claimed algorithm based on a 'qualification' taxonomy is the same.
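A minimal sketch of steps 4 and 5 combined, using the simple frequency-based probability of the example above; the node names, counts and probability table are illustrative assumptions, and the patent anticipates more complex probability measures than this.

```python
# Sketch: steps 4 and 5 - resolving a fully-qualified query ('Ka') against
# data held in less-qualified nodes of the qualification taxonomy.

# Amount of data held at each node (illustrative counts only).
node_data = {
    "Ka": 50,                     # fully qualified
    "Small Ford Hatchback": 120,  # part qualified
}

# p(fully-qualified node | ancestor node), e.g. 40% of Ford hatchbacks are Kas.
p_resolve = {
    ("Small Ford Hatchback", "Ka"): 0.4,
}

def expected_count(target):
    """Data attributable to the fully-qualified node: its own data plus
    probability-weighted contributions from part-qualified ancestors."""
    total = node_data.get(target, 0)
    for (ancestor, node), p in p_resolve.items():
        if node == target:
            total += p * node_data.get(ancestor, 0)
    return total

print(expected_count("Ka"))  # 50 + 0.4 * 120 = 98.0
```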
A fundamental matter that the invention solves is how to simultaneously identify a datum within multi taxonomies in a regularised manner that ensures consistency.
The car example does not illustrate the full complexity of the problem. If instead, we have a more complex case with the following list of identifier items:
a) Model
b) Submodel
c) Variant
d) Fuel state (solid, liquid, gas)
e) Fuel type (HFO, diesel, etc.)
f) Meta description
g) Generic Description
Then we have three taxonomies:
1) Model, Submodel, Variant
2) Fuel state, Fuel type
3) Meta description, Generic description
The method described herein provides a way of building a taxonomy that allows the probability tree a) to c) to be included with the tree d) to e) which in turn is to be included with the tree f) to g). This transforms a very difficult statistical problem of data analysis into a regularised one that lends itself to solution.
It will be understood that this method applies to any identifier item labels and not just to ‘Model’, ‘Submodel’ and so on. The above list is given by way of example only.
The declared taxonomy can be used to calculate the probability of a datum having a certain fully-qualified identity when some of its identifier items are unknown. This is done simply from the probabilities of each taxonomic link, which can be estimated via any suitable classical, Bayesian or other probabilistic approach. The benefit of so doing is to transform data that is unusable owing to its poor identity into usable data whose identity uncertainty is systematically and statistically captured.
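As an illustrative sketch of this calculation, assuming for simplicity that the link probabilities along a single path may be multiplied together (other statistical representations are equally admissible), the probability of a fully-qualified identity given a part-qualified node can be computed as follows; the node names and probability values are assumptions.

```python
# Sketch: probability of a specific fully-qualified identity for a datum held
# at a part-qualified node, taken here as the product of per-link
# probabilities down one path (independence of links assumed for brevity).

link_probability = {                      # illustrative values only
    ("Small", "Small Ford"): 0.3,
    ("Small Ford", "Small Ford Hatchback"): 0.6,
    ("Small Ford Hatchback", "Ka"): 0.4,
}

def path_probability(path):
    """path: node names from the part-qualified node down to the
    fully-qualified node."""
    p = 1.0
    for parent, child in zip(path, path[1:]):
        p *= link_probability[(parent, child)]
    return p

print(path_probability(["Small", "Small Ford", "Small Ford Hatchback", "Ka"]))
# 0.3 * 0.6 * 0.4 = 0.072
```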
Overall, preferred embodiments of the invention realise the following benefits.
1. A means of unifying otherwise incompatible data sources into a regularised scheme;
2. Capture of all identifier-item information;
3. Query of the resulting taxonomy is available via a generic set of rules.
4. Compactness - Has considerably fewer nodes than the permutational join of the individual taxonomies.
5. Proximality - Owing to compactness, data is likely to exist in a node closer to the one of interest than would otherwise occur;
6. The set of not-fully qualified nodes is the set of ancestral nodes; and
a. Elemental classifiers (e.g. small) are found at the qualification=1 level.
b. Compound classifiers are usually more relevant (e.g. data on 'Small Ford Hatchback' is more relevant to ‘Ka’ than 'small’).
7. Comprehensiveness - For a general dataset having identifiers, there exists no more discriminating underlying model for the data than one that captures all of the (raw-data) classifiers, be they elemental or compound.
An example method for the implementation of step 2 will now be described. At step 2, usable raw data will already belong to one or more taxonomies. The following steps are preferably performed to convert from these (source taxonomies) to a unified taxonomy based on a qualification level structure, the structure of which is a core feature of the invention.
In general, the different data sources will classify the data using differing sets of identifier items. For instance, continuing the previous example, there may be the following identifier items:
a) Model
b) Submodel
c) Variant
d) Fuel state (solid, liquid, gas)
e) Fuel type (HFO, diesel, etc.)
f) Meta description
g) Generic Description
Suppose one data source uses all seven identifier items. Suppose another data source uses only those belonging to the 1st taxonomy (which were model, submodel and variant). Another data source might use others of the identifiers. Also the first data source, which uses all seven identifier items, might not be able to specify values for every identifier item for every datum it holds. Nonetheless, all sources of information are used to find the full set of fully-qualified identifiers given the superset of (in this example) seven identifier items.
Step 2a - Allocation of identifier items
In step 2a, the identifier item list of all data sources is simply aggregated. Alternatively, one could use not all of the identifier items.
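A trivial sketch of this aggregation, with hypothetical source names:

```python
# Sketch: step 2a - aggregate the identifier items declared by each source.
source_items = {
    "source_a": ["Model", "Submodel", "Variant", "Fuel state", "Fuel type",
                 "Meta description", "Generic Description"],
    "source_b": ["Model", "Submodel", "Variant"],
}
allowable_identifier_items = sorted(set().union(*source_items.values()))
print(allowable_identifier_items)
```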
Step 2b - Perform query to get list of fully-qualified identifiers
Using the identifier items obtained in step 2a, both the data sources and data-labelling sources are used to produce a long table of fully-qualified identifiers, like the following, that can be found in all the data sources. Normally this is done via a data merge, but it is manually checked afterwards and gaps filled accordingly.
Model | Submodel | Variant | Fuel state | Fuel type | Meta | Generic |
---|---|---|---|---|---|---|
Fiesta | 1100 cc | Carb | Liquid | petrol | Family | Northern |
Fiesta | 1400 cc | Turbo | Gas | gas | Commuter | Northern |
Ka | 1100CC | Carb | Liquid | petrol | Family | Southern |
Step 2c - Derive the allowable identifiers from the fully-qualified identifiers
In the ideal world every datum would have a fully-qualified identifier, but the general situation is one where there are data where their respective identifiers lack one or more identifier items. These data are said to be not fully qualified.
The next step is to compute the list of all the allowable identifiers. An allowable identifier is defined to be an identifier where between zero and many of its identifier items are missing.
For example, if we take just the 1st fully-qualified identifier in the above table, the following allowable identifiers can be generated.
Model | Submodel | Variant | Fuel state | Fuel type | Meta | Generic |
---|---|---|---|---|---|---|
Fiesta | 1100 cc | carb | Liquid | petrol | Family | Northern |
Fiesta | 1400 cc | turbo | Gas | gas | Commuter | NULL |
Fiesta | 1400 cc | turbo | Gas | gas | NULL | Northern |
Fiesta | 1400 cc | turbo | Gas | gas | NULL | NULL |
NULL | NULL | NULL | NULL | NULL | NULL | Northern |
NULL | NULL | NULL | NULL | NULL | NULL | NULL |
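A possible sketch of how the allowable identifiers can be enumerated from a single fully-qualified identifier, by replacing every subset of its identifier items with NULL; the dictionary representation is an assumption for illustration, not the patent's implementation.

```python
# Sketch: step 2c - enumerate allowable identifiers from one fully-qualified
# identifier by replacing every subset of its identifier items with NULL.
from itertools import combinations

ITEMS = ["Model", "Submodel", "Variant", "Fuel state", "Fuel type", "Meta", "Generic"]

def allowable_identifiers(fully_qualified: dict):
    """Yield every identifier obtainable by dropping 0..N identifier items."""
    for r in range(len(ITEMS) + 1):
        for dropped in combinations(ITEMS, r):
            yield {item: (None if item in dropped else fully_qualified[item])
                   for item in ITEMS}

fq = {"Model": "Fiesta", "Submodel": "1100 cc", "Variant": "Carb",
      "Fuel state": "Liquid", "Fuel type": "petrol",
      "Meta": "Family", "Generic": "Northern"}
print(sum(1 for _ in allowable_identifiers(fq)))  # 2**7 = 128 per fully-qualified identifier
```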
Step 2d - Derive the permissible identifiers from the allowables
The table of allowable identifiers can thus be very large. For example, 500 fully-qualified identifiers might generate 500,000 allowable identifiers.
Having so many identifiers is not good, because most of these identifiers are liable to lack data. Therefore the eliminations described below are preferably performed to reduce the set of identifiers to a much smaller and more appropriate set. This involves reducing the set of allowable identifiers to the set of possible identifiers. The following examples help explain the rule that is used to do this. The method is based around a many-to-one implication rule. For example, consider a new fully-qualified set
Row | Meta | Generic |
---|---|---|
0 | House | Detached |
1 | House | Semi |
2 | House | NULL |
3 | NULL | Detached |
4 | NULL | Semi |
The above cases 3 and 4 are not possible because:
Detached => House and
Semi => House.
The next example shows how the many-to-one implication rule is generalised further.
Row | Meta | Generic | Orientation |
---|---|---|---|
1 | Bar | Code | Up |
2 | Bar | String | Down |
3 | Word | String | Up |
4 | Word | Code | Down |
Now we have 3 identifier items. These are {Meta, Generic, Orientation}. There is no many-to-one between any pair of identifier items. However, if we union Meta and Generic we find:
Row | MetaGeneric | Orientation |
---|---|---|
1 | BarCode | Up |
2 | BarString | Down |
3 | WordString | Up |
4 | WordCode | Down |
Now we see that 'BarCode | WordString' => 'Up' and 'BarString | WordCode' => 'Down' provide valid implications. Thus 'BarCode | WordString' does not imply 'NULL', etc.
So, the general many-to-one implication rule that is applied is that if a union of one-or-more identifier item columns has a many-to-one relationship to a different union of one-or-more identifier item columns, this provides an implication set.
The set of possible identifiers is the subset of allowable identifiers that is not disallowed by one or more of the many-to-one implication rules in the implication set.
For example, 500 fully-qualified identifiers might give around 500,000 allowables, which in turn give 10,000 permissible identifiers.
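The following sketch illustrates one way the many-to-one implication rule could be applied: a functional (many-to-one) relationship from one union of columns to another is detected in the fully-qualified table, and any allowable identifier that specifies the determining columns while leaving a determined column NULL is discarded. The data layout and function names are assumptions for illustration.

```python
# Sketch: step 2d - detect a many-to-one implication (columns X determine
# columns Y) in the fully-qualified table, and discard allowable identifiers
# that specify X but leave any column of Y as NULL.

def is_many_to_one(rows, x_cols, y_cols):
    """True if each value-combination of x_cols maps to exactly one
    value-combination of y_cols in the fully-qualified rows."""
    mapping = {}
    for row in rows:
        key = tuple(row[c] for c in x_cols)
        val = tuple(row[c] for c in y_cols)
        if mapping.setdefault(key, val) != val:
            return False
    return True

def permissible(allowables, x_cols, y_cols):
    """Keep only identifiers not disallowed by the implication X => Y."""
    return [ident for ident in allowables
            if not (all(ident[c] is not None for c in x_cols)
                    and any(ident[c] is None for c in y_cols))]

# The House/Detached example from the text:
fully_qualified = [{"Meta": "House", "Generic": "Detached"},
                   {"Meta": "House", "Generic": "Semi"}]
assert is_many_to_one(fully_qualified, ["Generic"], ["Meta"])  # Detached => House, Semi => House

allowables = [{"Meta": "House", "Generic": "Detached"},
              {"Meta": "House", "Generic": None},
              {"Meta": None, "Generic": "Detached"}]   # case 3: disallowed
print(permissible(allowables, ["Generic"], ["Meta"]))  # only the first two remain
```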
Step 2e - Derive the discriminating identifiers from the permissible identifiers
The next stage is to reduce the set of permissible identifiers to the set of discriminating identifiers.
Suppose we have the following taxonomy:

A
|
AB
|
ABC
/    \
ABCD    ABCE
The identifier 'AB' is permissible but is not discriminating because it is not a leaf node and it only has one immediate more fully-qualified descendent. In other words, ABC carries no more information than AB and thus node 'AB' is not discriminating.
So, the invention will work with a taxonomy based on the allowable identifier set. However, it is more efficient for it to work with a taxonomy based on the permissible set of identifiers. It is even more efficient for it to work with the discriminating set of identifiers. Thus, in the preferred embodiment, the taxonomy will be based on a discriminating set of categories, according to the above definition.
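As a hedged sketch of this elimination, assuming the taxonomy edges are held as a parent-to-children mapping (which is not necessarily the patent's representation), non-discriminatory nodes can be filtered out as follows:

```python
# Sketch: step 2e - drop nodes that are not leaf nodes and have exactly one
# immediate more-qualified descendant; the root node is always retained.

def discriminating_nodes(edges, root):
    """edges: dict mapping each node to its immediate more-qualified
    descendants.  Returns the set of nodes to retain."""
    keep = set()
    for node, children in edges.items():
        if node == root or len(children) == 0 or len(children) > 1:
            keep.add(node)
    return keep

# The A / AB / ABC / {ABCD, ABCE} example from the text:
edges = {
    "A": ["AB"],
    "AB": ["ABC"],       # not discriminating: not a leaf, single descendant
    "ABC": ["ABCD", "ABCE"],
    "ABCD": [],
    "ABCE": [],
}
print(sorted(discriminating_nodes(edges, root="A")))
# ['A', 'ABC', 'ABCD', 'ABCE']  -- 'AB' is eliminated
```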
The discriminating set may not be the optimum classification schema. As anyone experienced in say Bayesian Optimal Classification will know, taxonomies can be transformed into different taxonomies. One of these alternative taxonomies could be more efficient than the original taxonomy based on the discriminating set. However, the transformation of taxonomies in such manner is outside of the sphere of interest of the invention, as this is well known in the prior art.
In summary, the preferred set of nodes to be used by the qualification-level based taxonomy is the discriminating nodes as defined by the above rules.
Clearly, it is possible in alternative preferred embodiments for not all of the above rules to be used. It is also anticipated that even further rules may be added to eliminate nodes that pass the above rules.
Claims (15)
1. A method of processing data that has been characterised using multiple taxonomies in order to combine and reconcile the data, the method comprising: receiving source data from at least two differing taxonomies; defining a set of allowable identifier items, where each identifier item is a single string or primitive type or other code, where each identifier item indicates which child node in the taxonomy a datum is identifiable with given the parent node in the taxonomy that the datum is identifiable with where that identifier item applies, where an identifier consists of the set of identifier items for which valid identification information exists for that datum, and where each datum of the source data comprises both an identifier and the datum information, wherein at least some of the source data has incomplete identifiers with one or more identifier items not being included; defining a qualification value for each identifier, the qualification value being equal to the number of identifier items that contain valid information, and an identifier having a complete set of identifier items being categorised as fully qualified; merging the multiple taxonomies for the source data by making each child node in a combined taxonomy have a one higher qualification value than the parent node from which it stems; and determining a probability for a specific fully qualified identification of a datum having incomplete identifiers as given by a suitable statistical or other logical representation of the combined effect of the probability of each descendant child node leading down the taxonomy from the respective incomplete identity node in the taxonomy to the specific fully-qualified node in the taxonomy, when the taxonomies are merged using the defined qualification value ordering; such that all source data is associated probabilistically with fully-qualified nodes of the combined taxonomy.
2. A method as claimed in claim 1, wherein the step of defining a list of allowable identifier items comprises determining all possible identifier items in the source data such that each allowable identifier item consists of an aggregate of the identifier items of all data sources.
3. A method as claimed in claim 1 or 2, comprising: building the combined taxonomy for the reconciled data based on an analysis of the allowable identifier items to determine allowable identifiers.
4. A method as claimed in claim 3, wherein the combined taxonomy comprises a root node with qualification value of zero and/or nodes assigned to all fully-qualified identifiers.
5. A method as claimed in claim 4, wherein there is a root node with qualification value of zero and nodes assigned to all fully-qualified identifiers, and these nodes are combined with intermediate nodes that are assigned based on an analysis of possible combinations of identifier items in part qualified identifiers.
6. A method as claimed in claim 3, 4 or 5, comprising: eliminating impossible identifiers by identifying all fully qualified identifiers, generating a list of potential part qualified identifiers for each fully qualified identifier, eliminating part qualified identifiers that are not possible, and then assigning nodes to the remaining, possible, part qualified identifiers.
7. A method as claimed in claim 6, wherein the step of eliminating impossible part qualified identifiers utilises a many-to-one implication rule.
8. A method as claimed in any of claims 3 to 7, comprising: eliminating non-discriminatory nodes of the combined taxonomy, the non-discriminatory nodes being nodes that are not leaf nodes and have only one less qualified immediate parent plus one more qualified immediate descendent.
9. A method as claimed in any preceding claim, wherein the step of determining a probability for a specific fully qualified identification of a datum having incomplete identifiers comprises determining probabilities based on sample data of known frequencies for the branches of the taxonomy.
10. A method as claimed in any of claims 1 to 9, wherein the step of determining a probability for a specific fully qualified identification of a datum having incomplete identifiers comprises determining probabilities based on a cross-correlation of the data across the whole taxonomy.
11. A data processing apparatus configured to perform the method of any preceding claim.
12. A data processing apparatus for processing data that has been characterised in multiple taxonomies in order to combine and reconcile the data, the apparatus comprising: a data receiver for receiving source data from at least two differing taxonomies; and a processor arranged to: define a set of allowable identifier items, where each identifier item is a single string or primitive type or other code, where each identifier item indicates which child node in the taxonomy a datum is identifiable with given the parent node in the taxonomy that the datum is identifiable with where that identifier item applies, where an identifier consists of the set of identifier items for which valid identification information exists for that datum, and where each datum of the source data comprises both an identifier and the datum information, wherein at least some of the source data has incomplete identifiers with one or more identifier items not being included; define a qualification value for each identifier, the qualification value being equal to the number of identifier items that contain valid information, and an identifier having a complete set of identifier items being categorised as fully qualified; merge the multiple taxonomies for the source data by making each child node in a combined taxonomy have a one higher qualification value than the parent node from which it stems; and to determine a probability for a specific fully qualified identification of a datum having incomplete identifiers as given by a suitable statistical or other logical representation of the combined effect of the probability of each descendant child node leading down the taxonomy from the respective incomplete identity node in the taxonomy to the specific fully-qualified node in the taxonomy, when the taxonomies are merged using the defined qualification value ordering; such that all source data is associated probabilistically with fully-qualified nodes of the combined taxonomy.
13. An apparatus as claimed in claim 12, wherein the processor is arranged to carry out the method of any of claims 1 to 10.
14. A computer programme product comprising instructions that when executed on a data processing apparatus will configure the data processing apparatus to perform the method of any of claims 1 to 10.
15. A method of processing data that is in multiple taxonomies in order to combine and reconcile the data, the method being substantially as hereinbefore described.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
IE20120488A IE20120488A1 (en) | 2012-10-17 | 2012-10-17 | Multi-taxonomy merger algorithm |
PCT/GB2013/052610 WO2014060718A2 (en) | 2012-10-17 | 2013-10-08 | Multi taxonomy merger algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
IE20120488A IE20120488A1 (en) | 2012-10-17 | 2012-10-17 | Multi-taxonomy merger algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
IE20120488A1 true IE20120488A1 (en) | 2014-04-23 |
Family
ID=50514850
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
IE20120488A IE20120488A1 (en) | 2012-10-17 | 2012-10-17 | Multi-taxonomy merger algorithm |
Country Status (1)
Country | Link |
---|---|
IE (1) | IE20120488A1 (en) |
- 2012-10-17 IE IE20120488A patent/IE20120488A1/en not_active Application Discontinuation
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20230409835A1 (en) | Discovering a semantic meaning of data fields from profile data of the data fields | |
CN110096494B (en) | Profiling data using source tracking | |
Maletic et al. | Data cleansing | |
Sun et al. | Feature selection using rough entropy-based uncertainty measures in incomplete decision systems | |
JP5193061B2 (en) | Method and system for enhancing matching from customer-driven queries | |
CN102402615B (en) | Method for tracking source information based on structured query language (SQL) sentences | |
US20040122841A1 (en) | Method and system for evaluating intellectual property | |
CN103064970B (en) | Optimize the search method of interpreter | |
CN101251825B (en) | Device and method for generating test use case | |
Pham et al. | The structure of the computer science knowledge network | |
Zhang et al. | Mining indirect antagonistic communities from social interactions | |
Kricke et al. | Graph data transformations in Gradoop | |
Zhang et al. | Logistics service supply chain order allocation mixed K-Means and Qos matching | |
US20240095219A1 (en) | Techniques for discovering and updating semantic meaning of data fields | |
CN110781213B (en) | Multi-source mass data correlation searching method and system with personnel as center | |
WO2014060718A2 (en) | Multi taxonomy merger algorithm | |
KR101985961B1 (en) | Similarity Quantification System of National Research and Development Program and Searching Cooperative Program using same | |
CN112527813A (en) | Data processing method and device of business system, electronic equipment and storage medium | |
CN113312410B (en) | Data map construction method, data query method and terminal equipment | |
IE20120488A1 (en) | Multi-taxonomy merger algorithm | |
Jaradat et al. | A best-effort integration framework for imperfect information spaces | |
US20210004490A1 (en) | Method for anonymizing personal information in big data and combining anonymized data | |
Sehili et al. | Multi-party privacy preserving record linkage in dynamic metric space | |
CN1760863A (en) | Lookup method of protecting consistency of contour based on information technology products of relational database | |
CN116452014B (en) | Enterprise cluster determination method and device applied to city planning and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FC9A | Application refused sect. 31(1) |