WO2008029156A1 - Categorisation of data using multiple categorisation engines - Google Patents

Categorisation of data using multiple categorisation engines

Info

Publication number
WO2008029156A1
Authority
WO
WIPO (PCT)
Prior art keywords
categorisation
engines
input data
data object
scores
Prior art date
Application number
PCT/GB2007/003384
Other languages
French (fr)
Inventor
Eric Zigmund Sandler
Yuriy Byurher
Original Assignee
Xploite Plc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GB0624666A external-priority patent/GB2442287A/en
Application filed by Xploite Plc filed Critical Xploite Plc
Publication of WO2008029156A1 publication Critical patent/WO2008029156A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes


Abstract

A method for categorising an input data object using a plurality of different categorisation engines, including the steps of: each categorisation engine calculating a score for the input data object for a plurality of categories; and categorising the input data object based at least in part on the calculated scores. The categorisation engines may include a Bayesian engine, a support vector machine, or a statistical engine.

Description

CATEGORISATION OF DATA USING MULTIPLE CATEGORISATION ENGINES
Field of the Invention
The present invention relates to a method and system for categorising data using multiple categorisation engines; particularly, but not exclusively, by combining the scores of the engines.
Background
Categorisation of content such as web pages is useful for searching for information and for filtering information.
Traditionally web pages have been categorised by collating categorisation suggestions from human users. An example of a system created by this method includes dmoz.org.
This method has several disadvantages. Firstly, the process overlooks some web pages or classes of web pages due to lack of user input for those classes. Secondly, multi-user input results in a lack of consistency of classification. And thirdly, the human time cost of classifying large sets of web pages, such as the majority of the Internet, is very high.
Automated methods for categorising web pages have been developed.
These methods include Bayesian algorithms, support vector machines (SVM), rule-based classifiers, and statistical classifiers.
Unfortunately these methods often do not produce accurate categorisation results. With the learning algorithms, such as Bayesian and SVM, significant effort is required to prepare the algorithms to classify a category by refining a vocabulary for the algorithms and refining the training set. The rule-based classifiers require significant human input to create rules, and they remain ineffective because it is nearly impossible to manually generate comprehensive rules.
Known statistical classifiers are very inaccurate, producing less than 50% accurate categorisation.
Therefore there is a need to enhance or boost the accuracy of these methods.
There are several known methods for enhancing the accuracy of a categorisation method.
The first is called classifier voting. Multiple classifiers are generated by varying the input parameters of a known classifier for a category. Each generated classifier is provided with the input data. The classifier categorises the input data as belonging to the category or not belonging to the category. These "votes" are then tabulated. If the votes exceed a defined threshold the input data is classified as belonging to the category.
A variation of this method weights the votes of each classifier based on how successful the classifier has been in the past.
The second method uses a combination of classifier voting with learning algorithms. Multiple instances of a learning algorithm are trained on different training sets to produce multiple classifiers. The multiple classifiers are then used in the classifier voting or weighted voting method.
The disadvantage with these methods is that they remain less accurate than manual categorisation and, in the case of learning algorithms, several training sets are required.
It is an object of the present invention to provide a method for improving the categorisation of data by using information from multiple categorisation engines, or to at least provide a useful alternative.
Summary of the Invention
According to a first aspect of the invention there is provided a method for categorising an input data object using a plurality of different categorisation engines, including the steps of: i) each categorisation engine calculating a score for the input data object for a plurality of categories; and ii) categorising the input data object based at least in part on the calculated scores.
The scores are preferably non-binary values. The scores may be combined or selected to categorise the input data object.
Preferably, three or more categorisation engines are used.
It is also preferred that at least some of the categorisation engines are learning engines. Each learning engine may be trained using feature vectors of the same type. The type of feature vectors used may be thematic feature vectors such as "bag of words".
Each learning engine may be trained on the same training set.
The input data object may be categorised on the basis of the calculated scores and a second set of calculated scores. The second set of calculated scores is preferably calculated by a second set of categorisation engines. Each engine in the second set is preferably a different categorisation engine. It is also preferred that the second set of engines is comprised of learning engines. The second set of engines may be trained on structural feature vectors.
The method may include a step of determining a weighting for each category by combining all the scores for that category and comparing the combination to the combined value of all the scores. The weighting may form the basis for categorising the input data object. The input data object may be categorised within a category if the weighting for that category meets a predefined threshold. The input data object may be categorised within a category if the weighting is equal, within an error margin, to the highest weighting.
The input data object may be categorised within a category if the score for that category meets a predefined threshold. The input data object may be categorised within a category if the score is equal, within an error margin, to the highest score.
It is preferred that one or more of the categorisation engines are selected from the set of fast word statistics algorithm, Bayesian algorithm, and support vector machine.
Preferably, a neural network uses the calculated scores to categorise the input data object. The neural network may be previously trained on at least one pattern comprising at least one set of scores and a set of categories; wherein the set of scores form the inputs for the neural network and the set of categories form the desired outputs. Each set of scores in a pattern may be calculated from a training set of documents by a categorisation engine.
Brief Description of the Drawings
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawing in which:
Figure 1 : shows a schematic diagram illustrating an embodiment of the invention.
Figure 2: shows a flow diagram illustrating one method of the invention.
Figure 3: shows a flow diagram illustrating a second method of the invention.
Figure 4: shows a flow diagram illustrating a third method of the invention.
Figure 5: shows a flow diagram illustrating a fourth method of the invention.
Detailed Description of the Preferred Embodiments
The present invention provides a method and system of categorising data using the scores of a number of different categorisation engines. A score for a number of categories is generated for the data by a number of different categorisation engines, such as a Bayesian engine, a support vector machine (SVM) and/or a statistical engine. The data is categorised in a category by one of the following methods (i) if a combined proportional score for the category meets a threshold, (ii) if the combined proportional category score meets a threshold and is equivalent to the highest combined proportional category score, (iii) if one score for the category meets a threshold, and (iv) if a score for the category meets a threshold and is equivalent to the highest category score.
A system for categorising an input document will now be described.
Figure 1 shows three categorisation engines 1, 2, and 3. Each categorisation engine is a different type of categorisation engine. For example, Engine 1 may be a Bayesian engine, Engine 2 may be an SVM, and Engine 3 may be a statistical engine. One statistical engine is described in the patent application CATEGORISATION OF DATA USING A MODEL.
The categorisation engines may be learning engines. The engines may be trained on the same training set.
In one embodiment there may be additional categorisation engines of the same type. For example, there may be two Bayesian engines. In this embodiment the engines of the same type may be trained on different training sets or each engine may be trained on different feature vectors. For example, one engine may be trained on word frequencies and the other engine may be trained on structural features. A structural feature categorisation engine is described in the patent application CATEGORISATION OF DATA USING STRUCTURAL ANALYSIS.
Each categorisation engine produces a list of categories and scores for each of those categories for the same input document 4.
A score processor 5 is also shown. The processor takes as input the list of categories and scores from each of the engines.
The scores may be normalised by each engine before being provided as input to the score processor 5.
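The description does not prescribe a particular normalisation scheme. As a purely illustrative sketch (in Python, chosen here only for brevity), each engine could rescale its raw category scores into the [0, 1] range by dividing by its largest raw score; the function name and the dictionary representation are assumptions for this example, not part of the patent.

```python
def normalise_scores(raw_scores):
    """Rescale one engine's raw category scores into [0, 1].

    raw_scores: dict mapping category -> non-negative raw score.
    Simple max-scaling is only one possible scheme; the description
    leaves the normalisation method open.
    """
    peak = max(raw_scores.values(), default=0.0)
    if peak == 0.0:
        return {category: 0.0 for category in raw_scores}
    return {category: score / peak for category, score in raw_scores.items()}
```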
The processor 5 then uses the scores to determine a list 6 of categories that the input document belongs to.
The list may contain many categories, one category or no categories.
It will be appreciated that instead of a document the system may be adapted for use with any data object such as an image file or data stream.
The processor 5 may determine the list 6 of categories by calculating a weighting for each category based on the scores generated by the engines and categorising the document in a category if the weighting of the category is the highest weighting and/or meets a threshold.
Alternatively, the processor 5 may determine the list 6 of categories by categorising the document in a category if a score for a category is the highest score and/or meets a threshold.
It will be appreciated that the processor 5 may determine the list 6 of categories using another method such as by providing the scores to a neural network. Five methods of how the processor 5 may determine a list 6 of categories from the scores of the categorisation engines will now be described.
Using the list 10 of categories and scores from each engine, a set P of pairs (c_i, s_i), where c_i is a category and s_i ∈ [0, 1] is a category score, can be created for the input document 4 in step 11 within Figures 2, 3, 4, and 5.
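For illustration, a minimal sketch of how the set P might be assembled from the engines' outputs. Python is used here only as an example language, and the data representation (one score dictionary per engine) is an assumption for the sketch rather than anything mandated by the description.

```python
from typing import Dict, List, Tuple

Category = str
PairSet = List[Tuple[Category, float]]  # the set P of (c_i, s_i) pairs

def build_pair_set(engine_outputs: List[Dict[Category, float]]) -> PairSet:
    """Collect every (category, score) pair reported by every engine.

    engine_outputs: one dict per categorisation engine, mapping each
    category c_i to its normalised score s_i in [0, 1].
    """
    pairs: PairSet = []
    for scores in engine_outputs:
        for category, score in scores.items():
            pairs.append((category, score))
    return pairs
```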
Referring to Figure 2, a first method (Boosting A) of categorising the input document will be described.
In step 12, for each unique category c_i a value p_i = sum(c_i, s_i) / sum(s_i) is calculated, where sum(c_i, s_i) is the sum of the scores s_i over all pairs in set P in which category c_i appears, and sum(s_i) is the sum of all scores s_i. The document is considered to belong to category c_i if p_i ≥ th1 in step 13, where th1 is a threshold value.
The threshold th1 may be predetermined by empirical methods. The threshold th1 may be 0.2.
The threshold value may change when different numbers of categories are used.
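A minimal sketch of Boosting A under the same assumed representation of set P as above; the function name and the default threshold of 0.2 (the example value mentioned in the text) are illustrative only.

```python
def boosting_a(pairs, th1=0.2):
    """Boosting A: keep categories whose proportional combined score meets th1.

    p_i = sum(c_i, s_i) / sum(s_i), i.e. the sum of scores for category c_i
    divided by the sum of all scores in P.
    """
    total = sum(score for _, score in pairs)
    if total == 0:
        return []
    categories = {category for category, _ in pairs}
    selected = []
    for category in categories:
        p = sum(score for c, score in pairs if c == category) / total
        if p >= th1:
            selected.append(category)
    return selected
```

In this scheme a category can pass the threshold even when no single engine is decisive, provided its share of the total score is large enough.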
Referring to Figure 3, a second method (Boosting B) of categorising the input document will be described.
In step 20, for each unique category c_i a value p_i = sum(c_i, s_i) / sum(s_i) is calculated, where sum(c_i, s_i) is the sum of the scores s_i over all pairs in set P in which category c_i appears, and sum(s_i) is the sum of all scores s_i. A value p' is then determined, equal to the maximum of all p_i. The document is considered to belong to category c_i if p_i ≥ th2 in step 21 and p_i = p' ± ε in step 22, where th2 is a threshold and ε is an error margin. The error margin may be a small margin such as 0.001.
The threshold th2 may be predetermined by empirical methods. The threshold th2 may be 0.2.
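A corresponding sketch of Boosting B, again under the assumed representation of P; the default threshold and error margin reflect the example values given above and are not mandated.

```python
def boosting_b(pairs, th2=0.2, epsilon=0.001):
    """Boosting B: the proportional score must meet th2 and also tie,
    within epsilon, with the highest proportional score p'."""
    total = sum(score for _, score in pairs)
    if total == 0:
        return []
    categories = {category for category, _ in pairs}
    p = {category: sum(score for c, score in pairs if c == category) / total
         for category in categories}
    p_max = max(p.values())
    return [category for category in categories
            if p[category] >= th2 and abs(p[category] - p_max) <= epsilon]
```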
Referring to Figure 4, a third method (Boosting C) of categorising the input document will be described.
The document is considered to belong to category c_i if its score s_i ≥ th3 in step 30, where th3 is a threshold value.
The threshold th3 may be predetermined by empirical methods. The threshold th3 may be equal to 0.7.
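Boosting C reduces to a per-score threshold test, as in this sketch (same assumptions as the earlier examples).

```python
def boosting_c(pairs, th3=0.7):
    """Boosting C: any single engine score s_i at or above th3 is enough
    to place the document in that category."""
    return sorted({category for category, score in pairs if score >= th3})
```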
Referring to Figure 5, a fourth method (Boosting D) of categorising the input document will be described.
A value s' is determined, equal to the maximum of all scores s_i. The document is considered to belong to category c_i if s_i ≥ th4 in step 40 and s_i = s' ± ε in step 41, where th4 is a threshold and ε is an error margin.
The threshold th4 may be empirically computed and may be equal to 0.2.
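Boosting D combines the per-score threshold with a tie against the maximum individual score s', as in this sketch (the helper name and default values are again only illustrative).

```python
def boosting_d(pairs, th4=0.2, epsilon=0.001):
    """Boosting D: a score must meet th4 and tie, within epsilon, with the
    highest individual score s' across all (category, score) pairs."""
    if not pairs:
        return []
    s_max = max(score for _, score in pairs)
    return sorted({category for category, score in pairs
                   if score >= th4 and abs(score - s_max) <= epsilon})
```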
A fifth method of categorising the input document will be described.
The scores of the categorization engines are used within the processor 5 to build an input for a trained artificial neural network (ANN) aggregator. The scores from each engine are used to build a corresponding part V_i of the input vector for the ANN aggregator. The size of each part V_i is equal to the number of categories. The element V_i,j of the input vector for the ANN aggregator is the score of category c_j calculated by categorization engine E_i.
The ANN aggregator is used to calculate an output vector OV. The size of the vector OV is equal to the number of categories. To determine the categories of a particular document, the following rule is used for each element OV_i: if OV_i meets a threshold value, the document is categorized within category c_i.
The ANN may be trained using the following process:
A set of patterns for the ANN is arranged using the trained categorization engines E1, E2, ... En. Each pattern is calculated for a particular document d ∈ D (where D is a training set of documents), and consists of the sets V1, V2, ... Vn and the set OV. The sizes of the sets V1 to Vn are equal to the number of categories. Each set V1 to Vn is arranged from the output of the corresponding categorization engine.
The size of the set OV is equal to the number of categories. The elements OV_i of the set OV may be calculated in accordance with the following formula:
OV_i = 1 if document d belongs to category c_i, and 0 otherwise.
The sets V1 to Vn are used as input for the ANN, with the set OV used as the desired output from the ANN.
The set of patterns is used to train the ANN.
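A sketch of how one training pattern might be assembled for the ANN aggregator, following the vector layout and the OV formula described above. The function name, the fixed category ordering, and the dictionary form of each engine's output are assumptions for illustration; any standard feed-forward network library could then be trained on the resulting (input, desired output) pairs.

```python
def build_ann_pattern(engine_scores, categories, true_categories):
    """Build one (input vector, desired output) training pattern for the ANN.

    engine_scores: one dict per trained engine E1..En, mapping each
        category to that engine's score for a particular training document d.
    categories: a fixed ordering of all categories (defines the vector layout).
    true_categories: the categories that document d actually belongs to.
    """
    # Input: parts V1..Vn concatenated; part Vi holds engine Ei's score for
    # every category, in the fixed category order (one element per category).
    input_vector = [scores.get(category, 0.0)
                    for scores in engine_scores
                    for category in categories]
    # Desired output OV: OV_i = 1 if d belongs to category c_i, 0 otherwise.
    desired_output = [1.0 if category in true_categories else 0.0
                      for category in categories]
    return input_vector, desired_output
```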
It will be appreciated that the methods and systems described could be implemented in hardware or in software. Where the methods or systems of the invention are implemented in software, any suitable programming language such as C++ or Java may be used. It will further be appreciated that data and/or processing involved within the methods and systems may be distributed across more than one computer system.
The methods Boosting A, B, C, and D have been used in the categorisation of actual web pages (HTML pages) to produce the following test results. Nine categories (chat & messaging, erotic, news, nudism & naturism, pornography, prostitution, shopping, software & downloads, weapons), which are typically utilized for blocking or filtering content on the internet for employees of an organisation or for minors, have been used for this test.
Three categorisation engines were used to provide scores for the categories for the methods. The three engines used were a Bayesian categorization algorithm, a Support Vector Machines (SVM) categorization algorithm, and a statistical algorithm detailed in the patent application CATEGORISATION OF DATA USING A MODEL.
To implement the test, each of the Bayesian, SVM, and statistical engines was first trained on a training set comprised of web pages categorised into one of the nine categories.
Each method of the invention and, for comparison purposes, each engine was tested against a test set comprising categorised web pages.
The training and testing sets contain raw HTML pages downloaded from the internet. The distribution of the web pages across the sets and categories is as follows:
[Table omitted in this text extraction: distribution of the web pages across the training and testing sets for each of the nine categories.]
The following table summarizes the accuracy results of the three categorization engines individually and the four methods (Boosting A, Boosting B, Boosting C, and Boosting D):
[Table omitted in this text extraction: per-category accuracy of the three engines individually and of Boosting A, B, C, and D.]
It can be seen from the results that the methods of the invention provide consistent accuracy over the categories.
Embodiments of the present invention have the following potential advantages:
• The accuracy of different categorization engines used together is better than the accuracy of the categorization engines used standalone.
• Each of the different categorization engines adds value to the composite accuracy.
• An increase in the size of the training set is not required, and a high-quality composite categorizer can be built using small training sets.
While the present invention has been illustrated by the description of the embodiments thereof, and while the embodiments have been described in considerable detail, it is not the intention of the applicant to restrict or in any way limit the scope of the appended claims to such detail. Additional advantages and modifications will readily appear to those skilled in the art. Therefore, the invention in its broader aspects is not limited to the specific details, representative apparatus and method, and illustrative examples shown and described. Accordingly, departures may be made from such details without departing from the spirit or scope of the applicant's general inventive concept.

Claims

1. A method for categorising an input data object using a plurality of different categorisation engines, including the steps of: i) each categorisation engine calculating a score for the input data object for a plurality of categories; and ii) categorising the input data object based at least in part on the calculated scores.
2. A method as claimed in claim 1 wherein the scores are non-binary values.
3. A method as claimed in any one of the preceding claims wherein the scores are combined to categorise the input data object.
4. A method as claimed in any one of the preceding claims wherein three or more categorisation engines are used.
5. A method as claimed in any one of the preceding claims wherein at least some of the categorisation engines are learning engines.
6. A method as claimed in claim 5 wherein each learning engine is trained using feature vectors of the same type.
7. A method as claimed in claim 6 wherein the type of feature vectors used are thematic feature vectors.
8. A method as claimed in claim 7 wherein the thematic feature vectors used are "bag of words".
9. A method as claimed in any one of claims 6 to 8 wherein each learning engine is trained on the same training set.
10. A method as claimed in any one of the preceding claims wherein the input data object is categorised on the basis of the calculated scores and a second set of calculated scores.
11. A method as claimed in claim 10 wherein the second set of calculated scores are calculated by a second set of categorisation engines.
12. A method as claimed in claim 11 wherein each engine in the second set is a different categorisation engine.
13. A method as claimed in any one of claims 11 to 12 wherein the second set of engines are learning engines.
14. A method as claimed in claim 13 wherein the second set of engines are trained on structural feature vectors.
15. A method as claimed in any one of the preceding claims including the step of determining a weighting for each category by combining all the scores for that category and comparing the combination to the combined value of all the scores.
16. A method as claimed in claim 15 wherein the weighting forms the basis for categorising the input data object.
17. A method as claimed in any one of claims 15 to 16 wherein the input data object is categorised within a category if the weighting for that category meets a predefined threshold.
18. A method as claimed in any one of claims 15 to 17 wherein the input data object is categorised within a category if the weighting is equal, within an error margin, to the highest weighting.
19. A method as claimed in any one of claims 1 to 15 wherein the input data object is categorised within a category if the score for that category meets a predefined threshold.
20. A method as claimed in any one of claims 1 to 15 and 19 wherein the input data object is categorised within a category if the score is equal, within an error margin, to the highest score.
21. A method as claimed in any one of the preceding claims wherein one or more of the categorisation engines are selected from the set of fast word statistics algorithm, Bayesian algorithm, and support vector machine.
22. A method as claimed in any one of the preceding claims wherein a neural network uses the calculated scores to categorise the input data object.
23. A method as claimed in claim 22 wherein the neural network is previously trained on at least one pattern comprising at least one set of scores and a set of categories; wherein the set of scores form the inputs for the neural network and the set of categories form the desired outputs.
24. A method as claimed in claim 23 wherein each set of scores in a pattern are calculated from a training set of documents by a categorisation engine.
25. A system for categorising an input data object, including: a plurality of different categorisation engines, each categorisation engine arranged for calculating a score for the input data object for a plurality of categories; and a processor arranged for categorising the input data object based at least in part on the calculated scores of the categorisation engines.
26. A system arranged for performing the method of any one of claims 1 to 24.
27. A computer program arranged for performing the method or system of any one of the preceding claims.
28. Storage media arranged for storing a computer program as claimed in claim 27.
PCT/GB2007/003384 2006-09-07 2007-09-07 Categorisation of data using multiple categorisation engines WO2008029156A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
UAA200609647 2006-09-07
UA200609647 2006-09-07
GB0624666A GB2442287A (en) 2006-09-07 2006-12-11 Categorisation of data using multiple categorisation engines
GB0624666.4 2006-12-11

Publications (1)

Publication Number Publication Date
WO2008029156A1 (en) 2008-03-13

Family

ID=38736054

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2007/003384 WO2008029156A1 (en) 2006-09-07 2007-09-07 Categorisation of data using multiple categorisation engines

Country Status (1)

Country Link
WO (1) WO2008029156A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390094A (en) * 2018-04-20 2019-10-29 伊姆西Ip控股有限责任公司 Method, electronic equipment and the computer program product classified to document

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1626356A2 (en) * 2004-08-13 2006-02-15 Microsoft Corporation Method and system for summarizing a document

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1626356A2 (en) * 2004-08-13 2006-02-15 Microsoft Corporation Method and system for summarizing a document

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SEBASTIANI F: "Machine Learning in Automated Text Categorization", ACM COMPUTING SURVEYS, ACM, NEW YORK, NY, US, vol. 34, no. 1, March 2002 (2002-03-01), pages 1 - 47, XP002280034, ISSN: 0360-0300 *
TSYMBAL A ET AL: "Handling Local Concept Drift with Dynamic Integration of Classifiers: Domain of Antibiotic Resistance in Nosocomial Infections", COMPUTER-BASED MEDICAL SYSTEMS, 2006. CBMS 2006. 19TH IEEE INTERNATIONAL SYMPOSIUM ON SALT LAKE CITY, UT, USA 22-23 JUNE 2006, PISCATAWAY, NJ, USA,IEEE, 22 June 2006 (2006-06-22), pages 679 - 684, XP010923993, ISBN: 0-7695-2517-1 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390094A (en) * 2018-04-20 2019-10-29 伊姆西Ip控股有限责任公司 Method, electronic equipment and the computer program product classified to document
CN110390094B (en) * 2018-04-20 2023-05-23 伊姆西Ip控股有限责任公司 Method, electronic device and computer program product for classifying documents

Similar Documents

Publication Publication Date Title
CN111177374B (en) Question-answer corpus emotion classification method and system based on active learning
CN104750844B (en) Text eigenvector based on TF-IGM generates method and apparatus and file classification method and device
CN104391835B (en) Feature Words system of selection and device in text
CN105468713B (en) A kind of short text classification method of multi-model fusion
CN110287328B (en) Text classification method, device and equipment and computer readable storage medium
CN107908715A (en) Microblog emotional polarity discriminating method based on Adaboost and grader Weighted Fusion
CN106156163B (en) Text classification method and device
CN109271520B (en) Data extraction method, data extraction device, storage medium, and electronic apparatus
CN111538828B (en) Text emotion analysis method, text emotion analysis device, computer device, and readable storage medium
CN108052505A (en) Text emotion analysis method and device, storage medium, terminal
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
CN107180084A (en) Word library updating method and device
CN105930416A (en) Visualization processing method and system of user feedback information
CN110825850B (en) Natural language theme classification method and device
CN109858034A (en) A kind of text sentiment classification method based on attention model and sentiment dictionary
CN109598307A (en) Data screening method, apparatus, server and storage medium
CN109960791A (en) Judge the method and storage medium, terminal of text emotion
CN108268470A (en) A kind of comment text classification extracting method based on the cluster that develops
CN114564582A (en) Short text classification method, device, equipment and storage medium
CN114896392A (en) Work order data clustering method and device, electronic equipment and storage medium
CN103514168B (en) Data processing method and device
Drishya et al. Cyberbully image and text detection using convolutional neural networks
WO2023082698A1 (en) Public satisfaction analysis method, storage medium, and electronic device
WO2008029156A1 (en) Categorisation of data using multiple categorisation engines
GB2442286A (en) Categorisation of data e.g. web pages using a model

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07804184

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07804184

Country of ref document: EP

Kind code of ref document: A1