GB2442286A - Categorisation of data e.g. web pages using a model - Google Patents


Info

Publication number
GB2442286A
GB2442286A (application GB0624665A)
Authority
GB
United Kingdom
Prior art keywords
data object
category
patterns
model
input data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0624665A
Other versions
GB0624665D0 (en)
Inventor
Yuriy Byurher
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
FUJIN TECHNOLOGY PLC
XPLOITE PLC
Original Assignee
FUJIN TECHNOLOGY PLC
XPLOITE PLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by FUJIN TECHNOLOGY PLC, XPLOITE PLC
Publication of GB0624665D0
Priority to PCT/GB2007/003370 (WO2008029150A1)
Publication of GB2442286A
Status: Withdrawn


Classifications

    • G06F16/95 Retrieval from the web
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G06F16/954 Navigation, e.g. using categorised browsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for categorising input data objects e.g. web pages using a model comprising a plurality of patterns e.g. words associated with at least one weighting for a category, includes the steps of: identifying patterns within the input data object that correspond to at least some of the patterns within the model; for each identified pattern, determining a weighting for at least one category from the model; calculating a score for the input data object for at least one category based at least in part on the weightings of the identified patterns and the frequency of the identified patterns within the input data object; and categorising the input data object based at least in part on the calculated score. Also disclosed is a method for generating a categorisation model.

Description

CATEGORISATION OF DATA USING A MODEL
Field of Invention
The present invention relates to a method and system for generating a categorisation model, and for categorising data using the model.
Background
Categorisation of content such as web pages is useful for searching for information and for filtering information, such as filtering web pages from user internet requests so as to exclude inappropriate material.
Traditionally, web pages have been categorised by collating categorisation suggestions from human users. An example of a system created by this method is dmoz.org.
This method has several disadvantages. Firstly, the process overlooks some web pages or classes of web pages due to a non-systematic approach to categorisation. Secondly, multi-user input results in a lack of consistency of classification. And thirdly, the human time cost of classifying large sets of web pages, such as the majority of the Internet, is very high.
Therefore this method is particularly unsuitable for producing a database which is capable of filtering web pages resulting from user internet requests.
Automated methods for categorising web pages have been explored. Two popular methods are the use of Bayesian algorithms and the use of support vector machines (SVM).
Each method extracts feature vectors from a training set of pre-categorised web pages. A feature vector is a vector of numeric features of objects within the training set. Feature vectors may include the occurrence of words or phrases, link information and image information.
It should be noted that it is difficult to create a "perfect" training set in which the categorised web pages contain no noise (features within a web page that contradict its categorisation).
A number of implementations of the automated methods use occurrence of words as their feature vectors and extract words from the web pages in the training set to build a vocabulary.
A large vocabulary will result in high dimensionality of feature vectors. High dimensionality of feature vectors can cause the automated methods to overfit.
Overfitting is a phenomenon by which the classifier is tuned also to the contingent, rather than just the constitutive characteristics of the training set.
Classifiers which overfit the training data tend to be good at re-classifying the data they have been trained on, but much worse at classifying previously unseen data.
Overfitting is a significant problem for the Bayes algorithm method. This algorithm learns from the training set the conditional probability of each word for each category. A new web page is categorised within the category with the highest posterior probability computed according to the Bayes rule. When the number of training samples (web pages) is insufficient with respect to the number of features (words) used, the probabilities learnt may reflect noise in the training set and cannot be trusted to produce accurate categorisation.
The SVM method uses the feature vectors in the vocabulary to determine a hyperplane for each category. Each category hyperplane is defined by support vectors on the edge of the hyperplane. A category hyperplane is used to categorise a new web page as either within the category or not.
Some commentators in the literature suggest that SVMs are also susceptible to overfitting (Chen Lin et al., An Anti-Noise Text Categorization Method based on Support Vector Machines; Yoshua Bengio et al., The Curse of Dimensionality for Local Kernel Machines).
To prevent overfitting for both methods, the vocabulary (collection of feature vectors) will often need to be reduced in size.
The vocabulary is reduced by setting feature relevance thresholds. Selecting thresholds for relevance criteria is a complex task. It is dependent on the size and quality of the training set. As an example, thresholds can relate to the exclusion of common words (stop-words), replacement of words with their stems, and exclusion of very rare words.
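For illustration, the three reduction steps named above (stop-word exclusion, replacement of words with their stems, and exclusion of very rare words) can be sketched as follows. This is a minimal sketch only: the stop-word list, the crude suffix stemmer and the function name are illustrative assumptions, not part of any of the systems being discussed.

```python
def reduce_vocabulary(words, stop_words=frozenset({"the", "a", "of", "and"}),
                      min_count=2):
    """Apply the three conventional reduction steps: drop stop-words,
    replace words with (crude) stems, and drop very rare words."""
    def stem(word):
        # Illustrative suffix stripping only; real systems would use a
        # proper stemmer such as the Porter algorithm.
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    stemmed = [stem(w.lower()) for w in words if w.lower() not in stop_words]
    counts = {}
    for w in stemmed:
        counts[w] = counts.get(w, 0) + 1
    # Keep only words that are not "very rare" under the chosen threshold.
    return [w for w in stemmed if counts[w] >= min_count]
```

As the sketch suggests, each step depends on a tuned choice (stop-word list, stemming rules, rarity threshold), which is why threshold selection is described above as a complex, training-set-dependent task.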
Even with thresholds, the vocabulary will generally need to be tuned by an expert.
The consequence of these difficulties with the Bayesian and SVM methods is that they must be trained using large, high-quality training sets, and that significant user intervention is required to tune the methods for effective categorisation.
When the quality of the training set affects the quality of the categorisation method, extra effort must be expended by the human user to "clean" the training set. In addition, it is much more difficult to assist these methods to dynamically learn using new training data.
There is a desire for a method of categorising content, such as documents, web pages or any data object, which can utilise low quality training data.
It is an object of the present invention to provide a method for generating a categorisation model and categorising data which overcomes the disadvantages of the above methods, or at least provides a useful alternative.
Summary of the Invention
According to a first aspect of the invention there is provided a method for categorising an input data object using a model comprising a plurality of patterns associated with at least one weighting for a category, including the steps of: i) identifying patterns within the input data object that correspond to at least some of the patterns within the model; ii) for each identified pattern, determining a weighting for at least one category from the model; iii) calculating a score for the input data object for at least one category based at least in part on the weightings of the identified patterns and the frequency of the identified patterns within the input data object; and iv) categorising the input data object based at least in part on the calculated score.
Preferably at least some of the identified patterns are associated with a plurality of weightings for a plurality of categories.
A plurality of scores may be calculated for the input data object for a plurality of categories.
It is preferred that the input data object is categorised in dependence on the calculated score only when the calculated score meets a predefined threshold. The predefined threshold may be an empirically-derived threshold.
The score s_j for a category c_j may be calculated as follows:
s_j = score(c_j) / |P|
where |P| is the number of identified patterns in the input data object, and score(c_j) is the sum of the weightings for all identified patterns associated with the category c_j.
The patterns may be words and the input data object may be a document or a web page.
It is preferred that the model is generated in accordance with the second aspect of the invention.
According to a second aspect of the invention there is provided a method for generating a model for categorising input data objects, including the steps of: i) associating each data object of a plurality of data objects with one of a plurality of categories; ii) extracting a plurality of patterns from the plurality of data objects; iii) calculating a weighting for each pattern for at least one category based at least in part on the frequency of the pattern within data objects associated with that category compared to the frequency of the pattern within all data objects; and iv) inserting each weighting into the model.
Preferably the weighting is only inserted into the model if the weighting meets a predefined threshold. The predefined threshold may be an empirically-derived threshold.
It is also preferred that each data object is only associated with one category.
A weighting may be calculated for one or more of the patterns for a plurality of categories.
The weighting w_ij for a pattern w_i for a category c_j may be calculated as follows:
w_ij = count(w_i, c_j) / count(w_i)
where count(w_i, c_j) is the frequency of the pattern in all data objects associated with the category c_j, and count(w_i) is the frequency of the pattern in all data objects.
The patterns may be words and the data objects may be documents or web pages.
Brief Description of the Drawings
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:
Figure 1 shows a schematic diagram illustrating a method of generating a categorisation model in accordance with an embodiment of the invention; and
Figure 2 shows a schematic diagram illustrating a method of categorising a document using a model in accordance with an embodiment of the invention.
Detailed Description of the Preferred Embodiments
The present invention provides a method and system for generating a categorisation model and for categorising data using the model.
A set of training documents, where each document is associated with a category, is used to train the model. A weighting for each word in the documents associated with each category is calculated based on the frequency of the word within the category compared to the combined frequency of the word in all categories. The (word, category, weighting) tuple is inserted into the model if the weighting meets a threshold.
A new document is categorised by generating a weighting for each category by combining the weightings, extracted from the model, of each word in the document which is paired with that category. The document is categorised within a category if the generated weighting meets a threshold.
The present invention will be described with reference to the categorisation of web pages using word frequency as feature vectors. However, it will be appreciated that the invention may be used to categorise any type of document or data object, such as a Word document, an XML document, an image or a data stream. It will also be appreciated that feature vectors other than word frequency may be utilised, including the frequency of phrases, structural elements, or any other pattern. The use of structural elements as feature vectors in categorisation is described in the patent application CATEGORISATION OF DATA USING STRUCTURAL ANALYSIS.
Referring to Figure 1, an embodiment of the invention for creating a categorisation model 10 for a set of categories C will now be described.
A training set 11 is used to create the model 10 for categorisation.
The training set 11 can be an existing training set or can be compiled by manual or automatic means such as by querying a search engine with queries created for each category.
The training set 11 is comprised of a plurality 12 of documents D, where each document d_q ∈ D is associated 13 with only one category in the set C. In some embodiments the documents may be associated with multiple categories, and/or may be associated with a category or categories in accordance with a defined weight. The weight may affect the weight given to the patterns extracted from that document. Each document d_q ∈ D appears as a sequence of words w_i ∈ W. The words may be from a single language, from a mixture of languages, or from an invented language such as web script. It will be further appreciated that, in alternative embodiments, any patterns could be used in place of words, such as phrases, code portions, or structural information about the document.
In step 14 the words within the documents are extracted and each word w_i is associated with the category c_j that the corresponding document is associated with. The word-category association forms a pair (w_i, c_j) for each entry of the word w_i in the document. A set E is constructed comprising all pairs 15 from all the documents D. In step 16 the frequency of each word within each category is determined 17 and the combined frequency of each word within all categories is determined 18.
In step 19 a weighting w_ij is calculated 20 for each word for each category (each unique pair (w_i, c_j) in set E), equal to the proportion of the frequency of that word in that category to the combined frequency of the word in all categories.
The weighting w_ij may be calculated in accordance with the following formula:
w_ij = count(w_i, c_j) / count(w_i)
where count(w_i, c_j) is the number of entries of the pair (w_i, c_j) in the set E, and count(w_i) is the number of pairs in the set E which contain the word w_i.
The weighting w_ij is combined with the associated word and category (w_i, c_j) to form a (word, category, weighting) tuple (w_i, c_j, w_ij). If the weighting w_ij meets (is greater than or equal to) a threshold th0 in step 21, then the corresponding tuple (w_i, c_j, w_ij) is inserted 22 into the categorisation model 10; otherwise the tuple is discarded 23.
The threshold th0 may be predetermined by empirical methods. The threshold th0 may be 0.7.
In one embodiment the tuple for a weighting w_ij is inserted into the categorisation model only if count(w_i) meets a threshold th1 and the number of documents containing the word w_i meets a threshold th2.
The threshold th1 may be predetermined by empirical methods. The threshold th1 may be three.
The threshold th2 may also be predetermined by empirical methods. The threshold th2 may be ten.
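The model-generation steps described above, together with the thresholds th0, th1 and th2, can be sketched as follows. This is a minimal illustration, not the patented implementation; the function and variable names are assumptions, and the defaults use the example threshold values given above.

```python
from collections import Counter

def build_model(training_docs, th0=0.7, th1=3, th2=10):
    """training_docs: list of (words, category) pairs, one category per document.
    Returns a model mapping (word, category) to weight, keeping only tuples
    whose weight meets th0, whose word occurs often enough overall (th1), and
    whose word appears in enough documents (th2)."""
    pair_count = Counter()   # count(w_i, c_j): entries of the pair in set E
    word_count = Counter()   # count(w_i): entries of the word in set E
    doc_count = Counter()    # number of documents containing the word
    for words, category in training_docs:
        for w in words:
            pair_count[(w, category)] += 1
            word_count[w] += 1
        for w in set(words):
            doc_count[w] += 1
    model = {}
    for (w, c), n in pair_count.items():
        weight = n / word_count[w]   # w_ij = count(w_i, c_j) / count(w_i)
        if weight >= th0 and word_count[w] >= th1 and doc_count[w] >= th2:
            model[(w, c)] = weight
    return model
```

Note that for very small training sets the document-frequency threshold th2 would discard every word, so smaller thresholds would be passed in when experimenting with toy data.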
In one embodiment of the invention the categorisation model is used to train an artificial neural network (ANN). The ANN may be a standard feed-forward neural network with one input layer, one hidden layer and one output layer.
The sizes of the input and output layers are equal to M (the number of categories).
The size of the hidden layer is configurable (its minimum value may be equal to M).
A set of neural network patterns for training the neural network are first created using the model.
A neural network pattern is calculated for each document d_q in the training set D and consists of sets V1 and V2. V1 will form the input set for the neural network and V2 will form the output set for the neural network. Each document is a sequence of words w_i.
The set V1 is comprised of elements v1_1, v1_2, ..., v1_n, where n is the number of categories in set C. For each word w_i in the document d_q and for each category c_j, the weighting w_ij corresponding to the word w_i and the category c_j is extracted from the model. v1_j is assigned the sum of the weightings corresponding to all the words in the document for category c_j, divided by the total number of words in the document d_q.
The set V2 is comprised of elements v2_1, v2_2, ..., v2_n, where n is the number of categories in set C. If d_q is categorised within the training set as belonging to category c_j then v2_j is given the value "1"; otherwise v2_j is given the value "0".
After each neural network pattern is created the set of patterns may be used to train the neural network.
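The construction of a single neural network training pattern (the sets V1 and V2) can be sketched as follows; build_pattern and its argument names are illustrative assumptions, and the model is assumed to map (word, category) pairs to weights:

```python
def build_pattern(words, doc_category, categories, model):
    """Build one (V1, V2) training pattern for a document.
    V1[j]: sum of the model weights pairing the document's words with
           category j, divided by the total number of words in the document.
    V2[j]: 1.0 if the document belongs to category j in the training set,
           else 0.0."""
    v1 = [sum(model.get((w, c), 0.0) for w in words) / len(words)
          for c in categories]
    v2 = [1.0 if c == doc_category else 0.0 for c in categories]
    return v1, v2
```

The resulting (V1, V2) pairs would then be fed to any standard feed-forward network trainer; words absent from the model simply contribute a weight of zero.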
Referring to Figure 2, an embodiment of the invention for categorising a document 30 in accordance with a categorisation model 31 will be described.
The model 31 includes a set of categories C, a set of words W and a set of weights W', which associate words in the set W with categories in the set C. Each word in the set W can be associated with one or more categories in the set C:
Model = {(w_i, c_j, w_ij)}
where w_i is a word in the set W, c_j is a category in the set C, and w_ij ∈ [0, 1] is the weight of association between the word w_i and the category c_j. The model may be generated as described in Figure 1.
The words 33 w_i within the document 30 which correspond to words in the model are extracted from the document in step 32. This ensures that only words for which weightings exist are considered.
In step 34 each word w_i is replaced by a set of pairs (c_j, w_ij), where c_j ∈ C and w_ij are the weights of the word w_i associated with the category c_j within the model. The sets of pairs 35 form a set P. In steps 36 and 37 a score s_j for each category c_j is then calculated based on the combined weightings for each category compared to the total number of considered words w_i within the document 30.
The score s_j may be calculated as follows:
s_j = score(c_j) / |P|
where |P| is the number of elements in the set P, and score(c_j) is the sum of the weights w_ij for all pairs in the set P which contain the category c_j.
If the score for a category meets (is greater than or equal to) a threshold th3 in step 38, then the document is categorised 39 within that category.
If none of the scores meets the threshold then the web page cannot be categorised 40.
The threshold th3 may be predetermined by empirical methods. The threshold th3 may be 0.3.
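The categorisation steps 32 to 40, with the threshold th3, can be sketched as follows. This is a minimal illustration with assumed names; the model is taken to be a mapping from (word, category) pairs to weights.

```python
def categorise(words, model, th3=0.3):
    """Return {category: score} for every category whose score meets th3.
    Only words present in the model are considered; an empty result means
    the document cannot be categorised."""
    model_words = {w for (w, _c) in model}
    considered = [w for w in words if w in model_words]  # step 32
    if not considered:
        return {}                                        # step 40
    totals = {}
    for w in considered:                                 # step 34: pairs (c, w_ij)
        for (mw, c), weight in model.items():
            if mw == w:
                totals[c] = totals.get(c, 0.0) + weight  # score(c_j)
    scores = {c: t / len(considered) for c, t in totals.items()}  # steps 36-37
    return {c: s for c, s in scores.items() if s >= th3}          # step 38
```

A document containing no modelled words yields an empty dictionary, matching the "cannot be categorised" outcome described above.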
In an alternative embodiment, the neural network trained on the categorisation model will be used to categorise the input data object.
A neural network input set V1 is calculated for the input document 30.
The set V1 is comprised of elements v1_1, v1_2, ..., v1_n, where n is the number of categories in set C. For each word w_i in the input document 30 and for each category c_j, the weighting w_ij corresponding to the word w_i and the category c_j is extracted from the model 10. v1_j is assigned the sum of the weightings corresponding to all the words in the document for category c_j, divided by the total number of words in the input document. The input set V1 is provided to the trained neural network and an output set V2 is created. The set V2 is comprised of elements v2_1, v2_2, ..., v2_n, where n is the number of categories in set C. Each element is a value between zero and one. The input document is categorised in category c_j if v2_j meets a threshold th4.
The threshold value th4 may be predetermined and may be equal to 0.7.
It will be appreciated that the methods and systems described could be implemented in hardware or in software. Where the method or systems of the invention are implemented in software, any suitable programming language such as C++ or Java may be used. It will further be appreciated that data and/or processing involved within the methods and systems may be distributed across more than one computer system.
An example of the creation of a categorisation model in accordance with an embodiment of the invention will now be described.
The training set is comprised of four documents as follows:

    Document Name    Document Content    Document Category
    D1               AAA BBB AAA         C1
    D2               AAA AAA AAA         C1
    D3               AAA BBB BBB         C2
    D4               CCC BBB AAA         C2

A set of words w_i associated with categories c_j is extracted from the training set:

    w_i    c_j
    AAA    C1
    BBB    C1
    AAA    C1
    AAA    C1
    AAA    C1
    AAA    C1
    AAA    C2
    BBB    C2
    BBB    C2
    CCC    C2
    BBB    C2
    AAA    C2

A weighting w_ij is calculated for each unique pair of word and category (w_i, c_j) in the set E (note that AAA appears five times in documents associated with C1 and twice in documents associated with C2, out of seven occurrences in total):

    w_i    c_j    w_ij
    AAA    C1     5/7 ≈ 0.71
    BBB    C1     1/4 = 0.25
    BBB    C2     3/4 = 0.75
    AAA    C2     2/7 ≈ 0.29
    CCC    C2     1/1 = 1

The (word, category, weighting) tuple (w_i, c_j, w_ij) is added to the model M if the weighting is over the predetermined threshold (0.7):

    w_i    c_j    w_ij
    AAA    C1     0.71
    BBB    C2     0.75
    CCC    C2     1

An example of the categorisation of a document using a categorisation model in accordance with an embodiment of the invention will now be described.
For the purposes of this example, the categorisation model M will be used.
The document to be categorised is:

    Document Name    Document Content
    D5               CCC AAA XXX

A set P is created comprising the words in the document that are also in the model:

    w_i    c_j    w_ij
    CCC    C2     1
    AAA    C1     0.71
    XXX    -      -

As the word XXX is not in the model M, the number of words in P is 2 (|P| = 2).
A score s_j is calculated for each category c_j by summing the weights and dividing by the size of the set P:

    c_j    s_j
    C1     0.71/2 ≈ 0.36
    C2     1/2 = 0.5

A result category set is constructed from the scores that exceed the predetermined threshold (0.3):

    result = ((C1, 0.36), (C2, 0.5))

Therefore the document is categorised within C1 and C2.
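The model creation and D5 categorisation steps above can be checked end-to-end with a short script. This is a minimal sketch, not the patented implementation, and the th1/th2 document-frequency thresholds are omitted because the four-document toy training set is too small to meet them. Counting occurrences directly in the training documents, AAA appears five times under C1 and twice under C2, out of seven occurrences in total.

```python
from collections import Counter

training = [("AAA BBB AAA".split(), "C1"),
            ("AAA AAA AAA".split(), "C1"),
            ("AAA BBB BBB".split(), "C2"),
            ("CCC BBB AAA".split(), "C2")]

pair_count, word_count = Counter(), Counter()
for words, cat in training:
    for w in words:
        pair_count[(w, cat)] += 1
        word_count[w] += 1

# Keep only (word, category) weights meeting th0 = 0.7.
model = {(w, c): n / word_count[w]
         for (w, c), n in pair_count.items()
         if n / word_count[w] >= 0.7}

doc = "CCC AAA XXX".split()                       # document D5
model_words = {w for (w, _c) in model}
considered = [w for w in doc if w in model_words] # XXX is unknown, so |P| = 2
totals = Counter()
for w in considered:
    for (mw, c), weight in model.items():
        if mw == w:
            totals[c] += weight
# Keep only category scores meeting th3 = 0.3; both C1 and C2 qualify here.
result = {c: t / len(considered)
          for c, t in totals.items()
          if t / len(considered) >= 0.3}
```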
Another example of the categorisation of a document using a categorisation model in accordance with an embodiment of the invention will now be described.
For the purposes of this example, the categorisation model M will also be used.
The document to be categorised is:

    Document Name    Document Content
    D6               AAA XXX YYY ZZZ

A set P is created comprising the words in the document that are also in the model:

    w_i    c_j    w_ij
    AAA    C1     0.71
    XXX    -      -
    YYY    -      -
    ZZZ    -      -

As the words XXX, YYY and ZZZ are not in the model M, the number of words in P is 1 (|P| = 1).
A score s_j is calculated for each category c_j by summing the weights and dividing by the size of the set P:

    c_j    s_j
    C1     0.71/1 = 0.71

A result category set is constructed from the scores that exceed the predetermined threshold (0.3):

    result = ((C1, 0.71))

One embodiment of the invention, Fast Word Statistics (FWS), has been used in the categorisation of actual web pages (HTML pages) to produce the following test results. Four categories (weapons, chat, nudity, and pornography), which are typically utilised for blocking or filtering content on the internet for minors, were used in the test.
For the purposes of comparison, results from the Bayesian categorisation algorithm and the Support Vector Machines (SVM) categorisation algorithm were also generated. In this test the Bayesian and SVM algorithms used single categorisation mode while the embodiment of the invention utilised multiple categorisation mode. Generally, single categorisation mode gives better results for the Bayesian and SVM algorithms than multiple categorisation mode.
To implement the test each of FWS, Bayesian, and SVM were first trained on a training set comprised of web pages categorised into one of the four categories and then each method was tested against a testing set of web pages for which the correct categorisation is known.
The Bayesian and SVM algorithms were first significantly tuned (optimised) in accordance with known methods. FWS used the raw data without any tuning.
The training and testing sets contain raw HTML pages downloaded from the internet. The distribution of the web pages across the sets and categories is as follows:

                         Weapons    Chat    Nudity    Pornography
    Training Set Size    238        345     541       842
    Test Set Size        739        371     437       1801

The following table summarises the optimised performance results:

    Metrics                                     FWS         Bayesian     SVM
    Training time                               5.5 sec     8.19 sec     44.7 sec
    Average time for categorising a document    0.6 msec    0.48 msec    1.3 msec

The following table summarises the accuracy of the results of all three methods:

    Category       Algorithm    Valid match    Invalid match    Unclassified
    Weapons        FWS          88.6%          0%               11.4%
                   Bayesian     89.0%          1.3%             9.7%
                   SVM          77.9%          0.9%             21.2%
    Chat           FWS          90.8%          0.8%             9.4%
                   Bayesian     97.3%          0.5%             2.2%
                   SVM          85.6%          0.4%             14%
    Nudity         FWS          94.8%          0.5%             4.7%
                   Bayesian     95.9%          0.6%             3.7%
                   SVM          93.1%          0%               6.9%
    Pornography    FWS          87.9%          2.3%             9.8%
                   Bayesian     86.6%          5.5%             7.9%
                   SVM          79.5%          2.9%             17.6%

In other tests the FWS method was shown to be linearly scalable, and it has been used for other languages, where it has shown similar or better performance and accuracy of results.
Embodiments of the present invention have the following potential advantages: 1) An embodiment provides a fast, linearly scalable learning method that permits fast construction of a categorisation model from raw, average-to-low-quality input documents.
2) In contrast to the Bayesian and SVM methods, an embodiment is immune to low quality training sets.
3) In contrast to the Bayesian and SVM methods, an embodiment does not require pre-processing of the vocabularies or tuning to provide accurate categorisation.
4) The performance and accuracy of an embodiment are similar to, and sometimes better than, those of the Bayesian and SVM algorithms.
5) An embodiment provides consistent and stable results on new categories and languages while Bayesian and SVM require significant human intervention for preparation and tuning.
6) An embodiment uses a statistical analysis approach and can be utilised in other fields such as data research and data mining.
7) The complexity of implementation of an embodiment is minimal compared to other well-known text categorization techniques.
8) An embodiment produces a different pattern of categorisation results to existing methods, in that the valid/invalid matches and uncategorised results for a set of documents are likely to be different to any other method. The consequence of this is that the embodiment is suited for combination with existing methods to produce an improved categorisation for a new document. The combination of categorisation methods is described in patent application CATEGORISATION OF DATA USING MULTIPLE CATEGORISATION ENGINES.
While the present invention has been illustrated by the description of the embodiments thereof, and while the embodiments have been described in considerable detail, it is not the intention of the applicant to restrict or in any way limit the scope of the appended claims to such detail. Additional advantages and modifications will readily appear to those skilled in the art.
Therefore, the invention in its broader aspects is not limited to the specific details, representative apparatus and method, and illustrative examples shown and described. Accordingly, departures may be made from such details without departing from the spirit or scope of the applicant's general inventive concept.

Claims (30)

  1. A method for categorising an input data object using a model
    comprising a plurality of patterns associated with at least one weighting for a category, including the steps of: i) identifying patterns within the input data object that correspond to at least some of the patterns within the model; ii) for each identified pattern, determining a weighting for at least one category from the model; iii) calculating a score for the input data object for at least one category based at least in part on the weightings of the identified patterns and the frequency of the identified patterns within the input data object; and iv) categorising the input data object based at least in part on the calculated score.
  2. A method as claimed in claim 1 wherein at least some of the identified patterns are associated with a plurality of weightings for a plurality of categories.
  3. A method as claimed in any one of the preceding claims wherein a plurality of scores is calculated for the input data object for a plurality of categories.
  4. A method as claimed in any one of the preceding claims wherein the input data object is categorised in dependence on the calculated score only when the calculated score meets a predefined threshold.
  5. A method as claimed in claim 4 wherein the predefined threshold is an empirically-derived threshold.
  6. A method as claimed in any one of the preceding claims wherein the score s_j for a category c_j is calculated as follows: s_j = score(c_j) / |P|, where |P| is the number of identified patterns in the input data object, and score(c_j) is the sum of the weightings for all identified patterns associated with the category c_j.
  7. A method as claimed in any one of the preceding claims wherein the patterns are words.
  8. A method as claimed in any one of the preceding claims wherein the input data object is a document.
  9. A method as claimed in any one of the preceding claims wherein the input data object is a web page.
  10. A method as claimed in any one of the preceding claims wherein the model is generated in accordance with claim 14.
  11. A method as claimed in any one of the preceding claims wherein an artificial neural network is used to categorise the input data object using the calculated score.
  12. A method as claimed in claim 11 wherein an input set is used by the neural network to categorise the input data object, and wherein the input set comprises one or more of the calculated scores.
  13. A method as claimed in claim 12 wherein each calculated score is calculated as the sum of the weightings for the identified patterns in the input data object divided by the number of identified patterns in the input data object.
  14. 14. A method for generating a model for categorising input data objects, including the steps of: I) associating each data object of a plurality of data objects with one of a plurality of categories; ii) extracting a plurality of patterns from the plurality of data objects; iii) calculating a weighting for each pattern for at least one category based at least in part on the frequency of the pattern within data objects associated with that category compared to the frequency of the pattern within all data objects; and iv) inserting each weighting into the model.
  15. A method as claimed in claim 14 wherein the weighting is only inserted into the model if the weighting meets a predefined threshold.
  16. A method as claimed in claim 15 wherein the predefined threshold is an empirically-derived threshold.
  17. A method as claimed in any one of claims 14 to 16 wherein each data object is only associated with one category.
  18. A method as claimed in any one of claims 14 to 17 wherein a weighting is calculated for one or more of the patterns for a plurality of categories.
  19. A method as claimed in any one of claims 14 to 18 wherein the weighting w_i for a pattern w_i for a category c_j is calculated as follows: w_i = count(w_i, c_j) / count(w_i), where count(w_i, c_j) is the frequency of the pattern in all data objects associated with the category c_j; and count(w_i) is the frequency of the pattern in all data objects.
  20. A method as claimed in any one of claims 14 to 19 wherein the patterns are words.
  21. A method as claimed in any one of claims 14 to 20 wherein the data objects are documents.
  22. A method as claimed in any one of claims 14 to 21 wherein the data objects are web pages.
  23. A method as claimed in any one of claims 14 to 22 including the step of training an artificial neural network using the model.
  24. A method as claimed in claim 23 wherein the neural network is trained on a set of neural network patterns for each data object.
  25. A method as claimed in claim 24 wherein each neural network pattern includes an input set and an output set.
  26. A method as claimed in claim 25 wherein the input set includes a set of elements for each category, and wherein each element is based at least in part on the sum of the weightings for the patterns in the data object divided by the number of patterns in the data object.
  27. A system for categorising an input data object using a model, including: a processor arranged for generating a model for categorising input data objects by associating each data object of a plurality of data objects with one of a plurality of categories; extracting a plurality of patterns from the plurality of data objects; calculating a weighting for each pattern for at least one category in dependence on the frequency of each pattern within the data objects associated with that category compared to the frequency of the pattern within all data objects; and inserting each weighting into the model; a processor arranged for categorising the input data object using the model by identifying patterns within the input data object that correspond to at least some of the patterns within the model; for each identified pattern, determining a weighting for at least one category from the model; calculating a score for the input data object for at least one category based on the weightings of the identified patterns and the frequency of the identified patterns within the input data object; and categorising the input data object in dependence on the calculated score; and a memory arranged for storing the model.
  28. A system arranged for performing the method of any one of claims 1 to 26.
  29. A computer program arranged for performing the method or system of any one of the preceding claims.
  30. A storage medium arranged for storing a computer program as claimed in claim 29.
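Read together, claims 19, 6/13 and 27 describe a simple frequency-ratio classifier. The sketch below illustrates that scheme in Python; all function and variable names are illustrative rather than drawn from the patent, and the empirical threshold of claims 15-16 and the neural-network stage of claims 11-13 and 23-26 are omitted.

```python
from collections import Counter

def build_model(labelled_docs):
    """Claim 19: weighting w_i = count(w_i, c_j) / count(w_i), i.e. the
    frequency of pattern w_i within data objects of category c_j divided
    by its frequency across all data objects."""
    per_category = {}
    overall = Counter()
    for category, patterns in labelled_docs:
        counts = per_category.setdefault(category, Counter())
        for p in patterns:
            overall[p] += 1
            counts[p] += 1
    return {(p, c): per_category[c][p] / overall[p]
            for c in per_category for p in per_category[c]}

def score(patterns, model, category):
    """Claims 6 and 13: s_j = score(c_j) / |P| -- the summed weightings of
    the identified patterns divided by how many patterns were identified."""
    identified = [p for p in patterns if any(key[0] == p for key in model)]
    if not identified:
        return 0.0
    return sum(model.get((p, category), 0.0) for p in identified) / len(identified)

def categorise(patterns, model, categories):
    """Claim 27: assign the input data object to the best-scoring category."""
    return max(categories, key=lambda c: score(patterns, model, c))

# Toy corpus where patterns are words and data objects are documents
# (claims 7-9 and 20-22).
docs = [("sport", ["ball", "game", "team"]),
        ("news",  ["vote", "ball", "poll"])]
model = build_model(docs)
print(categorise(["ball", "game"], model, ["sport", "news"]))  # prints "sport"
```

With this corpus, "ball" appears once in each category and twice overall, so its weighting is 0.5 for both categories, while "game" weights 1.0 for "sport"; the input scores 0.75 for "sport" against 0.25 for "news".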
GB0624665A 2006-09-07 2006-12-11 Categorisation of data e.g. web pages using a model Withdrawn GB2442286A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/GB2007/003370 WO2008029150A1 (en) 2006-09-07 2007-09-07 Categorisation of data using a model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
UA200609646 2006-09-07

Publications (2)

Publication Number Publication Date
GB0624665D0 GB0624665D0 (en) 2007-01-17
GB2442286A true GB2442286A (en) 2008-04-02

Family

ID=37711888

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0624665A Withdrawn GB2442286A (en) 2006-09-07 2006-12-11 Categorisation of data e.g. web pages using a model

Country Status (1)

Country Link
GB (1) GB2442286A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2465041A4 (en) * 2009-08-13 2016-01-13 Alibaba Group Holding Ltd Method and system of web page content filtering
JP2014507716A (en) * 2011-01-25 2014-03-27 アリババ・グループ・ホールディング・リミテッド Identify classified misplacements
JP2016066376A (en) * 2011-01-25 2016-04-28 アリババ・グループ・ホールディング・リミテッドAlibaba Group Holding Limited Identifying categorized misplacement
EP2828771A4 (en) * 2012-03-22 2015-12-02 Alibaba Group Holding Ltd Method and apparatus of publishing information
US9160680B1 (en) 2014-11-18 2015-10-13 Kaspersky Lab Zao System and method for dynamic network resource categorization re-assignment
US9444765B2 (en) 2014-11-18 2016-09-13 AO Kaspersky Lab Dynamic categorization of network resources

Also Published As

Publication number Publication date
GB0624665D0 (en) 2007-01-17

Similar Documents

Publication Publication Date Title
CN108228541B (en) Method and device for generating document abstract
Matveeva et al. High accuracy retrieval with multiple nested ranker
US8805026B1 (en) Scoring items
CN108920488B (en) Multi-system combined natural language processing method and device
CN110543639A (en) english sentence simplification algorithm based on pre-training Transformer language model
US8150822B2 (en) On-line iterative multistage search engine with text categorization and supervised learning
CN108132927B (en) Keyword extraction method for combining graph structure and node association
US20080281764A1 (en) Machine Learning System
US20100205198A1 (en) Search query disambiguation
WO2011030752A1 (en) Word pair acquisition device, word pair acquisition method, and program
CN108009135B (en) Method and device for generating document abstract
CN110929166A (en) Content recommendation method, electronic device and storage medium
CN116134432A (en) System and method for providing answers to queries
Bifet et al. An analysis of factors used in search engine ranking.
Baygın Classification of text documents based on Naive Bayes using N-Gram features
Fons et al. Adaptive weighting scheme for automatic time-series data augmentation
Abdulkader et al. Low cost correction of OCR errors using learning in a multi-engine environment
CN111507093A (en) Text attack method and device based on similar dictionary and storage medium
CN116756303A (en) Automatic generation method and system for multi-topic text abstract
GB2442286A (en) Categorisation of data e.g. web pages using a model
Doetsch et al. Logistic model trees with auc split criterion for the kdd cup 2009 small challenge
JP2004334766A (en) Word classifying device, word classifying method and word classifying program
Tauhid et al. Sentiment analysis of indonesians response to influencer in social media
JP2008204374A (en) Cluster generating device and program
Trivedi et al. A study of ensemble based evolutionary classifiers for detecting unsolicited emails

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)