WO2008029156A1 - Categorisation of data using multiple categorisation engines - Google Patents

Categorisation of data using multiple categorisation engines

Info

Publication number
WO2008029156A1
Authority
WO
WIPO (PCT)
Prior art keywords
categorisation
engines
input data
data object
scores
Prior art date
Application number
PCT/GB2007/003384
Other languages
French (fr)
Inventor
Eric Zigmund Sandler
Yuriy Byurher
Original Assignee
Xploite Plc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GB0624666A external-priority patent/GB2442287A/en
Application filed by Xploite Plc filed Critical Xploite Plc
Publication of WO2008029156A1 publication Critical patent/WO2008029156A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes


Abstract

A method for categorising an input data object using a plurality of different categorisation engines, including the steps of: each categorisation engine calculating a score for the input data object for a plurality of categories; and categorising the input data object based at least in part on the calculated scores. The categorisation engines may include a Bayesian engine, a support vector machine, or a statistical engine.

Description

CATEGORISATION OF DATA USING MULTIPLE CATEGORISATION ENGINES
Field of the Invention
The present invention relates to a method and system for categorising data using multiple categorisation engines; particularly, but not exclusively, by combining the scores of the engines.
Background
Categorisation of content such as web pages is useful for searching for information and for filtering information.
Traditionally web pages have been categorised by collating categorisation suggestions from human users. An example of a system created by this method includes dmoz.org.
This method has several disadvantages. Firstly, the process overlooks some web pages or classes of web pages due to lack of user input for those classes. Secondly, multi-user input results in a lack of consistency of classification. And thirdly, the human time cost of classifying large sets of web pages, such as the majority of the Internet, is very high.
Automated methods for categorising web pages have been developed.
These methods include Bayesian algorithms, support vector machines (SVM), rule-based classifiers, and statistical classifiers.
Unfortunately these methods often do not produce accurate categorisation results. With the learning algorithms, such as Bayesian and SVM, significant effort is required to prepare the algorithms to classify a category by refining a vocabulary for the algorithms and refining the training set. The rule-based classifiers require significant human input to create rules, and they remain ineffective because it is nearly impossible to manually generate comprehensive rules.
Known statistical classifiers are very inaccurate, producing less than 50% accurate categorisation.
Therefore there is a need to enhance or boost the accuracy of these methods.
There are several known methods for enhancing the accuracy of a categorisation method.
The first is called classifier voting. Multiple classifiers are generated by varying the input parameters of a known classifier for a category. Each generated classifier is provided with the input data. The classifier categorises the input data as belonging to the category or not belonging to the category. These "votes" are then tabulated. If the votes exceed a defined threshold the input data is classified as belonging to the category.
A variation of this method weights the votes of each classifier based on how successful the classifier has been in the past.
The second method uses a combination of classifier voting with learning algorithms. Multiple instances of a learning algorithm are trained on different training sets to produce multiple classifiers. The multiple classifiers are then used in the classifier voting or weighted voting method.
The disadvantage with these methods is that they remain less accurate than manual categorisation and, in the case of learning algorithms, several training sets are required.
It is an object of the present invention to provide a method for improving the categorisation of data by using information from multiple categorisation engines, or to at least provide a useful alternative.
Summary of the Invention
According to a first aspect of the invention there is provided a method for categorising an input data object using a plurality of different categorisation engines, including the steps of: i) each categorisation engine calculating a score for the input data object for a plurality of categories; and ii) categorising the input data object based at least in part on the calculated scores.
The scores are preferably non-binary values. The scores may be combined or selected to categorise the input data object.
Preferably, three or more categorisation engines are used.
It is also preferred that at least some of the categorisation engines are learning engines. Each learning engine may be trained using feature vectors of the same type. The type of feature vectors used may be thematic feature vectors such as "bag of words".
Each learning engine may be trained on the same training set.
The input data object may be categorised on the basis of the calculated scores and a second set of calculated scores. The second set of calculated scores is preferably calculated by a second set of categorisation engines. Each engine in the second set is preferably a different categorisation engine. It is also preferred that the second set of engines is comprised of learning engines. The second set of engines may be trained on structural feature vectors.
The method may include a step of determining a weighting for each category by combining all the scores for that category and comparing the combination to the combined value of all the scores. The weighting may form the basis for categorising the input data object. The input data object may be categorised within a category if the weighting for that category meets a predefined threshold. The input data object may be categorised within a category if the weighting is equal, within an error margin, to the highest weighting.
The input data object may be categorised within a category if the score for that category meets a predefined threshold. The input data object may be categorised within a category if the score is equal, within an error margin, to the highest score.
It is preferred that one or more of the categorisation engines are selected from the set of fast word statistics algorithm, Bayesian algorithm, and support vector machine.
Preferably, a neural network uses the calculated scores to categorise the input data object. The neural network may be previously trained on at least one pattern comprising at least one set of scores and a set of categories; wherein the set of scores form the inputs for the neural network and the set of categories form the desired outputs. Each set of scores in a pattern may be calculated from a training set of documents by a categorisation engine.
Brief Description of the Drawings
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawing in which:
Figure 1 : shows a schematic diagram illustrating an embodiment of the invention.
Figure 2: shows a flow diagram illustrating one method of the invention.
Figure 3: shows a flow diagram illustrating a second method of the invention.
Figure 4: shows a flow diagram illustrating a third method of the invention.
Figure 5: shows a flow diagram illustrating a fourth method of the invention.
Detailed Description of the Preferred Embodiments
The present invention provides a method and system of categorising data using the scores of a number of different categorisation engines. A score for a number of categories is generated for the data by a number of different categorisation engines, such as a Bayesian engine, a support vector machine (SVM) and/or a statistical engine. The data is categorised in a category by one of the following methods (i) if a combined proportional score for the category meets a threshold, (ii) if the combined proportional category score meets a threshold and is equivalent to the highest combined proportional category score, (iii) if one score for the category meets a threshold, and (iv) if a score for the category meets a threshold and is equivalent to the highest category score.
A system for categorising an input document will now be described.
Figure 1 shows three categorisation engines 1, 2, and 3. Each categorisation engine is a different type of categorisation engine. For example, Engine 1 may be a Bayesian engine, Engine 2 may be an SVM, and Engine 3 may be a statistical engine. One statistical engine is described in the patent application CATEGORISATION OF DATA USING A MODEL.
The categorisation engines may be learning engines. The engines may be trained on the same training set.
In one embodiment there may be additional categorisation engines of the same type. For example, there may be two Bayesian engines. In this embodiment the engines of the same type may be trained on different training sets or each engine may be trained on different feature vectors. For example, one engine may be trained on word frequencies and the other engine may be trained on structural features. A structural feature categorisation engine is described in the patent application CATEGORISATION OF DATA USING STRUCTURAL ANALYSIS.
Each categorisation engine produces a list of categories and scores for each of those categories for the same input document 4.
A score processor 5 is also shown. The processor takes as input the list of categories and scores from each of the engines.
The scores may be normalised by each engine before being provided as input to the score processor 5.
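The description does not prescribe a particular normalisation scheme. As a purely illustrative sketch (in Python, chosen here only for brevity), each engine could rescale its raw category scores into the [0, 1] range by dividing by its largest raw score; the function name and the dictionary representation are assumptions for this example, not part of the patent.

```python
def normalise_scores(raw_scores):
    """Rescale one engine's raw category scores into [0, 1].

    raw_scores: dict mapping category -> non-negative raw score.
    Simple max-scaling is only one possible scheme; the description
    leaves the normalisation method open.
    """
    peak = max(raw_scores.values(), default=0.0)
    if peak == 0.0:
        return {category: 0.0 for category in raw_scores}
    return {category: score / peak for category, score in raw_scores.items()}
```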
The processor 5 then uses the scores to determine a list 6 of categories that the input document belongs to.
The list may contain many categories, one category or no categories.
It will be appreciated that instead of a document the system may be adapted for use with any data object such as an image file or data stream.
The processor 5 may determine the list 6 of categories by calculating a weighting for each category based on the scores generated by the engines and categorising the document in a category if the weighting of the category is the highest weighting and/or meets a threshold.
Alternatively, the processor 5 may determine the list 6 of categories by categorising the document in a category if a score for a category is the highest score and/or meets a threshold.
It will be appreciated that the processor 5 may determine the list 6 of categories using another method such as by providing the scores to a neural network. Five methods of how the processor 5 may determine a list 6 of categories from the scores of the categorisation engines will now be described.
Using the list 10 of categories and scores from each engine, a set P of pairs (c_i, s_i), where c_i is a category and s_i ∈ [0, 1] is a category score, can be created for the input document 4 in step 11 within Figures 2, 3, 4, and 5.
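For illustration, a minimal sketch of how the set P might be assembled from the engines' outputs. Python is used here only as an example language, and the data representation (one score dictionary per engine) is an assumption for the sketch rather than anything mandated by the description.

```python
from typing import Dict, List, Tuple

Category = str
PairSet = List[Tuple[Category, float]]  # the set P of (c_i, s_i) pairs

def build_pair_set(engine_outputs: List[Dict[Category, float]]) -> PairSet:
    """Collect every (category, score) pair reported by every engine.

    engine_outputs: one dict per categorisation engine, mapping each
    category c_i to its normalised score s_i in [0, 1].
    """
    pairs: PairSet = []
    for scores in engine_outputs:
        for category, score in scores.items():
            pairs.append((category, score))
    return pairs
```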
Referring to Figure 2, a first method (Boosting A) of categorising the input document will be described.
In step 12, for each unique category c_i a value p_i = sum(c_i, s_i) / sum(s_i) is calculated, where sum(c_i, s_i) is the sum of the scores s_i over all pairs in set P in which category c_i appears, and sum(s_i) is the sum of all scores s_i. The document is considered to belong to category c_i if p_i ≥ th1 in step 13, where th1 is a threshold value.
The threshold th1 may be predetermined by empirical methods. The threshold th1 may be 0.2.
The threshold value may change when different numbers of categories are used.
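A minimal sketch of Boosting A under the same assumed representation of set P as above; the function name and the default threshold of 0.2 (the example value mentioned in the text) are illustrative only.

```python
def boosting_a(pairs, th1=0.2):
    """Boosting A: keep categories whose proportional combined score meets th1.

    p_i = sum(c_i, s_i) / sum(s_i), i.e. the sum of scores for category c_i
    divided by the sum of all scores in P.
    """
    total = sum(score for _, score in pairs)
    if total == 0:
        return []
    categories = {category for category, _ in pairs}
    selected = []
    for category in categories:
        p = sum(score for c, score in pairs if c == category) / total
        if p >= th1:
            selected.append(category)
    return selected
```

In this scheme a category can pass the threshold even when no single engine is decisive, provided its share of the total score is large enough.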
Referring to Figure 3, a second method (Boosting B) of categorising the input document will be described.
In step 20, for each unique category c_i a value p_i = sum(c_i, s_i) / sum(s_i) is calculated, where sum(c_i, s_i) is the sum of the scores s_i over all pairs in set P in which category c_i appears, and sum(s_i) is the sum of all scores s_i. A value p' is then determined, equal to the maximum of all p_i. The document is considered to belong to category c_i if p_i ≥ th2 in step 21 and p_i = p' ± ε in step 22, where th2 is a threshold and ε is an error margin. The error margin may be a small margin such as 0.001.
The threshold th2 may be predetermined by empirical methods. The threshold th2 may be 0.2.
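A corresponding sketch of Boosting B, again under the assumed representation of P; the default threshold and error margin reflect the example values given above and are not mandated.

```python
def boosting_b(pairs, th2=0.2, epsilon=0.001):
    """Boosting B: the proportional score must meet th2 and also tie,
    within epsilon, with the highest proportional score p'."""
    total = sum(score for _, score in pairs)
    if total == 0:
        return []
    categories = {category for category, _ in pairs}
    p = {category: sum(score for c, score in pairs if c == category) / total
         for category in categories}
    p_max = max(p.values())
    return [category for category in categories
            if p[category] >= th2 and abs(p[category] - p_max) <= epsilon]
```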
Referring to Figure 4, a third method (Boosting C) of categorising the input document will be described.
The document is considered to belong to category c_i if its score s_i ≥ th3 in step 30, where th3 is a threshold value.
The threshold th3 may be predetermined by empirical methods. The threshold th3 may be equal to 0.7.
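Boosting C reduces to a per-score threshold test, as in this sketch (same assumptions as the earlier examples).

```python
def boosting_c(pairs, th3=0.7):
    """Boosting C: any single engine score s_i at or above th3 is enough
    to place the document in that category."""
    return sorted({category for category, score in pairs if score >= th3})
```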
Referring to Figure 5, a fourth method (Boosting D) of categorising the input document will be described.
A value s' is determined, equal to the maximum of all scores s_i. The document is considered to belong to category c_i if s_i ≥ th4 in step 40 and s_i = s' ± ε in step 41, where th4 is a threshold and ε is an error margin.
The threshold th4 may be empirically computed and may be equal to 0.2.
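Boosting D combines the per-score threshold with a tie against the maximum individual score s', as in this sketch (the helper name and default values are again only illustrative).

```python
def boosting_d(pairs, th4=0.2, epsilon=0.001):
    """Boosting D: a score must meet th4 and tie, within epsilon, with the
    highest individual score s' across all (category, score) pairs."""
    if not pairs:
        return []
    s_max = max(score for _, score in pairs)
    return sorted({category for category, score in pairs
                   if score >= th4 and abs(score - s_max) <= epsilon})
```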
A fifth method of categorising the input document will be described.
The scores of the categorization engines are used within the processor 5 to build an input for a trained artificial neural network (ANN) aggregator. The scores from each engine are used to build a corresponding part V_i of the input vector for the ANN aggregator. The size of each part V_i is equal to the number of categories. The element V_i,j of the input vector for the ANN aggregator is the score of category c_j calculated by categorization engine E_i.
The ANN aggregator is used to calculate an output vector OV. The size of the vector OV is equal to the number of categories. To determine the categories of a particular document, the following rule is used for each element OV_i: if OV_i meets a threshold value, the document is categorized within category c_i.
The ANN may be trained using the following process:
A set of patterns for the ANN is arranged using the trained categorization engines E1, E2, ... En. Each pattern is calculated for a particular document d ∈ D (where D is a training set of documents), and consists of the sets V1, V2, ... Vn and the set OV. The sizes of the sets V1 to Vn are equal to the number of categories. Each set V1 to Vn is arranged from the output of the corresponding categorization engine.
The size of the set OV is equal to the number of categories. The elements OV_i of the set OV may be calculated in accordance with the following formula:
OV_i = 1 if document d belongs to category c_i, and 0 otherwise.
The sets V1 to Vn are used as input for the ANN, with the set OV used as the desired output from the ANN.
The set of patterns is used to train the ANN.
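A sketch of how one training pattern might be assembled for the ANN aggregator, following the vector layout and the OV formula described above. The function name, the fixed category ordering, and the dictionary form of each engine's output are assumptions for illustration; any standard feed-forward network library could then be trained on the resulting (input, desired output) pairs.

```python
def build_ann_pattern(engine_scores, categories, true_categories):
    """Build one (input vector, desired output) training pattern for the ANN.

    engine_scores: one dict per trained engine E1..En, mapping each
        category to that engine's score for a particular training document d.
    categories: a fixed ordering of all categories (defines the vector layout).
    true_categories: the categories that document d actually belongs to.
    """
    # Input: parts V1..Vn concatenated; part Vi holds engine Ei's score for
    # every category, in the fixed category order (one element per category).
    input_vector = [scores.get(category, 0.0)
                    for scores in engine_scores
                    for category in categories]
    # Desired output OV: OV_i = 1 if d belongs to category c_i, 0 otherwise.
    desired_output = [1.0 if category in true_categories else 0.0
                      for category in categories]
    return input_vector, desired_output
```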
It will be appreciated that the methods and systems described could be implemented in hardware or in software. Where the methods or systems of the invention are implemented in software, any suitable programming language such as C++ or Java may be used. It will further be appreciated that data and/or processing involved within the methods and systems may be distributed across more than one computer system.
The methods Boosting A, B, C, and D have been used in the categorisation of actual web pages (HTML pages) to produce the following test results. Nine categories (chat & messaging, erotic, news, nudism & naturism, pornography, prostitution, shopping, software & downloads, weapons), which are typically utilized for blocking or filtering content on the internet for employees of an organisation or for minors, have been used for this test.
Three categorisation engines were used to provide scores for the categories for the methods. The three engines used were a Bayesian categorization algorithm, a Support Vector Machines (SVM) categorization algorithm, and a statistical algorithm detailed in the patent application CATEGORISATION OF DATA USING A MODEL.
To implement the test, each of the Bayesian, SVM, and statistical engines was first trained on a training set comprised of web pages categorised into one of the nine categories.
Each method of the invention and, for comparison purposes, each engine was tested against a test set comprising categorised web pages.
The training and testing sets contain raw HTML pages downloaded from the internet. The distribution of the web pages across the sets and categories is as follows:
[Table omitted in this text extraction: distribution of the web pages across the training and testing sets for each of the nine categories.]
The following table summarizes the accuracy results of the three categorization engines individually and the four methods (Boosting A, Boosting B, Boosting C, and Boosting D):
[Table omitted in this text extraction: per-category accuracy of the three engines individually and of Boosting A, B, C, and D.]
It can be seen from the results that the methods of the invention provide consistent accuracy over the categories.
Embodiments of the present invention have the following potential advantages:
• The accuracy of different categorization engines used together is better than the accuracy of the categorization engines used standalone.
• Each of the different categorization engines adds value to the composite accuracy.
• An increase in the size of the training set is not required, and a high-quality composite categorizer can be built using small training sets.
While the present invention has been illustrated by the description of the embodiments thereof, and while the embodiments have been described in considerable detail, it is not the intention of the applicant to restrict or in any way limit the scope of the appended claims to such detail. Additional advantages and modifications will readily appear to those skilled in the art. Therefore, the invention in its broader aspects is not limited to the specific details, representative apparatus and method, and illustrative examples shown and described. Accordingly, departures may be made from such details without departing from the spirit or scope of the applicant's general inventive concept.

Claims

1. A method for categorising an input data object using a plurality of different categorisation engines, including the steps of: i) each categorisation engine calculating a score for the input data object for a plurality of categories; and ii) categorising the input data object based at least in part on the calculated scores.
2. A method as claimed in claim 1 wherein the scores are non-binary values.
3. A method as claimed in any one of the preceding claims wherein the scores are combined to categorise the input data object.
4. A method as claimed in any one of the preceding claims wherein three or more categorisation engines are used.
5. A method as claimed in any one of the preceding claims wherein at least some of the categorisation engines are learning engines.
6. A method as claimed in claim 5 wherein each learning engine is trained using feature vectors of the same type.
7. A method as claimed in claim 6 wherein the type of feature vectors used are thematic feature vectors.
8. A method as claimed in claim 7 wherein the thematic feature vectors used are "bag of words".
9. A method as claimed in any one of claims 6 to 8 wherein each learning engine is trained on the same training set.
10. A method as claimed in any one of the preceding claims wherein the input data object is categorised on the basis of the calculated scores and a second set of calculated scores.
11. A method as claimed in claim 10 wherein the second set of calculated scores are calculated by a second set of categorisation engines.
12. A method as claimed in claim 11 wherein each engine in the second set is a different categorisation engine.
13. A method as claimed in any one of claims 11 to 12 wherein the second set of engines are learning engines.
14. A method as claimed in claim 13 wherein the second set of engines are trained on structural feature vectors.
15. A method as claimed in any one of the preceding claims including the step of determining a weighting for each category by combining all the scores for that category and comparing the combination to the combined value of all the scores.
16. A method as claimed in claim 15 wherein the weighting forms the basis for categorising the input data object.
17. A method as claimed in any one of claims 15 to 16 wherein the input data object is categorised within a category if the weighting for that category meets a predefined threshold.
18. A method as claimed in any one of claims 15 to 17 wherein the input data object is categorised within a category if the weighting is equal, within an error margin, to the highest weighting.
19. A method as claimed in any one of claims 1 to 15 wherein the input data object is categorised within a category if the score for that category meets a predefined threshold.
20. A method as claimed in any one of claims 1 to 15 and 19 wherein the input data object is categorised within a category if the score is equal, within an error margin, to the highest score.
21. A method as claimed in any one of the preceding claims wherein one or more of the categorisation engines are selected from the set of fast word statistics algorithm, Bayesian algorithm, and support vector machine.
22. A method as claimed in any one of the preceding claims wherein a neural network uses the calculated scores to categorise the input data object.
23. A method as claimed in claim 22 wherein the neural network is previously trained on at least one pattern comprising at least one set of scores and a set of categories; wherein the set of scores form the inputs for the neural network and the set of categories form the desired outputs.
24. A method as claimed in claim 23 wherein each set of scores in a pattern are calculated from a training set of documents by a categorisation engine.
25. A system for categorising an input data object, including: a plurality of different categorisation engines, each categorisation engine arranged for calculating a score for the input data object for a plurality of categories; and a processor arranged for categorising the input data object based at least in part on the calculated scores of the categorisation engines.
26. A system arranged for performing the method of any one of claims 1 to 24.
27. A computer program arranged for performing the method or system of any one of the preceding claims.
28. Storage media arranged for storing a computer program as claimed in claim 27.
PCT/GB2007/003384 2006-09-07 2007-09-07 Categorisation of data using multiple categorisation engines WO2008029156A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
UAA200609647 2006-09-07
UA200609647 2006-09-07
GB0624666A GB2442287A (en) 2006-09-07 2006-12-11 Categorisation of data using multiple categorisation engines
GB0624666.4 2006-12-11

Publications (1)

Publication Number Publication Date
WO2008029156A1 (en) 2008-03-13

Family

ID=38736054

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2007/003384 WO2008029156A1 (en) 2006-09-07 2007-09-07 Categorisation of data using multiple categorisation engines

Country Status (1)

Country Link
WO (1) WO2008029156A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390094A (en) * 2018-04-20 2019-10-29 伊姆西Ip控股有限责任公司 Method, electronic equipment and the computer program product classified to document

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1626356A2 (en) * 2004-08-13 2006-02-15 Microsoft Corporation Method and system for summarizing a document

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1626356A2 (en) * 2004-08-13 2006-02-15 Microsoft Corporation Method and system for summarizing a document

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SEBASTIANI F: "Machine Learning in Automated Text Categorization", ACM COMPUTING SURVEYS, ACM, NEW YORK, NY, US, vol. 34, no. 1, March 2002 (2002-03-01), pages 1 - 47, XP002280034, ISSN: 0360-0300 *
TSYMBAL A ET AL: "Handling Local Concept Drift with Dynamic Integration of Classifiers: Domain of Antibiotic Resistance in Nosocomial Infections", COMPUTER-BASED MEDICAL SYSTEMS, 2006. CBMS 2006. 19TH IEEE INTERNATIONAL SYMPOSIUM ON SALT LAKE CITY, UT, USA 22-23 JUNE 2006, PISCATAWAY, NJ, USA,IEEE, 22 June 2006 (2006-06-22), pages 679 - 684, XP010923993, ISBN: 0-7695-2517-1 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390094A (en) * 2018-04-20 2019-10-29 伊姆西Ip控股有限责任公司 Method, electronic equipment and the computer program product classified to document
CN110390094B (en) * 2018-04-20 2023-05-23 伊姆西Ip控股有限责任公司 Method, electronic device and computer program product for classifying documents

Similar Documents

Publication Publication Date Title
CN111177374B (en) Question-answer corpus emotion classification method and system based on active learning
CN104750844B (en) Text eigenvector based on TF-IGM generates method and apparatus and file classification method and device
CN104391835B (en) Feature Words system of selection and device in text
CN105468713B (en) A kind of short text classification method of multi-model fusion
CN110287328B (en) Text classification method, device and equipment and computer readable storage medium
CN107908715A (en) Microblog emotional polarity discriminating method based on Adaboost and grader Weighted Fusion
CN106156163B (en) Text classification method and device
CN109271520B (en) Data extraction method, data extraction device, storage medium, and electronic apparatus
CN111538828B (en) Text emotion analysis method, text emotion analysis device, computer device, and readable storage medium
CN108052505A (en) Text emotion analysis method and device, storage medium, terminal
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
CN107180084A (en) Word library updating method and device
CN105930416A (en) Visualization processing method and system of user feedback information
CN110825850B (en) Natural language theme classification method and device
CN109858034A (en) A kind of text sentiment classification method based on attention model and sentiment dictionary
CN109598307A (en) Data screening method, apparatus, server and storage medium
CN109960791A (en) Judge the method and storage medium, terminal of text emotion
CN108268470A (en) A kind of comment text classification extracting method based on the cluster that develops
CN114564582A (en) Short text classification method, device, equipment and storage medium
CN114896392A (en) Work order data clustering method and device, electronic equipment and storage medium
CN103514168B (en) Data processing method and device
Drishya et al. Cyberbully image and text detection using convolutional neural networks
WO2023082698A1 (en) Public satisfaction analysis method, storage medium, and electronic device
WO2008029156A1 (en) Categorisation of data using multiple categorisation engines
GB2442286A (en) Categorisation of data e.g. web pages using a model

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07804184

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07804184

Country of ref document: EP

Kind code of ref document: A1