GB2441598A

GB2441598A - Categorisation of Data using Structural Analysis

Info

Publication number: GB2441598A
Application number: GB0624667A
Authority: GB
Inventors: Taras Svirskyi; Glib Alieksieiev
Original assignee: FUJIN TECHNOLOGY PLC; XPLOITE PLC
Current assignee: FUJIN TECHNOLOGY PLC; XPLOITE PLC
Priority date: 2006-09-07
Filing date: 2006-12-11
Publication date: 2008-03-12
Also published as: GB0624667D0

Abstract

A method and system for categorising an input data object 10, such as a file or data stream, includes the steps of: generating a signature 7, 8 for each of a plurality of categories C1, C2; analysing the structure of the input data object, preferably by means of an analyzer 1, using the signatures; and categorising the input data object based at least in part on the analysis. Preferably, features 9 are extracted from the input data object and processed by a learning engine 11, such as a Bayesian algorithm or support vector machine, or by a rules engine that compares the extracted features with the values of the features of the signatures. Preferably, the newly extracted features can be used by the categorisation engines 11 to train on training sets 2 of thematically generic documents, so that new documents can be categorised based on structure rather than theme. The new data objects could include news websites, shopping websites, software downloads, newsgroups, streaming media and search engines.

Description


CATEGORISATION OF DATA USING STRUCTURAL ANALYSIS

Field of Invention

The present invention relates to a method and system for categorising data by structurally analysing the data using a signature for a category.

Background

Categorisation of content such as web pages is useful for searching for information and for filtering information.

Traditionally, web pages have been categorised by collating categorisation suggestions from human users. An example of a system created by this method is dmoz.org.

This method has several disadvantages. Firstly, the process overlooks some web pages or classes of web pages due to lack of user input for those classes. Secondly, multi-user input results in a lack of consistency of classification. And thirdly, the human time cost of classifying large sets of web pages, such as the majority of the Internet, is very high.

Automated methods for categorising web pages have been explored. Two popular methods are the use of Bayesian algorithms and the use of support vector machines (SVM).

Each method is trained on a training set of web pages which have been classified within categories.

The methods extract feature vectors from the training set. Feature vectors are attributes that are common to a category. Feature vectors are almost always words or phrases, but can also include formatting strings.


One disadvantage with these methods is that they can be unsuccessful in classifying new input data based on their training when the input data contains few identifying feature vectors, such as web pages with little text.

One way of ameliorating this disadvantage is by analysing the links within the web page.

Analysis of the links includes evaluating the number of external links, the number of links directed to the page (as used within Google's PageRank), terms extracted from linked documents, and text surrounding or describing the link.

However, links only provide one aspect to assist categorisation and there is a need to improve categorisation of thematically generic sites such as shopping and news sites.

It is an object of the present invention to provide a method for categorising data by structural analysis which overcomes the disadvantages of above methods, or to at least provide a useful alternative.

Summary of the Invention

According to a first aspect of the invention there is provided a method for categorising an input data object including the steps of: i) generating a signature for each of a plurality of categories; ii) analysing the structure of the input data object using the signatures; and iii) categorising the input data object based at least in part on the analysis.

Preferably, the input data object is categorised by at least one categorisation engine and the categorisation engine is a learning engine.

The signatures may include features that are specific to that category.

The signatures may be generated with the assistance of an inductive logic programming module.

Preferably, the step of generating a signature for a plurality of categories includes the sub-steps of: associating each data object of a plurality of data objects with a category; and calculating structural features for each category from the data objects associated with that category to form the signature for that category. Inductive logic programming may be used to calculate the structural features.

It is preferred that the method includes the step of training the categorisation engine using the signatures.

Preferably, the signatures include structural features and the structural features include one or more of functional features, usage of keywords, visual layout, groupings of data, and/or patterns.

Brief Description of the Drawings

Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawing, in which:

Figure 1 shows a schematic diagram illustrating an embodiment of the invention.

Detailed Description of the Preferred Embodiments

The present invention provides a method and system for categorising an input data object by analysing the structure of the input data object using a signature for each category.

The present invention will now be described in reference to a document as the input data object. However, it will be appreciated that the input data object could be any data, such as a file or data stream.

A set of structural features are extracted from a training set of documents which have previously been associated with categories. A signature for each category is generated based on the extracted structural features which are associated with that category. A learning engine is trained on the signatures.
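The signature-generation step can be sketched as follows. This is a minimal illustration only, assuming that each training document has already been reduced to a dictionary of numeric structural features. The averaging rule and the names `build_signatures` and `training_set` are hypothetical; the description does not prescribe how the features of a category are aggregated.

```python
from collections import defaultdict

def build_signatures(training_set):
    """Return {category: {feature: mean value}} from (features, category) pairs.

    Averaging is an assumed aggregation rule, chosen only for illustration.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for features, category in training_set:
        counts[category] += 1
        for name, value in features.items():
            sums[category][name] += value
    return {
        category: {name: total / counts[category] for name, total in feats.items()}
        for category, feats in sums.items()
    }

# Hypothetical training data: the feature values are made up for the example.
training_set = [
    ({"price_patterns": 12, "password_fields": 0}, "shopping"),
    ({"price_patterns": 8, "password_fields": 0}, "shopping"),
    ({"price_patterns": 0, "password_fields": 1}, "forum"),
]
signatures = build_signatures(training_set)
```

A feature that is absent from a document contributes nothing to the sum but still counts towards the divisor, so it is treated as zero for that document.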

An uncategorised document is analysed to extract its structural features. The trained learning engine uses these extracted features to categorise this document.

Figure 1 shows a structural analyzer 1.

The structural analyzer 1 accepts as input a training set 2 containing a set 3 of documents categorised in one category 4 and a set 5 of documents categorised in a second category 6.

The invention may be adapted for use with any number of categories. In an alternative embodiment a document in the training set may be categorised in more than one category.

The structural analyzer 1 determines a structural signature 7 and 8 for each category based upon the documents in the training set 2. Each signature 7 and 8 includes structural features that correlate to the category. The features can include functional features such as a password field, keyword usage such as prominent (for example large font size) use of the word "news", visual layout such as a particular frame layout, groupings of data such as images grouped together which could indicate screenshots, and patterns such as a price pattern.
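As one illustration of a pattern feature, a price pattern could be counted with a regular expression. The sketch below assumes that a price is a number with a currency symbol directly before or after it. The symbol set, the number format and the name `count_price_patterns` are assumptions made for the example and are not taken from the description.

```python
import re

# Assumed pattern: a number with a currency symbol either before or after it.
# The symbol set is illustrative, not taken from the specification.
CURRENCY = r"[$£€¥]"
NUMBER = r"\d+(?:[.,]\d{2})?"
PRICE = re.compile(rf"{CURRENCY}\s?{NUMBER}|{NUMBER}\s?{CURRENCY}")

def count_price_patterns(text):
    """Count non-overlapping price-like substrings in plain text."""
    return len(PRICE.findall(text))

sample = "Kettle $19.99, toaster 25 €, free delivery on 3 items"
n = count_price_patterns(sample)  # counts "$19.99" and "25 €"
```

The resulting count could serve as one numeric feature of a document. Thousands separators and currency names written as words are deliberately left out of this sketch.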

The structural analyzer 1 may utilise an inductive logic programming (ILP) module to produce or assist with the production of the signatures 7 and 8.

In one embodiment the structural analyzer 1 may be assisted by a human user.

A feature extractor 9 is shown. The feature extractor 9 utilises the signatures 7 and 8 to determine which structural features are to be extracted from an input document 10.

A categorisation engine 11 is also shown. The categorisation engine may be a learning engine such as a Bayesian algorithm or a support vector machine (SVM).

The categorisation engine 11 utilises the category signatures 7 and 8 to produce a list 12 of categories that it considers the input document 10 belongs to. Where the categorisation engine 11 is a learning engine it may utilise the signature by being trained 13 on the signatures 7 and 8.

In one embodiment the categorisation engine is a rules engine that compares the features of the input document with the values of the features of the signatures.
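A rules engine of this kind might look like the following sketch. It assumes that a signature maps feature names to minimum values and that a category matches when at least half of those minimums are met. The threshold rule, the `min_fraction` parameter and all names are hypothetical and serve only to illustrate comparing document features with signature values.

```python
def categorise(doc_features, signatures, min_fraction=0.5):
    """Return the categories whose signature the document satisfies.

    Assumed rule: a signature maps feature names to minimum values, and a
    category matches when at least `min_fraction` of those minimums are met.
    """
    matches = []
    for category, signature in signatures.items():
        met = sum(
            1 for name, minimum in signature.items()
            if doc_features.get(name, 0) >= minimum
        )
        if signature and met / len(signature) >= min_fraction:
            matches.append(category)
    return matches

# Hypothetical signatures and document features.
signatures = {
    "shopping": {"price_patterns": 3, "basket_keywords": 1},
    "forum": {"password_fields": 1, "reply_links": 5},
}
doc = {"price_patterns": 7, "basket_keywords": 2, "password_fields": 0}
result = categorise(doc, signatures)
```

The function returns a list because a document may satisfy the signatures of more than one category.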

Structural features that may be identified by the structural analyzer 1 and incorporated into the category signatures 7 and 8 include generic features that reflect some aspect of web page design (structure) and category-specific features; for example, a price pattern in an HTML document.
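A few of the generic structural features given in this description (links within and outside the domain, the HTML tags histogram and the GUID pattern) can be sketched with the Python standard library. The class name `GenericFeatures`, the sample page and the domain `example.com` are hypothetical. Treating relative links as internal, and requiring an exact host match for internal links, are assumptions of this sketch.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

# GUID pattern as given in the description, with a non-capturing group.
GUID = re.compile(r"[0-9a-fA-F]{8}-?(?:[0-9a-fA-F]{4}-?){3}[0-9a-fA-F]{12}")

class GenericFeatures(HTMLParser):
    """Collect a tag histogram and internal/external link counts."""

    def __init__(self, domain):
        super().__init__()
        self.domain = domain
        self.tags = Counter()
        self.internal_links = 0
        self.external_links = 0

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        href = dict(attrs).get("href")
        if tag == "a" and href:
            host = urlparse(href).netloc
            # Assumption: relative links (empty host) count as internal.
            if host in ("", self.domain):
                self.internal_links += 1
            else:
                self.external_links += 1

    def tag_histogram(self):
        """Share of each tag among all start tags on the page."""
        total = sum(self.tags.values())
        return {tag: count / total for tag, count in self.tags.items()} if total else {}

html = (
    '<p><a href="/about">About</a> <a href="http://example.com/news">News</a> '
    '<a href="http://other.example.org/">Other</a></p>'
    '<p>id=3F2504E0-4F89-11D3-9A0C-0305E82C3301</p>'
)
features = GenericFeatures("example.com")
features.feed(html)
histogram = features.tag_histogram()
has_guid = GUID.search(html) is not None
```

The links ratio then follows as `features.internal_links / features.external_links`, provided the page has at least one external link.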

Documents are a combination of information and structure. Information in general is the text present in the document. Structure may be a way of presenting text. For example, in a case where the text is about guns, the category of information of the text (or the theme of the text) is 'guns'. This text can be placed within a plain-text web page document. The text can also be within a message in a web forum. The text can be arranged as a book and sold through an on-line shop. In all these cases there is a document containing text about guns. The only distinguishing feature between these cases is the page structure or document structure. By analyzing specific structural features, a structural signature for each of a forum, on-line shop, and raw text can be created and those different document structures can be differentiated.

Online forums and online shops can be about any subject (contain any theme). There are forums about guns, forums about sport, and forums about sex. Web pages from these forums can contain any text. Therefore text analysis to determine whether a web page belongs to a forum type can fail. In this case additional features are required which are not extracted from raw text (for example, the existence of a login box or the presence of a table of messages). All these additional features can be provided by analysis of the structure of the documents. These new features can then be used by categorisation engines to train on training sets of thematically generic documents, to be able to categorise new documents on the basis of structure instead of theme.
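The existence of a login box is one such structural feature. The sketch below counts `<input>` elements by type using the standard `html.parser` module and treats the presence of a password field as evidence of a login box. That heuristic, and the names `FormFieldCounter` and `has_login_box`, are assumptions made for illustration.

```python
from html.parser import HTMLParser

class FormFieldCounter(HTMLParser):
    """Count <input> elements by their type attribute."""

    def __init__(self):
        super().__init__()
        self.input_types = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            # In HTML, an <input> without a type attribute is a text field.
            input_type = (dict(attrs).get("type") or "text").lower()
            self.input_types[input_type] = self.input_types.get(input_type, 0) + 1

def has_login_box(html):
    """Assumed heuristic: a password field implies a login box."""
    parser = FormFieldCounter()
    parser.feed(html)
    return parser.input_types.get("password", 0) > 0

page = '<form><input type="text" name="user"><input type="password" name="pw"></form>'
found = has_login_box(page)
```

The same per-type counts could also feed the "input form size" feature, which is computed separately for each form type.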

Generic structural features may include:

* Number of links to pages within the domain
* Number of links to pages outside the domain
* Links ratio - links to within the domain / links to outside the domain
* Link context size - length of the closest block that the link is in (count average on page or number of short/long)
* Average link text size - computed by dividing the combined text size of all links by the total text size on the web page
* Scripts size ratio - computed by dividing the total size within the <SCRIPT> tags by the total size within all tags on the web page
* Average script size - computed as an average and maximum script size on the web page
* Input form size - computed separately for each type of form. Form types include text, password, checkbox, radio, submit, reset, file, hidden, image, and button
* GUID pattern - identify the existence of a GUID pattern within the web page by searching for a sequence of symbols of the following type: [0-9a-fA-F]{8}[-]?([0-9a-fA-F]{4}[-]?){3}[0-9a-fA-F]{12}
* Text size between successive tags - calculate maximum, average and standard deviation for the "plain text" on the web page
* Maximum and average number of characters between successive tags
* HTML tags histogram - computed by dividing the amount of each tag by the total amount of all tags on the page
* Separation of various portions of the web page such as anchor text, headings, table headings, "sponsored link" and "advertisements", and text analysis of each portion identified with its portion
* Anchor text ratio - the ratio of anchor text characters to total text characters in an HTML document. The total text of the document is defined as the content of all DOM text nodes in the document, and the anchor text is the content of the DOM text nodes that are children of the anchor (<A>) elements
* Text per table data tag (TTD) - the average number of text characters within a table data tag (<TD>) in an HTML document. TTD is computed by dividing the total number of characters in DOM text nodes which are children of table data (<TD>) tags by the total number of table data tags in the document
* Image frequency
* Image size consistency
* Image-within-link frequency

Category specific structural features for a "news" category may include:

* Existence of an RSS (Really Simple Syndication) feed
* Existence of a comment/reply capability
* Existence of a print version "button"
* Size and count of continuous text blocks
* Existence of polling on a web page
* Number of date patterns in text
* Number of time patterns in text (such as "2hrs ago")
* Existence of an archive calendar (a table with a continuous number of years or months)
* Existence of "related articles" with a list of links
* Existence of the following items as a sequence in a menu or as links on the page - "business", "world", "hi-tech", "technology", "finance"
* Existence of the following items as keywords on the page - "today", "hot news", "hot topics", "hot stories", "top articles", "breaking news", "latest news", "culture", "sport", "tech", "politics"
* Date pattern within the web page URL, such as lenta.ru/news/2006/07/27/corrupt/

Category specific structural features for a "shopping" category may include:

* Price pattern within the plain text. For example: [{currency}] number [?(l){currency}], where currency is, for example, $ or yen
* Price pattern within anchor text (between <a></a> tags)
* The following keywords (text):
  o price, basket, cart, order, trolley, product
  o shopping bag
  o order now
  o add to {basket|cart|trolley|shopping bag}
  o buy
  o buy now
  when used within the following tags:
  o <input class={text} ...>
  o <input value={text} ...>
  o <td class={text} ...>
  o <img src={text} alt={text} ...>
* Existence of a "basket" entity
* Existence of the following items in a menu - "our product", "catalog", "internet store", "delivery"
* "Консультант" ("Consultant") - as a special block on the web page (present on many Russian e-shops)
* "email + icq + name" listing as a separate block

Category specific structural features for a "software downloads" category may include:

* Links to binary files (such as zip, tar.gz, tar.bz2, exe)
* The following pattern "file size: <number> Mb(Kb)"
* The following pattern "requirements: <list of OSes or hardware items>". For example:
  "System requirements:
  o 98 / Me / 2000 / XP
  o Microsoft DirectX 8.1
  o Pentium II 366 MHz (Pentium III 600 MHz recommended)
  o 64 Mb RAM (128 Mb recommended)
  o Riva TNT, 8 Mb (GeForce256, 32 Mb recommended)"
* Inclusion of a software license (freeware, shareware, trial, demo)
* Existence of screenshots
* The following pattern "downloads: <number>"
* The following pattern "version - [v.|ver.] (digit*\.)* [alpha|beta|rc1|rc2]"
* Grouping the above features in small blocks

Category specific structural features for a "forums/newsgroups" category may include:

* Usage of a newsgroup-facilitating engine such as phpbb
* Site mostly in plain text with the following text in links - "next thread", "prev. thread", "reply", "sort by (thread, date)"
* Trees of links
* Keywords "thread", "message", "post", "reply" within anchor text or within URLs
* Links starting with "Re:"

Category specific structural features for a "streaming media" category may include:

* "Listen/watch" links
* The following pattern "<number> kbps"
* Blocks of alphabetically-sorted single-letter links (common to MP3 sites)
* Tables including a song title, a duration pattern, a file size pattern and a download link
* Links ending with a year (for example "Abbey Road - 1969")

Category specific structural features for a "search engines/portals" category may include:

* A form with a text field and "search" button (such as <input type="submit" value="search">)
* Very little plain text with most text in links
* Numerous links with a common set of categories: Business, Entertainment, Computers/Internet

It will be appreciated that the methods and systems described could be implemented in hardware or in software. Where the method or systems of the invention are implemented in software, any suitable programming language such as C++ or Java may be used. It will further be appreciated that data and/or processing involved within the methods and systems may be distributed across more than one computer system.

One potential advantage of embodiments of the present invention is the greater efficacy of categorising thematically neutral documents such as news websites, shopping websites, software downloads, newsgroups, streaming media, and search engines.

Another potential advantage of an embodiment of the present invention is that it provides another angle of categorisation because it identifies new features within the data. This is useful when combined with the scores generated by other categorisation engines (such as text-based classifiers) for categorising "difficult to categorise" documents. Combination of categorisation scores is described in patent application CATEGORISATION OF DATA USING MULTIPLE CATEGORISATION ENGINES.

While the present invention has been illustrated by the description of the embodiments thereof, and while the embodiments have been described in considerable detail, it is not the intention of the applicant to restrict or in any way limit the scope of the appended claims to such detail. Additional advantages and modifications will readily appear to those skilled in the art.

Therefore, the invention in its broader aspects is not limited to the specific details, representative apparatus and method, and illustrative examples shown and described. Accordingly, departures may be made from such details without departure from the spirit or scope of applicant's general inventive concept.

Claims

1. A method for categorising an input data object, including the steps of: i) generating a signature for each of a plurality of categories; ii) analysing the structure of the input data object using the signatures; and iii) categorising the input data object based at least in part on the analysis.
2. A method as claimed in claim 1 wherein the input data object is categorised in step iii) using at least one categorisation engine.
3. A method as claimed in any one of the preceding claims wherein the categorisation engine is a learning engine.
4. A method as claimed in any one of the preceding claims wherein the signatures include features that are category-specific.
5. A method as claimed in any one of the preceding claims wherein the signatures are generated with the assistance of an inductive logic programming module.
6. A method as claimed in any one of the preceding claims wherein the step of generating the signature includes the sub-steps of: associating each data object of a plurality of data objects with a category; and calculating structural features for each category from the data objects associated with that category to form the signature.
7. A method as claimed in claim 6 wherein inductive logic programming is used to calculate the structural features.
8. A method as claimed in any one of the preceding claims including the step of training the categorisation engine using the signatures.
9. A method as claimed in any one of the preceding claims wherein the signatures include structural features and the structural features include functional features.
10. A method as claimed in any one of the preceding claims wherein the signatures include structural features and the structural features include usage of keywords.
11. A method as claimed in any one of the preceding claims wherein the signatures include structural features and the structural features include visual layout.
12. A method as claimed in any one of the preceding claims wherein the signatures include structural features and the structural features include groupings of data.
13. A method as claimed in any one of the preceding claims wherein the signatures include structural features and the structural features include patterns.
14. A system for categorising input data objects, including: a memory arranged for storing a signature for each of a plurality of categories; and a processor arranged for analysing the structure of an input data object using at least one of the signatures and categorising the input data object based at least in part on the analysis.
15. A computer program arranged for effecting the method or system of any one of the preceding claims.
16. Storage media arranged for storing a computer program as claimed in claim 15.