GB2441598A - Categorisation of Data using Structural Analysis - Google Patents

Publication number
GB2441598A
Authority
GB
United Kingdom
Prior art keywords
method
structural features
signatures
data object
preceding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0624667A
Other versions
GB0624667D0 (en
Inventor
Taras Svirskyi
Glib Alieksieiev
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
FUJIN TECHNOLOGY PLC
XPLOITE PLC
Original Assignee
Fujin Technology Plc
Xploite Plc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to UA200609643 priority Critical
Application filed by Fujin Technology Plc, Xploite Plc filed Critical Fujin Technology Plc
Publication of GB0624667D0 publication Critical patent/GB0624667D0/en
Priority claimed from PCT/GB2007/003378 external-priority patent/WO2008029153A1/en
Publication of GB2441598A publication Critical patent/GB2441598A/en
Application status is Withdrawn legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/353 - Clustering; Classification into predefined classes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 - Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/953 - Querying, e.g. by the use of web search engines
    • G06F16/9535 - Search customisation based on user profiles and personalisation

Abstract

A method/system for categorising an input data object 10, such as a file or data stream, includes the steps of: generating a signature 7, 8 for each of a plurality of categories C1, C2; analysing the structure of the input data object, preferably by means of an analyzer 1, using the signatures; and categorising the input data object based at least in part on the analysis. Preferably, features are extracted 9 from the input data object and processed by a categorisation engine 11, which may be a learning engine such as a Bayesian algorithm or a support vector machine, or a rules engine that compares the extracted features with the values of the features of the signatures. Preferably, the extracted features can be used by the categorisation engines 11 to train on training sets 2 of thematically generic documents, so that new documents can be categorised on the basis of structure rather than theme. The input data object could come from news websites, shopping websites, software downloads, newsgroups, streaming media and search engines.

Description

CATEGORISATION OF DATA USING STRUCTURAL ANALYSIS

Field of Invention

The present invention relates to a method and system for categorising data by structurally analysing the data using a signature for a category.

Background

Categorisation of content such as web pages is useful for searching for information and for filtering information.

Traditionally web pages have been categorised by collating categorisation suggestions from human users. An example of a system created by this method includes dmoz.org.

This method has several disadvantages. Firstly, the process overlooks some web pages or classes of web pages due to lack of user input for those classes. Secondly, multi-user input results in a lack of consistency of classification. And thirdly, the human time cost of classifying large sets of web pages, such as the majority of the Internet, is very high.

Automated methods for categorising web pages have been explored. Two popular methods are the use of Bayesian algorithms and the use of support vector machines (SVM).

Each method is trained on a training set of web pages which have been classified within categories.

The methods extract feature vectors from the training set. Feature vectors are attributes that are common to a category. Feature vectors are almost always words or phrases, but can also include formatting strings.

One disadvantage with these methods is that they can be unsuccessful in classifying new input data based on their training when the input data contains few identifying feature vectors, such as web pages with little text.

One way of ameliorating this disadvantage is by analysing the links within the web page.

Analysis of the links includes evaluating the number of external links, the number of links directed to the page (as used within Google™'s PageRank™), terms extracted from linked documents, and text surrounding or describing the link.

However, links only provide one aspect to assist categorisation and there is a need to improve categorisation of thematically generic sites such as shopping and news sites.

It is an object of the present invention to provide a method for categorising data by structural analysis which overcomes the disadvantages of above methods, or to at least provide a useful alternative.

Summary of the Invention

According to a first aspect of the invention there is provided a method for categorising an input data object including the steps of: i) generating a signature for each of a plurality of categories; ii) analysing the structure of the input data object using the signatures; and iii) categorising the input data object based at least in part on the analysis.

Preferably, the input data object is categorised by at least one categorisation engine and the categorisation engine is a learning engine.

The signatures may include features that are specific to that category.

The signatures may be generated with the assistance of an inductive logic programming module.

Preferably, the step of generating a signature for a plurality of categories includes the sub-steps of: associating each data object of a plurality of data objects with a category; and calculating structural features for each category from the data objects associated with that category to form the signature for that category. Inductive logic programming may be used to calculate the structural features.
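The two sub-steps above can be sketched as follows. This is a minimal illustration only: it assumes each data object has already been reduced to a dictionary of numeric structural features, and the function name `build_signatures` and the use of a per-category mean are assumptions made for the example, not something the specification prescribes.

```python
from collections import defaultdict
from statistics import mean


def build_signatures(labelled_objects):
    """Return {category: {feature: mean value}} from (category, features) pairs.

    Averaging is an assumed, illustrative way to calculate structural
    features for each category; the text does not fix a formula.
    """
    grouped = defaultdict(list)
    # Sub-step 1: associate each data object with its category.
    for category, features in labelled_objects:
        grouped[category].append(features)

    signatures = {}
    # Sub-step 2: calculate per-category feature values to form the signature.
    for category, feature_dicts in grouped.items():
        names = sorted({name for d in feature_dicts for name in d})
        signatures[category] = {
            name: mean(d.get(name, 0.0) for d in feature_dicts) for name in names
        }
    return signatures


# Hypothetical training data: feature names and values are invented.
training = [
    ("shopping", {"price_patterns": 12, "password_fields": 0}),
    ("shopping", {"price_patterns": 8, "password_fields": 1}),
    ("forum", {"price_patterns": 0, "password_fields": 1}),
]
signatures = build_signatures(training)
```

Where inductive logic programming is used, it would replace the averaging step; that variant is not shown here.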

It is preferred that the method includes the step of training the categorisation engine using the signatures.

Preferably, the signatures include structural features and the structural features include one or more of functional features, usage of keywords, visual layout, groupings of data, and/or patterns.

Brief Description of the Drawings

Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawing, in which Figure 1 shows a schematic diagram illustrating an embodiment of the invention.

Detailed Description of the Preferred Embodiments

The present invention provides a method and system for categorising an input data object by analysing the structure of the input data object using a signature for each category.

The present invention will now be described with reference to a document as the input data object. However, it will be appreciated that the input data object could be any data, such as a file or data stream.

A set of structural features is extracted from a training set of documents which have previously been associated with categories. A signature for each category is generated based on the extracted structural features which are associated with that category. A learning engine is trained on the signatures.

An uncategorised document is analysed to extract its structural features. The trained learning engine uses these extracted features to categorise this document.
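The train-then-categorise flow described above might look like the sketch below. It is an illustration under stated assumptions, not the specification's implementation: a small Gaussian naive Bayes model stands in for the learning engine, each signature is assumed to hold a (mean, variance) pair per feature, all categories are given equal prior probability, and the feature names and values are invented.

```python
import math
from statistics import mean, pvariance


def make_signature(feature_dicts, names):
    """Per-feature (mean, variance) for one category; variance is floored."""
    signature = {}
    for name in names:
        values = [d.get(name, 0.0) for d in feature_dicts]
        signature[name] = (mean(values), max(pvariance(values), 1e-3))
    return signature


def log_likelihood(features, signature):
    """Gaussian naive Bayes log-likelihood of the features under a signature."""
    total = 0.0
    for name, (mu, var) in signature.items():
        x = features.get(name, 0.0)
        total += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
    return total


def categorise(features, signatures):
    """Return the category whose signature best explains the features."""
    return max(signatures, key=lambda c: log_likelihood(features, signatures[c]))


# Hypothetical training set of already-extracted structural features.
names = ["internal_links", "password_fields", "price_patterns"]
training_set = {
    "forum": [
        {"internal_links": 120, "password_fields": 1, "price_patterns": 0},
        {"internal_links": 90, "password_fields": 1, "price_patterns": 1},
    ],
    "shopping": [
        {"internal_links": 60, "password_fields": 0, "price_patterns": 25},
        {"internal_links": 80, "password_fields": 1, "price_patterns": 40},
    ],
}
signatures = {c: make_signature(docs, names) for c, docs in training_set.items()}

# An uncategorised document, reduced to the same structural features.
new_document = {"internal_links": 70, "password_fields": 0, "price_patterns": 30}
predicted = categorise(new_document, signatures)
```

With these invented numbers the new document is assigned to "shopping", mainly because of its high count of price patterns.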

Figure 1 shows a structural analyzer 1.

The structural analyzer 1 accepts as input a training set 2 containing a set 3 of documents categorised in one category 4 and a set 5 of documents categorised in a second category 6.

The invention may be adapted for use with any number of categories. In an alternative embodiment a document in the training set may be categorised in more than one category.

The structural analyzer 1 determines a structural signature 7 and 8 for each category based upon the documents in the training set 2. Each signature 7 and 8 includes structural features that correlate to the category. The features can include functional features such as a password field, keyword usage such as prominent (for example, large font size) use of the word "news", visual layout such as a particular frame layout, groupings of data such as images grouped together, which could indicate screenshots, and patterns such as a price pattern.

The structural analyzer 1 may utilise an inductive logic programming (ILP) module to produce or assist with the production of the signatures 7 and 8.

In one embodiment the structural analyzer 1 may be assisted by a human user.

A feature extractor 9 is shown. The feature extractor 9 utilises the signatures 7 and 8 to determine which structural features are to be extracted from an input document 10.
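One way to read this is that the signatures act as a selection list for the extractor: only features named in some signature are computed for the input document. The sketch below illustrates that reading. The `EXTRACTORS` registry, the feature names, and the regular expressions are all hypothetical and deliberately simplistic.

```python
import re

# Hypothetical registry of extractors: feature name -> function of raw HTML.
EXTRACTORS = {
    "password_fields": lambda html: len(
        re.findall(r"<input[^>]*type=[\"']?password", html, re.I)
    ),
    "script_tags": lambda html: len(re.findall(r"<script\b", html, re.I)),
    "table_cells": lambda html: len(re.findall(r"<td\b", html, re.I)),
}


def features_to_extract(signatures):
    """Union of the feature names mentioned by any category signature."""
    return sorted({name for sig in signatures.values() for name in sig})


def extract(html, signatures):
    """Compute only the features that some signature refers to."""
    return {
        name: EXTRACTORS[name](html)
        for name in features_to_extract(signatures)
        if name in EXTRACTORS
    }


# Invented signatures: neither mentions "script_tags", so it is skipped.
signatures = {
    "forum": {"password_fields": 1.0, "table_cells": 40.0},
    "news": {"table_cells": 5.0},
}
page = (
    '<form><input type="password" name="p"></form>'
    "<table><tr><td>a</td><td>b</td></tr></table>"
)
features = extract(page, signatures)
```

Driving extraction from the signatures keeps the per-document work proportional to the features that can actually influence a category decision.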

A categorisation engine 11 is also shown. The categorisation engine may be a learning engine such as a Bayesian algorithm or a support vector machine (SVM).

The categorisation engine 11 utilises the category signatures 7 and 8 to produce a list 12 of categories that it considers the input document 10 belongs to. Where the categorisation engine 11 is a learning engine it may utilise the signature by being trained 13 on the signatures 7 and 8.

In one embodiment the categorisation engine is a rules engine that compares the features of the input document with the values of the features of the signatures.
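A rules engine of this kind could be sketched as below. The representation is an assumption made for illustration: each signature maps a feature name to an allowed (minimum, maximum) range, and a document is placed in every category whose rules it satisfies to at least a chosen share. The feature names and thresholds are invented.

```python
def matches(value, rule):
    """A rule is an assumed (minimum, maximum) range; None means unbounded."""
    low, high = rule
    return (low is None or value >= low) and (high is None or value <= high)


def categorise_by_rules(features, signatures, threshold=1.0):
    """Return the categories whose rules are satisfied by at least
    the `threshold` share of their features."""
    result = []
    for category, rules in signatures.items():
        satisfied = sum(
            matches(features.get(name, 0), rule) for name, rule in rules.items()
        )
        if rules and satisfied / len(rules) >= threshold:
            result.append(category)
    return result


# Hypothetical signatures expressed as value ranges.
signatures = {
    "shopping": {"price_patterns": (3, None), "basket_keywords": (1, None)},
    "search engine": {"search_forms": (1, None), "plain_text_chars": (None, 500)},
}
document = {
    "price_patterns": 14,
    "basket_keywords": 2,
    "search_forms": 1,
    "plain_text_chars": 4200,
}
categories = categorise_by_rules(document, signatures)
```

Returning a list rather than a single label mirrors the list 12 of categories produced by the categorisation engine 11; lowering `threshold` makes the engine more permissive.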

Structural features that may be identified by the structural analyzer 1 and incorporated into the category signatures 7 and 8 include generic features that reflect some aspect of web page design (structure) and category-specific features; for example, a price pattern in an HTML document.
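As an illustration of a category-specific feature, a price pattern could be counted with a regular expression along the following lines. The currency tokens and number format here are assumptions chosen for the example; the specification does not define them in this form.

```python
import re

# Assumed currency tokens and number format, chosen only for illustration.
CURRENCY = r"(?:\$|USD|EUR|yen)"
NUMBER = r"\d{1,3}(?:,\d{3})*(?:\.\d{2})?"
PRICE = re.compile(
    rf"(?:{CURRENCY}\s?{NUMBER})|(?:{NUMBER}\s?{CURRENCY})", re.IGNORECASE
)


def count_price_patterns(text):
    """Number of price-like substrings (currency before or after a number)."""
    return len(PRICE.findall(text))


sample = "Laptop bag $49.99, camera 12,500 yen, page 3 of 10"
price_count = count_price_patterns(sample)
```

The count, rather than a yes/no flag, is kept so that a page with a single incidental price can be distinguished from a catalogue page with dozens.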

Documents are a combination of information and structure. Information in general is the text present in the document. Structure may be a way of presenting text. For example, in a case where the text is about guns, the category of information of the text (or the theme of the text) is 'guns'. This text can be placed within a plain-text web page document. The text can also be within a message in a web forum. The text can be arranged as a book and sold through an on-line shop. In all these cases there is a document containing text about guns. The only distinguishing feature between these cases is the page structure or document structure. By analysing specific structural features, a structural signature for each of a forum, an on-line shop, and raw text can be created, and those different document structures can be differentiated.

Online forums and online shops can be about any subject (contain any theme). There are forums about guns, forums about sport, and forums about sex. Web pages from these forums can contain any text. Therefore text analysis to determine whether a web page belongs to a forum type can fail. In this case additional features are required which are not extracted from raw text (for example, the existence of a login box or the presence of a table of messages). All these additional features can be provided by analysis of the structure of the documents. These new features can then be used by categorisation engines to train on training sets of thematically generic documents, so as to be able to categorise new documents on the basis of structure instead of theme.
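The two example features mentioned above, a login box and a table of messages, can be detected without looking at the page's theme at all. The sketch below uses Python's standard `html.parser`; treating a password input as a login box and any table with at least `min_rows` rows as a table of messages are simplifying assumptions made for this example.

```python
from html.parser import HTMLParser


class ForumHints(HTMLParser):
    """Collects two assumed forum hints: password inputs and table rows."""

    def __init__(self):
        super().__init__()
        self.password_inputs = 0
        self.table_rows = 0

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and (attributes.get("type") or "").lower() == "password":
            self.password_inputs += 1
        elif tag == "tr":
            self.table_rows += 1


def forum_hints(html, min_rows=3):
    """Return structural hints that do not depend on the page's text theme."""
    parser = ForumHints()
    parser.feed(html)
    parser.close()
    return {
        "has_login_box": parser.password_inputs > 0,
        "has_message_table": parser.table_rows >= min_rows,
    }


page = """
<form><input type="text" name="user"><input type="password" name="pw"></form>
<table>
  <tr><td>Re: cleaning kits</td></tr>
  <tr><td>Re: match results</td></tr>
  <tr><td>New member intro</td></tr>
</table>
"""
hints = forum_hints(page)
```

The same hints would be produced whether the messages were about guns, sport, or anything else, which is the point of structural rather than thematic analysis.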

Generic structural features may include:

* Number of links to pages within the domain
* Number of links to pages outside the domain
* Links ratio - links to within the domain / links to outside the domain
* Link context size - length of the closest block that the link is in (count the average on the page, or the number of short/long blocks)
* Average link text size - computed by dividing the combined text size of all links by the total text size on the web page
* Scripts size ratio - computed by dividing the total size within the <SCRIPT> tags by the total size within all tags on the web page
* Average script size - computed as an average and maximum script size on the web page
* Input form size - computed separately for each type of form. Form types include text, password, checkbox, radio, submit, reset, file, hidden, image, and button
* GUID pattern - identify the existence of a GUID pattern within the web page by searching for a sequence of symbols of the following type: [0-9a-fA-F]{8}[-]?([0-9a-fA-F]{4}[-]?){3}[0-9a-fA-F]{12}
* Text size between successive tags - calculate the maximum, average and standard deviation for the "plain text" on the web page
* Maximum and average number of characters between successive tags
* HTML tags histogram - computed by dividing the count of each tag by the total count of all tags on the page
* Separation of various portions of the web page such as anchor text, headings, table headings, "sponsored links" and "advertisements", and text analysis of each portion identified with its portion
* Anchor text ratio - the ratio of anchor text characters to total text characters in an HTML document. The total text of the document is defined as the content of all DOM text nodes in the document, and the anchor text is the content of the DOM text nodes that are children of the anchor (<A>) elements
* Text per table data tag (TTD) - the average number of text characters within a table data tag (<TD>) in an HTML document. TTD is computed by dividing the total number of characters in DOM text nodes which are children of table data (<TD>) tags by the total number of table data tags in the document
* Image frequency
* Image size consistency
* Image-within-link frequency

Category-specific structural features for a "news" category may include:

* Existence of an RSS (Really Simple Syndication) feed
* Existence of a comment/reply capability
* Existence of a print version "button"
* Size and count of continuous text blocks
* Existence of polling on a web page
* Number of date patterns in text
* Number of time patterns in text (such as "2hrs ago")
* Existence of an archive calendar (a table with a continuous number of years or months)
* Existence of "related articles" with a list of links
* Existence of the following items as a sequence in a menu or as links on the page - "business", "world", "hi-tech", "technology", "finance"
* Existence of the following items as keywords on the page - "today", "hot news", "hot topics", "hot stories", "top articles", "breaking news", "latest news", "culture", "sport", "tech", "politics"
* Date pattern within the web page URL, such as lenta.ru/news/2006/07/27/corrupt/

Category-specific structural features for a "shopping" category may include:

* Price pattern within the plain text. For example: [{currency}] number [?(l){currency}] where currency is, for example, $ or yen
* Price pattern within anchor text (between <a></a> tags)
* The following keywords (text):
  o price, basket, cart, order, trolley, product
  o shopping bag
  o order now
  o add to {basket|cart|trolley|shopping bag}
  o buy
  o buy now
  when used within the following tags:
  o <input class={text} ....>
  o <input value={text} ....>
  o <td class={text} ....>
  o <img src={text} alt={text} ....>
* Existence of a "basket" entity
* Existence of the following items in a menu - "our product", "catalog", "internet store", "delivery"
* "Консультант" ("Consultant") - as a special block on the web page (present on many Russian e-shops)
* "email + icq + name" listing as a separate block

Category-specific structural features for a "software downloads" category may include:

* Links to binary files (such as zip, tar.gz, tar.bz2, exe)
* The following pattern "file size: <number> Mb(Kb)"
* The following pattern "requirements: <list of OSes or hardware items>". For example:
  "System requirements:
  o 98 / Me / 2000 / XP
  o Microsoft DirectX 8.1
  o Pentium II 366 MHz (Pentium III 600 MHz recommended)
  o 64 Mb RAM (128 Mb recommended)
  o Riva TNT, 8 Mb (GeForce256, 32 Mb recommended)"
* Inclusion of a software license (freeware, shareware, trial, demo)
* Existence of screenshots
* The following pattern "downloads: <number>"
* The following pattern "version - [v.|ver.] (digit*\.)* [alpha|beta|rc1|rc2]"
* Grouping of the above features in small blocks

Category-specific structural features for a "forums/newsgroups" category may include:

* Usage of a newsgroup-facilitating engine such as phpBB
* Site mostly in plain text with the following text in links - "next thread", "prev. thread", "reply", "sort by (thread, date)"
* Trees of links
* Keywords "thread", "message", "post", "reply" within anchor text or within URLs
* Links starting with "Re:"

Category-specific structural features for a "streaming media" category may include:

* "Listen/watch" links
* The following pattern "<number> kbps"
* Blocks of alphabetically sorted single-letter links (common to MP3 sites)
* Tables including a song title, a duration pattern, a file size pattern and a download link
* Links ending with a year (for example "Abbey Road - 1969")

Category-specific structural features for a "search engines/portals" category may include:

* A form with a text field and "search" button (such as <input type="submit" value="search">)
* Very little plain text, with most text in links
* Numerous links with a common set of categories: Business, Entertainment, Computers/Internet

It will be appreciated that the methods and systems described could be implemented in hardware or in software. Where the methods or systems of the invention are implemented in software, any suitable programming language such as C++ or Java may be used. It will further be appreciated that data and/or processing involved within the methods and systems may be distributed across more than one computer system.

One potential advantage of embodiments of the present invention is the greater efficacy of categorising thematically neutral documents such as news websites, shopping websites, software downloads, newsgroups, streaming media, and search engines.
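Several of the generic structural features listed in the detailed description (links ratio, anchor text ratio, HTML tags histogram, GUID pattern) can be computed in a single pass over a page. The sketch below is illustrative only. It uses Python for brevity, although the text mentions C++ or Java. The class name, the returned keys, the choice to strip surrounding whitespace from text nodes, and the rule that links without a host are compared by their resolved URL are all assumptions of this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# GUID pattern as given in the feature list (hyphens optional).
GUID = re.compile(r"[0-9a-fA-F]{8}-?(?:[0-9a-fA-F]{4}-?){3}[0-9a-fA-F]{12}")


class GenericFeatures(HTMLParser):
    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.domain = urlparse(page_url).netloc
        self.tags = Counter()
        self.internal = self.external = 0
        self.text_chars = self.anchor_chars = 0
        self.anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if tag == "a":
            self.anchor_depth += 1
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL before comparing.
                target = urlparse(urljoin(self.page_url, href)).netloc
                if target == self.domain:
                    self.internal += 1
                else:
                    self.external += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth:
            self.anchor_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())  # assumption: ignore surrounding whitespace
        self.text_chars += n
        if self.anchor_depth:
            self.anchor_chars += n


def generic_features(html, page_url):
    p = GenericFeatures(page_url)
    p.feed(html)
    p.close()
    total_tags = sum(p.tags.values())
    return {
        "internal_links": p.internal,
        "external_links": p.external,
        "links_ratio": p.internal / p.external if p.external else None,
        "anchor_text_ratio": p.anchor_chars / p.text_chars if p.text_chars else 0.0,
        "tag_histogram": {t: c / total_tags for t, c in p.tags.items()},
        "has_guid": bool(GUID.search(html)),
    }


# Hypothetical page and URL, used only to exercise the sketch.
html = (
    '<p>News</p><a href="/a">Home</a>'
    '<a href="http://other.example/x">Out</a>'
    "<p>id 3F2504E0-4F89-11D3-9A0C-0305E82C3301</p>"
)
result = generic_features(html, "http://site.example/index.html")
```

The links ratio is returned as `None` when a page has no external links, so that a downstream engine can treat "undefined" differently from a ratio of zero.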

Another potential advantage of an embodiment of the present invention is that it provides another angle of categorisation because it identifies new features within the data. This is useful when combined with the scores generated by other categorisation engines (such as text-based classifiers) for categorising "difficult to categorise" documents. Combination of categorisation scores is described in patent application CATEGORISATION OF DATA USING MULTIPLE CATEGORISATION ENGINES.

While the present invention has been illustrated by the description of the embodiments thereof, and while the embodiments have been described in considerable detail, it is not the intention of the applicant to restrict or in any way limit the scope of the appended claims to such detail. Additional advantages and modifications will readily appear to those skilled in the art.

Therefore, the invention in its broader aspects is not limited to the specific details, representative apparatus and method, and illustrative examples shown and described. Accordingly, departures may be made from such details without departure from the spirit or scope of applicant's general inventive concept.

Claims (16)

  1. A method for categorising an input data object, including the steps of: i) generating a signature for each of a plurality of categories; ii) analysing the structure of the input data object using the signatures; and iii) categorising the input data object based at least in part on the analysis.
  2. A method as claimed in claim 1 wherein the input data object is categorised in step iii using at least one categorisation engine.
  3. A method as claimed in any one of the preceding claims wherein the categorisation engine is a learning engine.
  4. A method as claimed in any one of the preceding claims wherein the signatures include features that are category-specific.
  5. A method as claimed in any one of the preceding claims wherein the signatures are generated with the assistance of an inductive logic programming module.
  6. A method as claimed in any one of the preceding claims wherein the step of generating the signature includes the sub-steps of: associating each data object of a plurality of data objects with a category; and calculating structural features for each category from the data objects associated with that category to form the signature.
  7. A method as claimed in claim 6 wherein inductive logic programming is used to calculate the structural features.
  8. A method as claimed in any one of the preceding claims including the step of training the categorisation engine using the signatures.
  9. A method as claimed in any one of the preceding claims wherein the signatures include structural features and the structural features include functional features.
  10. A method as claimed in any one of the preceding claims wherein the signatures include structural features and the structural features include usage of keywords.
  11. A method as claimed in any one of the preceding claims wherein the signatures include structural features and the structural features include visual layout.
  12. A method as claimed in any one of the preceding claims wherein the signatures include structural features and the structural features include groupings of data.
  13. A method as claimed in any one of the preceding claims wherein the signatures include structural features and the structural features include patterns.
  14. A system for categorising input data objects, including: a memory arranged for storing a signature for each of a plurality of categories; and a processor arranged for analysing the structure of an input data object using at least one of the signatures and categorising the input data object based at least in part on the analysis.
  15. A computer program arranged for effecting the method or system of any one of the preceding claims.
  16. Storage media arranged for storing a computer program as claimed in claim 15.
GB0624667A 2006-09-07 2006-12-11 Categorisation of Data using Structural Analysis Withdrawn GB2441598A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
UA200609643 2006-09-07

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/GB2007/003378 WO2008029153A1 (en) 2006-09-07 2007-09-07 Categorisation of data using structural analysis

Publications (2)

Publication Number Publication Date
GB0624667D0 GB0624667D0 (en) 2007-01-17
GB2441598A true GB2441598A (en) 2008-03-12

Family

ID=37711890

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0624667A Withdrawn GB2441598A (en) 2006-09-07 2006-12-11 Categorisation of Data using Structural Analysis

Country Status (1)

Country Link
GB (1) GB2441598A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1998058344A1 (en) * 1997-06-16 1998-12-23 The Dialog Corporation Text classification system and method
EP1096391A2 (en) * 1999-10-26 2001-05-02 Hewlett-Packard Company, A Delaware Corporation Automatic categorization of documents using document signatures
WO2002010957A2 (en) * 2000-07-31 2002-02-07 Eliyon Technologies Corporation Computer method and apparatus for determining content types of web pages
WO2006049581A1 (en) * 2004-11-05 2006-05-11 Dramtech (Asia Pacific) Pte Ltd A method to transmit and update a transmitted electronic document
EP1818839A1 (en) * 2006-02-14 2007-08-15 Accenture Global Services GmbH System and method for online information analysis

Also Published As

Publication number Publication date
GB0624667D0 (en) 2007-01-17

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)