WO2011051656A2 - Document processing system and method - Google Patents

Document processing system and method

Info

Publication number
WO2011051656A2
Authority
WO
WIPO (PCT)
Prior art keywords
document
processing system
document processing
keyword
language
Prior art date
Application number
PCT/GB2010/001975
Other languages
French (fr)
Other versions
WO2011051656A9 (en)
Inventor
Geoffrey Watts
Julia Fowler
Original Assignee
Stylescape Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Stylescape Limited
Publication of WO2011051656A2
Publication of WO2011051656A9

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques

Definitions

  • the present invention relates to a document processing system and method, in particular a document processing system for identifying opinions and trends.
  • Market as used herein means geographic market, such as a specific country or set of countries;
  • Vertical as used herein means a specific subject area, such as clothing, music, politics, etc.
  • Websites providing online reviews posted by customers may illustrate the public's opinion of existing and specific products and services, but by their nature they relate to after-the-fact analysis once that product or service has been developed and made available to the public, and they do not provide opinions within a given market or vertical as a whole.
  • the invention provides a document processing system comprising:
  • a document identification module configured to search document sources identified by a document source database
  • a rating module configured to apply a rating value to language fragments contained in the identified documents
  • a calculation module configured to determine an average rating value for language fragments containing a match for a selected keyword.
  • the rating module is configured to generate language fragments from text contained within the identified documents for storage in a database.
  • the system further comprises a document classification module configured to classify documents identified by the document identification module.
  • each document is classified into one or both of a vertical corresponding to a field of the document, and a geographical market.
  • each document is classified into a vertical and a topic within that vertical.
  • classification of the documents is determined using naive Bayesian analysis.
  • each document is assigned to a time period encompassing a determined publication date of the document and wherein the calculation module is configured to calculate the average rating for each time period.
  • the language fragments comprise words expressing opinion, and wherein the rating is determined by the expressed opinion.
  • the ratings are determined by analysing each language fragment to determine an opinion expressed in the document selected from a group comprising positive opinion and negative opinion, and applying a rating according to the determined opinion.
  • the group further comprises neutral opinion.
  • the average rating value relates to opinion of a physical object.
  • the language fragments are analysed using naive Bayesian analysis to determine the rating value.
  • the system further comprises a source scoring module configured to apply a weighting to each rating value.
  • the weighting value is determined by the source of the language fragment.
  • the weighting value is determined, at least in part, by an external quality indicator.
  • the external quality indicators are selected from one or more of: a website ranking, and number of times a website has been bookmarked.
  • documents identified by the document identifier are stored in a document storage database and wherein the weighting value is determined, at least in part, by the number of times the source of the language fragment has been cited by documents stored in the document storage database.
  • the system further comprises a module for determining the language of the document.
  • the language of the document is determined using an n-gram probabilistic approach.
  • the keyword is a predetermined keyword stored in a keyword database.
  • the system is configured to identify the keyword associated with the highest average rating value.
  • the system is configured to generate a list of keywords in order of the average rating value associated with each keyword.
  • the keyword is a keyword selected by a system user.
  • the system further comprises a display module for displaying the average rating value on a display screen.
  • the display module is configured to display, for each time period, the number of language fragments containing the selected keyword.
  • the display module is configured to display the average rating for each time period.
  • the display module is configured to display the list of keywords in order of the average rating value associated with each keyword.
  • the system further comprises an application programming interface for linking the system to a machine.
  • the machine is an automated product ordering or manufacturing control system.
  • the invention comprises a rating module for use in a document processing system, the rating module being configured to:
  • the invention provides a document processing method comprising the steps of: identifying a plurality of documents comprising text;
  • the method comprises the step of generating language fragments from text contained within the identified documents for storage in a database.
  • the method comprises the step of classifying documents identified by the document identification module.
  • the method comprises the step of classifying each document into one or both of a vertical corresponding to a field of the document, and a geographical market.
  • each document is classified into a vertical and a topic within that vertical.
  • classification of the documents is determined using naive Bayesian analysis.
  • each document is assigned to a time period encompassing a determined publication date of the document and wherein the calculation module is configured to calculate the average rating for each time period.
  • the language fragments comprise words expressing opinion, and wherein the rating is determined by the expressed opinion.
  • the ratings are determined by analysing each language fragment to determine an opinion expressed in the document selected from a group comprising positive opinion and negative opinion, and applying a rating according to the determined opinion.
  • the group further comprises neutral opinion.
  • the average rating value relates to opinion of a physical object.
  • the language fragments are analysed using naive Bayesian analysis to determine the rating value.
  • the method comprises the step of applying a weighting to each rating value.
  • the weighting value is determined by the source of the language fragment.
  • the weighting value is determined, at least in part, by an external quality indicator.
  • the external quality indicators are selected from one or more of: a website ranking, and number of times a website has been bookmarked.
  • documents identified by the document identifier are stored in a document storage database and wherein the weighting value is determined, at least in part, by the number of times the source of the language fragment has been cited by documents stored in the document storage database.
  • the method comprises the step of determining the language of the document.
  • the language of the document is determined using an n-gram probabilistic approach.
  • the keyword is a predetermined keyword stored in a keyword database.
  • the method comprises the step of identifying the keyword associated with the highest average rating value.
  • the method comprises the step of generating a list of keywords in order of the average rating value associated with each keyword.
  • the keyword is a keyword selected by a system user.
  • the invention provides computer program code which when run on a computer causes the computer to perform the method according to the third aspect.
  • the invention provides an article of manufacture.
  • the article of manufacture is
  • the invention provides computer program product comprising
  • the invention provides a computer readable medium recorded with computer readable code arranged to cause a computer to perform the method according to the third aspect,
  • Document as used herein means any file containing text in any natural language that is accessible to a document identifier including but not limited to a Webcrawler, including but not limited to text files of web pages (with or without html); word processor files such as Word .doc files; and .pdf files.
  • Figure 1 is an overview of an embodiment of a document processing system
  • Figure 2 illustrates document classification according to the embodiment of Figure 1 ;
  • Figure 3 illustrates source scoring according to the embodiment of Figure 1 ;
  • Figure 4 illustrates sentiment analysis according to the embodiment of Figure 1
  • Figure 5 illustrates an output of the embodiment of Figure 1 arising from analysis using keywords in a keyword database
  • Figure 6 illustrates an output of the embodiment of Figure 1 based on selection of a keyword of Figure 5.
  • Keyword querying of the classified and indexed documents either based on predetermined keywords stored in a keyword database or keywords selected by a system user.
  • a document identifier in the form of a Webcrawler 102 identifies documents from the Internet, taken from document sources that are identified to Webcrawler 102 by document source database 104.
  • the document sources contained in document source database 104 and that are checked by the crawler 102 may include any website, including websites containing reviews and opinions, blogs, Twitter, etc. This may include websites from which files may be retrieved directly, and / or websites which may be queried via a search interface in order to retrieve relevant documents.
  • the document identifier in this embodiment is a webcrawler that refers to websites as document sources
  • the document identifier for use in the system of the invention may be an identifier that refers to other document sources, for example captions for television that may or may not be displayed on television (open and closed captions, respectively), or text from speech, wherein documents are generated using speech recognition software.
  • Documents identified by crawler 102 are then classified by document classifier 106 as described below.
  • the document source database 104 may comprise only sources that have been manually curated into the database; however, the system may provide for automatic acquisition of new sources over time, as described in more detail below under "Automatic Sourcing".
  • Image store database 115 may store images taken from documents identified by the document identifier. Although the system of the present invention is concerned with processing of text in documents as described below, images may also be captured and indexed, for example according to colour.
  • Document classifier 106 applies one or more classifications to each document
  • Classifications that may be applied include:
  • Market - a market classification may be applied to identify the geographical market to which the document applies such as a specific country or group of countries.
  • Vertical - a vertical classification may be applied to identify the subject area that the document is concerned with (e.g. fashion, arts, electronic goods, etc.)
  • the document source database 104 identifies to document classifier 106 at least one market and at least one vertical that the document source is relevant to. If the document source database identifies only one market and only one vertical then documents from that source may simply be tagged by document classifier 106 with the relevant market and vertical.
  • If the document source database 104 identifies that a document source is relevant to two or more markets and / or two or more verticals then the document is analysed by document classifier 106 to determine which classifications should be applied.
  • a document identified by the crawler 102 is read to identify its source. If at step 204 the document is identified as deriving from a document source that is relevant to more than one vertical then it is classified using naive Bayesian analysis against a representative corpus (i.e. text collection) of data taken from vertical corpus database 206 containing terms relevant to different verticals using the Monte Carlo method at step 208, and assigning a best-fit vertical to the document at step 210.
  • the vertical corpus database 206 contains data taken from the document index database 110, described below.
  • classifications may be applied directly to the document, and document classifier 106 may be bypassed. Moreover, it will be appreciated that the document classifier 106 may be bypassed if a document source is relevant to one vertical only and the system is configured to classify documents by vertical but not by market, or if a document source is relevant to one market only and the system is configured to classify documents by market but not by vertical.
  • URLs are normalised for citation matching at step 220 and the classified documents, tagged with classification information, are stored in classified document store 108 (also shown in Figure 1) that may be interrogated as described later.
  • the normalisation process allows documents from the same document source to be identified, as described below in source scoring.
  • normalisation may involve expansion of a short alias of a longer URL, such as an alias generated using tinyURL.com.
  • each document may be further classified according to the
  • Topic addressed by the document e.g. the topic of womenswear in the case of a document classified within the vertical of fashion, or the topic of mobile phones in the case of a document classified within the vertical of electronic goods.
  • Each topic may be further subdivided at least once into subtopics, sub-subtopics, and so on.
  • Sub-classification is carried out in the same way as classification described above.
  • Documents within classified document store 108 are analysed by source analyser and scorer 110 to identify potential new document sources. If two or more documents from different sources all refer to a further source then that further source may be added to the document source database 104 as a new document source, if it is not already in the document source database 104.
  • Each document source carries a weighting value determined by a number of factors to reflect the quality of that source, and that weighting value is stored in document source database.
  • a weighting value is generated for both manually curated sources and automatically identified sources.
  • At step 301 an existing source score is retrieved from document source database 104 or, for a new source, an arbitrary score may initially be applied.
  • a manually curated source may carry a different weighting to an automatically identified source.
  • External quality indicators 304 and the number of times the source has been referenced are used to determine the quality of each source, and from this the weighting value to be applied to documents originating from that source.
  • External quality indicators 304 include statistics on how often a source website has been bookmarked or the ranking of the source website (for example in Google, Yahoo BOSS or Alexa results).
  • the external quality indicators for a given source may be general, or may be specific to different markets and / or verticals. These quality indicators are aggregated to generate an external quality value that is applied at step 305 as part of the process of determining a weighting value to the document source.
  • the contents of the classified document store are analysed to determine the number of times each document source has been cited as an indication of the quality of that source.
  • If a source is relevant to only one market and / or vertical then only one citation score is applied. However, if a source is relevant to more than one vertical and / or market then different citation scores may be applied based on the market and / or vertical associated with the citing document or documents. The number of citations for a given source is aggregated to generate a citation value that is applied at step 305 as a further part of the process of determining a weighting value to the document source.
  • the citation value and external quality value are assessed by means of establishing a set of scores that fall within the bounds of a Wilson score confidence interval by using the following formula:
  • z_{1-α/2} is the 1 - α/2 percentile of a standard normal distribution of scores, and n is the count of the regular scores being considered.
  • the scoring process is repeated regularly, and can be repeated continuously to
  • Each document in classified document store 108 is fragmented into language
  • the language fragments contain measurable expressions of sentiment.
  • the language fragments are first normalised into stemmed text at step 402 and analysed to determine a sentiment associated with that language fragment using naive Bayesian analysis by means of a classifier such as that discussed by A. McCallum and K. Nigam, "A Comparison of Event Models for Naive Bayes Text Classification" pp. 41-48,
  • the sentiment may be selected from positive, negative or neutral only, or may include further options such as very positive or very negative.
  • Stemming of text includes, in particular, generating word stems. For example, the words "sequins" or "sequinned" may be stemmed to "sequin".
  • the fragmentation and stemming may be conducted using software such as that available from LingPipe (http://alias-i.com/lingpipe/) or NLTK (www.nltk.org).
  • Language fragments typically comprise a name or a noun and an associated adjective.
  • the corpora of language fragments used for sentiment analysis include manually scored corpora 404 and, once the document index has been populated, a Monte Carlo selected subset of document index database 110 (described below). This subset is changed regularly.
  • These corpora include, in particular, adjectives tagged with a sentiment rating value.
  • Database 110 also provides the automatic training corpora 405 for Bayesian analyser.
  • a list of keywords for each market, vertical and / or detail is stored in keyword
  • the keywords may include nouns, adjectives (e.g. colours), names (of places, people, brands, musical groups etc.), etc.
  • Keywords is used herein for simplicity, however for the avoidance of doubt it will be appreciated that this term does not just encompass terms of a single word, but may include terms containing characters other than letters, such as numerals or punctuation symbols, and may contain more than one word (for example, a forename and a surname).
  • the keyword database may include defined relationships between keywords, for example a fashion designer's name may be associated with a specific brand name or shop name.
  • the system may search for relationships between keywords. For example, if the occurrence of a keyword in a document is accompanied by the occurrence of another keyword in a percentage of documents in the classified document database that is above a threshold percentage, then the keywords may be identified as being linked.
  • the keyword database contains an initial set of keywords that are derived from a curated taxonomy associated with the vertical in question.
  • nouns or other keywords, such as names
  • a pre-defined grammar that can vary by vertical and language
  • Figure 5 illustrates the output of an analysis of the n keywords within a given period that have the most positive associated sentiment (in this case, the 20 most popular keywords in the vertical of womenswear for the topic of detail in the UK market for the period of 24 May - 1 October).
  • a publication date is typically available from the document source. If the publication date is a date in the future then the date of indexing is applied as the publication date. In order to identify documents that may carry an incorrect publication date in the past, the publication date of a document being processed by the system may be compared to the publication date of the previous document from the same source. If the publication date of the document is later than the publication date of the earlier document then an error may be flagged for manual checking and correction.
  • This sentiment score for each keyword is calculated by averaging the sum of the values applied to the relevant language fragments from the sentiment analysis process, taking into account the weighting for the source of each language fragment.
  • the average sentiment for a given keyword may be calculated using the formula:
  • n is the number of language fragments containing a match with the identified keyword
  • s(i) is the sentiment score associated with occurrence i of the keyword (for example, +1, -1 or 0 for positive, negative and neutral sentiment respectively)
  • w(i) is the weighting value for the source of occurrence i of the keyword
  • C is a constant proportional to the size of the result set
  • m is the prior mean centring the results around neutral.
  • this calculation is performed by a calculation module (not shown) which may be provided between the document index database 110 and the API 111.
  • a chart illustrates the sentiment score associated with the keyword (shown in Figure 5 as "Love", "Hate" and "Neutral"). This chart is generated for any given keyword using the sum of the weighted sentiment values for each sentiment classification.
  • Figure 6 which shows a search for the keyword "sequins".
  • the results are presented in the form of a bar chart showing the number of hits for the search term for a predefined time period (in this example, hits per week).
  • the sentiment for each time period, the sentiment having been calculated as described above.
  • the resultant line graph clearly shows variation in trends over time.
  • the bar chart illustrates trends over time, and in particular serves to identify time periods during which interest in the searched term was particularly high.
  • a user can interrogate the classified document database and the document index database using query tools 114, for example a search tool for searching any keyword. It will be appreciated that this may be any keyword, and may or may not be the same as a keyword in keyword database 112.
  • Keywords may be stemmed in the same way as language fragments described above.
  • a search for the keyword "sequins" may be stemmed to "sequin".
  • an unstemmed keyword used in a query may be matched to a stemmed language fragment without first stemming the keyword.
  • the process of matching a keyword to a language fragment in a document may comprise matching a stemmed keyword to a stemmed language fragment, or matching an unstemmed keyword to a stemmed language fragment.
  • stemming of language fragments in documents, it will be appreciated that stemming of language fragments may be omitted, and unstemmed language fragments may be matched to stemmed or unstemmed keywords.
  • the API may also allow for access to the system to allow information stored in the system's databases to be automatically fed into applications such as demand analysis packages, sales forecasting or data warehousing tools, for example to control ordering or manufacture of goods.
  • the present invention can be used in any vertical and any market to identify and summarise trends and opinions within those verticals and markets, for example: retail demand prediction;
  • Storage type media include any or all of the memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunications networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer of the network operator or carrier into the computer platform(s) that serve as the document processing system.
  • another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and other various air-links.
  • the physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software.
  • terms such as computer or machine "readable medium" refer to any medium that participates in providing instructions to a processor for execution.
  • a machine readable medium may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium.
  • Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the document processing.
  • Volatile storage media include dynamic memory, such as main memory of such a computer platform.
  • Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system.
  • Carrier-wave transmission media can take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications.
  • Computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer can read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.

Abstract

Document processing system and method comprising a document identification module for searching document sources identified by a document source database, a rating module configured to apply a rating value to language fragments contained in the identified documents and a calculation module configured to determine an average rating value for language fragments containing a match for a selected keyword.

Description

Document processing system and method
Field of the Invention
[0001] The present invention relates to a document processing system and method, in particular a document processing system for identifying opinions and trends.
Background of the Invention
[0002] There is now an enormous amount of information available electronically, including both objective information (facts) and subjective information (opinions).
[0003] In searching for factual information, use of keywords in a basic search engine is often adequate for identifying the facts being sought.
[0004] However, in searching for opinions a problem arises because it is difficult to build an accurate picture of sentiment for any given topic within a specific vertical, within a specific market, or across the public as a whole.
[0005] "Market" as used herein means geographic market, such as a specific country or set of countries; "Vertical" as used herein means a specific subject area, such as clothing, music, politics, etc.
[0006] Within any given vertical, there may be numerous diverse opinions upon any given topic within that vertical. The sheer quantity of information that is typically available for any given topic makes it difficult to analyse this information in order to build a comprehensive view on opinions relating to that topic, and this analysis is further complicated because opinions from some sources may be considered to be more authoritative than opinions from other sources; an opinion that may have been considered authoritative at the time it was written may have since become obsolete; and an opinion may be relevant to only one or a limited number of markets only.
[0007] Although applications such as Google trends, Google Insight and Bing xRank may show trends based on volumes of popular search terms over time, they provide no analysis of opinion.
[0008] Websites providing online reviews posted by customers may illustrate the public's opinion of existing and specific products and services, but by their nature they relate to after-the-fact analysis once that product or service has been developed and made available to the public, and they do not provide opinions within a given market or vertical as a whole.
Summary of the Invention
[0009] In a first aspect, the invention provides a document processing system comprising:
a document identification module configured to search document sources identified by a document source database;
a rating module configured to apply a rating value to language fragments contained in the identified documents; and
a calculation module configured to determine an average rating value for language fragments containing a match for a selected keyword.
[0010] Optionally, the rating module is configured to generate language fragments from text contained within the identified documents for storage in a database.
[0011] Optionally, the system further comprises a document classification module configured to classify documents identified by the document identification module.
[0012] Optionally, each document is classified into one or both of a vertical corresponding to a field of the document, and a geographical market.
[0013] Optionally, each document is classified into a vertical and a topic within that vertical.
[0014] Optionally, classification of the documents is determined using naive Bayesian analysis.
[0015] Optionally, each document is assigned to a time period encompassing a determined publication date of the document and wherein the calculation module is configured to calculate the average rating for each time period.
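By way of illustration only, the assignment of rated fragments to time periods and the per-period average might be sketched as follows. The weekly period, the function names and the sample data are assumptions made for this example; the application does not fix the period length or prescribe an implementation.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_start(d: date) -> date:
    """Return the Monday of the week containing d (the assumed time period)."""
    return d - timedelta(days=d.weekday())


def average_rating_by_period(rated_fragments):
    """Average the ratings of fragments falling in the same weekly period.

    rated_fragments: iterable of (publication_date, rating) pairs.
    Returns a dict mapping period start date -> unweighted mean rating.
    """
    totals = defaultdict(lambda: [0.0, 0])  # period -> [sum of ratings, count]
    for published, rating in rated_fragments:
        bucket = totals[week_start(published)]
        bucket[0] += rating
        bucket[1] += 1
    return {period: s / n for period, (s, n) in sorted(totals.items())}


# Hypothetical ratings: +1 positive, -1 negative.
fragments = [
    (date(2010, 5, 24), 1),
    (date(2010, 5, 26), -1),
    (date(2010, 6, 1), 1),
]
averages = average_rating_by_period(fragments)
```

The first two fragments fall in the week beginning 24 May 2010 and cancel out; the third falls in the following week.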
[0016] Optionally, the language fragments comprise words expressing opinion, and wherein the rating is determined by the expressed opinion.
[0017] Optionally, the ratings are determined by analysing each language fragment to determine an opinion expressed in the document selected from a group comprising positive opinion and negative opinion, and applying a rating according to the determined opinion.
[0018] Optionally, the group further comprises neutral opinion.
[0019] Optionally, the average rating value relates to opinion of a physical object.
[0020] Optionally, the language fragments are analysed using naive Bayesian analysis to determine the rating value.
[0021] Optionally, the system further comprises a source scoring module configured to apply a weighting to each rating value.
[0022] Optionally, the weighting value is determined by the source of the language fragment.
[0023] Optionally, the weighting value is determined, at least in part, by an external quality indicator.
[0024] Optionally, the external quality indicators are selected from one or more of: a website ranking, and number of times a website has been bookmarked.
[0025] Optionally, documents identified by the document identifier are stored in a document storage database and wherein the weighting value is determined, at least in part, by the number of times the source of the language fragment has been cited by documents stored in the document storage database.
[0026] Optionally, the system further comprises a module for determining the language of the document.
[0027] Optionally, the language of the document is determined using an n-gram probabilistic approach.
[0028] Optionally, the keyword is a predetermined keyword stored in a keyword database.
[0029] Optionally, the system is configured to identify the keyword associated with the highest average rating value.
[0030] Optionally, the system is configured to generate a list of keywords in order of the average rating value associated with each keyword.
[0031] Optionally, the keyword is a keyword selected by a system user.
[0032] Optionally, the system further comprises a display module for displaying the average rating value on a display screen.
[0033] Optionally, the display module is configured to display, for each time period, the number of language fragments containing the selected keyword.
[0034] Optionally, the display module is configured to display the average rating for each time period.
[0035] Optionally, the display module is configured to display the list of keywords in order of the average rating value associated with each keyword.
[0036] Optionally, the system further comprises an application programming interface for linking the system to a machine.
[0037] Optionally, the machine is an automated product ordering or manufacturing control system.
[0038] In a second aspect, the invention comprises a rating module for use in a document processing system, the rating module being configured to:
receive documents identified by a document identification module;
generate language fragments from text contained within the identified documents and apply a rating value to language fragments contained in the identified documents; and
communicate with a document index database for storage of rated language fragments.
[0039] In a third aspect, the invention provides a document processing method comprising the steps of:
identifying a plurality of documents comprising text;
identifying language fragments in the text of the identified documents;
applying a rating to each language fragment; and
determining an average rating value for language fragments containing a selected keyword.
[0040] Optionally, the method comprises the step of generating language fragments from text contained within the identified documents for storage in a database.
[0041] Optionally, the method comprises the step of classifying documents identified by the document identification module.
[0042] Optionally, the method comprises the step of classifying each document into one or both of a vertical corresponding to a field of the document, and a geographical market.
[0043] Optionally, each document is classified into a vertical and a topic within that vertical.
[0044] Optionally, classification of the documents is determined using naive Bayesian analysis.
[0045] Optionally, each document is assigned to a time period encompassing a determined publication date of the document and wherein the calculation module is configured to calculate the average rating for each time period.
[0046] Optionally, the language fragments comprise words expressing opinion, and wherein the rating is determined by the expressed opinion.
[0047] Optionally, the ratings are determined by analysing each language fragment to determine an opinion expressed in the document selected from a group comprising positive opinion and negative opinion, and applying a rating according to the determined opinion.
[0048] Optionally, the group further comprises neutral opinion.
[0049] Optionally, the average rating value relates to opinion of a physical object.
[0050] Optionally, the language fragments are analysed using naive Bayesian analysis to determine the rating value.
[0051] Optionally, the method comprises the step of applying a weighting to each rating value.
[0052] Optionally, the weighting value is determined by the source of the language fragment.
[0053] Optionally, the weighting value is determined, at least in part, by an external quality indicator.
[0054] Optionally, the external quality indicators are selected from one or more of: a website ranking, and number of times a website has been bookmarked.
[0055] Optionally, documents identified by the document identifier are stored in a document storage database and wherein the weighting value is determined, at least in part, by the number of times the source of the language fragment has been cited by documents stored in the document storage database.
[0056] Optionally, the method comprises the step of determining the language of the document.
[0057] Optionally, the language of the document is determined using an n-gram probabilistic approach.
[0058] Optionally, the keyword is a predetermined keyword stored in a keyword database.
[0059] Optionally, the method comprises the step of identifying the keyword associated with the highest average rating value.
[0060] Optionally, the method comprises the step of generating a list of keywords in order of the average rating value associated with each keyword.
[0061] Optionally, the keyword is a keyword selected by a system user.
[0062] In a fourth aspect, the invention provides computer program code which when run on a computer causes the computer to perform the method according to the third aspect.
[0063] In a fifth aspect, the invention provides an article of manufacture. The article includes a machine-readable storage medium and executable program instructions embodied in the machine-readable storage medium that when executed by a programmable system cause the system to perform the method according to the third aspect.
[0064] In a sixth aspect, the invention provides a computer program product comprising program code means for performing the method according to the third aspect.
[0065] In a seventh aspect, the invention provides a computer readable medium recorded with computer readable code arranged to cause a computer to perform the method according to the third aspect.
[0066] "Document" as used herein means any file containing text in any natural language that is accessible to a document identifier including but not limited to a Webcrawler, including but not limited to text files of web pages (with or without html); word processor files such as Word .doc files; and .pdf files.
Description of the Drawings
[0067] The invention will now be described in more detail with reference to the Figures wherein:
Figure 1 is an overview of an embodiment of a document processing system;
Figure 2 illustrates document classification according to the embodiment of Figure 1;
Figure 3 illustrates source scoring according to the embodiment of Figure 1;
Figure 4 illustrates sentiment analysis according to the embodiment of Figure 1;
Figure 5 illustrates an output of the embodiment of Figure 1 arising from analysis using keywords in a keyword database; and
Figure 6 illustrates an output of the embodiment of Figure 1 based on selection of a keyword of Figure 5.
Detailed Description of the Invention
[0068] The system and method disclosed in Figure 1 can be divided into four stages:
Document sourcing in which documents are sourced, and the quality of document sources is determined;
Classification and storage of documents obtained from document sources;
Analysis of the documents to determine the sentiment expressed within those documents, and indexing of the determined sentiment; and
Keyword querying of the classified and indexed documents, either based on predetermined keywords stored in a keyword database or keywords selected by a system user.
Each stage is described in more detail below.
Sourcing
[0069] With reference to Figure 1, a document identifier in the form of a Webcrawler 102 identifies documents from the Internet taken from document sources that are identified to Webcrawler 102 by document source database 104. The document sources contained in document source database 104 and that are checked by the crawler 102 may include any website, including websites containing reviews and opinions, blogs, Twitter, etc. This may include websites from which files may be retrieved directly, and / or websites which may be queried via a search interface in order to retrieve relevant documents.
[0070] Although the document identifier in this embodiment is a webcrawler that refers to websites as document sources, it will be appreciated that the document identifier for use in the system of the invention may be an identifier that refers to other document sources, for example captions for television that may or may not be displayed on television (open and closed captions, respectively), or text from speech, wherein documents are generated using speech recognition software.
[0071] Documents identified by crawler 102 are then classified by document classifier 106 as described below.
[0072] Initially, the document source database 104 may comprise only sources that have been manually curated into the database; however, the system may provide for automatic acquisition of new sources over time as described in more detail below under "Automatic sourcing".
[0073] Image store database 115 may store images taken from documents identified by the document identifier. Although the system of the present invention is concerned with processing of text in documents as described below, images may also be captured and indexed, for example according to colour.
Classification
[0074] Document classifier 106 applies one or more classifications to each document identified by the document identifier. Classifications that may be applied include:
[0075] Market - a market classification may be applied to identify the geographical market to which the document applies such as a specific country or group of countries.
[0076] Vertical - a vertical classification may be applied to identify the subject area that the document is concerned with (e.g. fashion, arts, electronic goods, etc.).
[0077] The document source database 104 identifies to document classifier 106 at least one market and at least one vertical that the document source is relevant to. If the document source database identifies only one market and only one vertical then documents from that source may simply be tagged by document classifier 106 with the relevant market and vertical.
[0078] However, if the document source database 104 identifies that a document source is relevant to two or more markets and / or two or more verticals then the document is analysed by document classifier 106 to determine which classifications should be applied.
[0079] The document classification process is described in more detail in Figure 2. In a first step 202 a document identified by the crawler 102 is read to identify its source. If at step 204 the document is identified as deriving from a document source that is relevant to more than one vertical then it is classified using naive Bayesian analysis against a representative corpus (i.e. text collection) of data taken from vertical corpus database 206 containing terms relevant to different verticals using the Monte Carlo method at step 208, and assigning a best-fit vertical to the document at step 210. The contents of the vertical corpus database 206 comprise data taken from the document index database 110, described below.
[0080] As mentioned above, if the document source database identifies that the source for a given document is relevant to one vertical only then that vertical is simply assigned to the document at step 212.
[0081] The same analysis is applied if a document derives from a document source that is identified as being relevant to more than one market, starting at step 214 onwards, in which case analysis is made against a corpus of data contained within a market corpus database containing terms relevant to different markets.
[0082] According to this embodiment, all documents are classified by document classifier 106. However, if the document source is recorded as being relevant to only one market and only one vertical then it will be appreciated that the relevant classifications may be applied directly to the document, and document classifier 106 may be bypassed. Moreover, it will be appreciated that the document classifier 106 may be bypassed if a document source is relevant to one vertical only and the system is configured to classify documents by vertical but not by market, or if a document source is relevant to one market only and the system is configured to classify documents by market but not by vertical.
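By way of illustration only, the naive Bayesian assignment of a best-fit vertical may be sketched as a multinomial naive Bayes classifier with Laplace smoothing. The corpus contents, the whitespace tokenisation and the names `train`, `classify` and `VERTICAL_CORPUS` are assumptions made for this sketch and are not taken from the embodiment; the Monte Carlo selection of the corpus is not shown.

```python
import math
from collections import Counter


def train(corpus):
    """corpus maps a vertical name to a list of example texts (hypothetical data)."""
    counts = {
        label: Counter(word for text in texts for word in text.lower().split())
        for label, texts in corpus.items()
    }
    vocab = {word for word_counts in counts.values() for word in word_counts}
    total_docs = sum(len(texts) for texts in corpus.values())
    priors = {label: len(texts) / total_docs for label, texts in corpus.items()}
    return counts, vocab, priors


def classify(text, model):
    """Return the vertical with the highest log posterior for the text."""
    counts, vocab, priors = model
    best_label, best_score = None, -math.inf
    for label, word_counts in counts.items():
        total = sum(word_counts.values())
        score = math.log(priors[label])
        for word in text.lower().split():
            if word in vocab:  # words unseen in every corpus carry no evidence
                # Laplace (add-one) smoothing avoids zero probabilities
                score += math.log((word_counts[word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical miniature corpus standing in for vertical corpus database 206
VERTICAL_CORPUS = {
    "fashion": ["sequins dress runway designer", "womenswear dress collection"],
    "electronics": ["mobile phone battery screen", "laptop screen processor"],
}
MODEL = train(VERTICAL_CORPUS)
```

With this toy corpus, `classify("a sequins dress on the runway", MODEL)` returns `"fashion"`. The same sketch could be applied to market classification by substituting a corpus keyed by market.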
[0083] After classification is complete, an n-gram probabilistic approach (see W. B. Cavnar and J. M. Trenkle, "N-gram-based text categorization," pp. 161-175, 1994, http://www.let.rug.nl/~vannoord/TextCat/textcat.pdf) is used to determine the natural language of each document at steps 216 and 218. If the natural language cannot be identified by this approach then a language may be assigned to the document based on the relevant market.
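By way of illustration only, an n-gram approach in the style of Cavnar and Trenkle may be sketched as follows: each language is represented by a ranked profile of its most frequent character n-grams, and a document is assigned to the language whose profile has the smallest "out-of-place" rank distance. The sample texts, the profile size and all function names are assumptions made for this sketch; a practical system would train each profile on a much larger text collection.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Map each of the most frequent character n-grams to its frequency rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language get a fixed penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def detect_language(text, lang_profiles, fallback=None):
    """Return the closest language, or the fallback (e.g. the market default) for empty text."""
    if not text.strip():
        return fallback
    doc_profile = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]))


# Hypothetical training samples, far smaller than a realistic corpus
EN_SAMPLE = "the cat is on the table and the dress is red"
FR_SAMPLE = "le chat est sur la table et la robe est rouge"
PROFILES = {"en": ngram_profile(EN_SAMPLE), "fr": ngram_profile(FR_SAMPLE)}
```

The `fallback` argument stands in for the assignment of a language based on the relevant market when no language can be identified.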
[0084] URLs are normalised for citation matching at step 220 and the classified documents, tagged with classification information, are stored in classified document store 108 (also shown in Figure 1) that may be interrogated as described later. The normalisation process allows documents from the same document source to be identified, as described below under "Source scoring". In particular, normalisation may involve expansion of a short alias of a longer URL, such as an alias generated using tinyURL.com.
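By way of illustration only, URL normalisation for citation matching may be sketched using the Python standard library. The particular rules chosen here (lower-casing the host, dropping a leading "www.", default ports, trailing slashes and fragments) are assumptions made for this sketch; expansion of short aliases requires a network request and is not shown.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalise_url(url):
    """Reduce equivalent spellings of a URL to one canonical form."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Keep the port only when it differs from the default for the scheme
    port = parts.port
    netloc = host if port in (None, DEFAULT_PORTS.get(scheme)) else f"{host}:{port}"
    path = parts.path.rstrip("/") or "/"
    # The fragment is dropped because it does not identify a separate document
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

For example, `normalise_url("HTTP://WWW.Example.com:80/Blog/post/#comments")` yields `"http://example.com/Blog/post"`, so that two citations of the same page written differently can be matched.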
Sub-classifications
[0085] Within each vertical, each document may be further classified according to the specific topic addressed by the document (e.g. the topic of womenswear in the case of a document classified within the vertical of fashion, or the topic of mobile phones in the case of a document classified within the vertical of electronic goods). Each topic may be further subdivided at least once into subtopics, sub-subtopics, and so on. Sub-classification is carried out in the same way as classification described above.
Automatic sourcing
[0086] Documents within classified document store 108 are analysed by source analyser and scorer 110 to identify potential new document sources. If two or more documents from different sources all refer to a further source then that further source may be added to the document source database 104 as a new document source, if it is not already in the document source database 104.
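By way of illustration only, the identification of potential new sources may be sketched as follows. The representation of each document as a pair of its own source and the sources it cites, and the name `find_new_sources`, are assumptions made for this sketch.

```python
from collections import defaultdict


def find_new_sources(documents, known_sources, min_citing_sources=2):
    """documents: iterable of (source, cited_sources) pairs.

    Returns sources cited by documents from at least min_citing_sources
    different sources and not already present in known_sources.
    """
    citing = defaultdict(set)
    for source, cited in documents:
        for target in cited:
            if target != source:  # self-citations are ignored
                citing[target].add(source)
    return sorted(
        target
        for target, sources in citing.items()
        if len(sources) >= min_citing_sources and target not in known_sources
    )
```

For example, if documents from `a.example` and `b.example` both cite `c.example`, and `c.example` is not yet a known source, it is returned as a candidate; a source cited repeatedly by a single source only is not.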
Source scoring
[0087] Each document source carries a weighting value determined by a number of factors to reflect the quality of that source, and that weighting value is stored in document source database. A weighting value is generated for both manually curated sources and automatically identified sources.
[0088] The process of applying a source score is described in detail in Figure 3. In step 301, an existing source score is retrieved from document source database 104 or, for a new source, an arbitrary score may initially be applied. A manually curated source may carry a different weighting to an automatically identified source. External quality indicators 304 and the number of times the source has been referenced are used to determine the quality of each source, and from this the weighting value to be applied to documents originating from that source.
[0089] External quality indicators 304 include statistics on how often a source website has been bookmarked or the ranking of the source website (for example in Google, Yahoo BOSS or Alexa results). The external quality indicators for a given source may be general, or may be specific to different markets and / or verticals. These quality indicators are aggregated to generate an external quality value that is applied at step 305 as part of the process of determining a weighting value to the document source.
[0090] Separately, the contents of the classified document store are analysed to determine the number of times each document source has been cited as an indication of the quality of that source.
[0091] If a source is relevant to only one market and / or vertical then only one citation score is applied. However, if a source is relevant to more than one vertical and / or market then different citation scores may be applied based on the market and / or vertical associated with the citing document or documents. The number of citations for a given source is aggregated to generate a citation value that is applied at step 305 as a further part of the process of determining a weighting value for the document source.
[0092] In the process of Figure 3, both a citation value and an external quality value are used in determining a source weighting value; however, it will be appreciated that either of these values could be used alone in determining a weighting value.
[0093] The citation value and external quality value are assessed by establishing a set of scores that fall within the bounds of a Wilson score confidence interval, using the following formula:
$$\frac{p + \dfrac{z_{1-\alpha/2}^{2}}{2n} \pm z_{1-\alpha/2}\sqrt{\dfrac{p(1-p)}{n} + \dfrac{z_{1-\alpha/2}^{2}}{4n^{2}}}}{1 + \dfrac{z_{1-\alpha/2}^{2}}{n}}$$
where p is the frequency of positive quality scores,
$z_{1-\alpha/2}$ is the $1 - \alpha/2$ percentile of a standard normal distribution of scores, and n is the count of the regular scores being considered.
Wilson, E. B. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22: 209-212, 1927
[0094] Scores falling outside these bounds are discarded. The mean of the remaining scores is then stored in the document source database 104 against the document.
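By way of illustration only, the bounds of a Wilson score confidence interval and the discarding of scores outside those bounds may be sketched as follows. It is assumed for this sketch that scores are expressed on a scale from 0 to 1, that p is supplied by the caller, and that a 95% interval (alpha = 0.05) is used; none of these choices is specified by the embodiment, and the function names are hypothetical.

```python
import math
from statistics import NormalDist, mean


def wilson_interval(p, n, alpha=0.05):
    """Lower and upper bounds of the Wilson score interval for proportion p over n scores."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # 1 - alpha/2 percentile of the standard normal
    denominator = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denominator
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denominator
    return centre - half_width, centre + half_width


def filtered_mean(scores, p, alpha=0.05):
    """Discard scores outside the interval and return the mean of the remainder."""
    lower, upper = wilson_interval(p, len(scores), alpha)
    kept = [score for score in scores if lower <= score <= upper]
    return mean(kept) if kept else None
```

For p = 0.5 and n = 100 the interval is approximately 0.404 to 0.596, so that outlying scores of 0.0 or 1.0 in a set otherwise centred on 0.5 would be discarded before the mean is taken.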
[0095] The score for any given source is diminished over time when no new quality indicators become available.
[0096] The scoring process is repeated regularly, and can be repeated continuously to maintain up-to-date scores by taking account of changes to the content of the classified document store 108, in particular to take account of new documents added to the classified document store, and to take account of the effect of time on sources.
Sentiment analysis
[0097] The process of sentiment analysis is illustrated in Figure 4.
[0098] Each document in classified document store 108 is fragmented into language fragments using natural language toolkits. The exact manner of this fragmentation depends on the language of the document.
[0099] The language fragments contain measurable expressions of sentiment. In order to measure this sentiment, the language fragments are first normalised into stemmed text at step 402 and then analysed at step 403 to determine a sentiment associated with each language fragment, using naive Bayesian analysis by means of a classifier such as that discussed by A. McCallum and K. Nigam, "A Comparison of Event Models for Naive Bayes Text Classification", pp. 41-48, 1998 (http://www.kamalnigam.com/papers/multinomial-aaaiws98.pdf), against corpora of positive, negative and neutral sentiment language fragments. Each fragment is tagged with the determined sentiment rating value. The sentiment may be selected from positive, negative or neutral only, or may include further options such as very positive or very negative. Stemming of text includes, in particular, generating word stems. For example, the words "sequins" or "sequinned" may be stemmed to "sequin".
[0100] The fragmentation and stemming may be conducted using software such as that available from LingPipe (http://alias-i.com/lingpipe/) or NLTK (www.nltk.org). Language fragments typically comprise a name or a noun and an associated adjective.
Processing Example
[0101] For the sentences "I love these three looks above available now at Saks Fifth Avenue. When I was in NY last month for Fashion Week, all I saw was the sparkle of sequins down the runway.":
[Figure imgf000012_0001: processing result for the above sentences; image not reproduced]
[0102] The corpora of language fragments used for sentiment analysis include manually scored corpora 404 and, once the document index has been populated, a Monte Carlo selected subset of document index database 110 (described below). This subset is changed regularly. These corpora include, in particular, adjectives tagged with a sentiment rating value.
[0103] The weighting value stored in document source database 104 for each document source is then applied at stage 406, and normalised using a Wilson score confidence interval.
Indexing
[0104] With reference to Figures 1 and 4, each language fragment, tagged with its associated sentiment rating value and its market and / or vertical classifications, and with a link back to the original document, is indexed by indexer 109 and stored in document index database 110 for interrogation using application programming interface (API) 111. Database 110 also provides the automatic training corpora 405 for the Bayesian analyser.
Keyword database
[0105] A list of keywords for each market, vertical and / or detail is stored in keyword database 112. The keywords may include nouns, adjectives (e.g. colours), names (of places, people, brands, musical groups etc.), etc.
[0106] "Keywords" is used herein for simplicity, however for the avoidance of doubt it will be appreciated that this term does not just encompass terms of a single word, but may include terms containing characters other than letters, such as numerals or punctuation symbols, and may contain more than one word (for example, a forename and a surname).
[0107] The keyword database may include defined relationships between keywords, for example a fashion designer's name may be associated with a specific brand name or shop name. Alternatively, the system may search for relationships between keywords. For example, if the occurrence of a keyword in a document is accompanied by the occurrence of another keyword in a percentage of documents in the classified document database that is above a threshold percentage, then the keywords may be identified as being linked.
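By way of illustration only, the search for linked keywords may be sketched as follows. It is assumed for this sketch that each document is represented as a set of the keywords it contains and that the threshold is expressed as a fraction; the name `linked_keywords` is hypothetical.

```python
def linked_keywords(documents, keyword, candidates, threshold=0.8):
    """Return candidates co-occurring with keyword in more than threshold of its documents.

    documents: list of sets, each holding the keywords found in one document.
    """
    with_keyword = [doc for doc in documents if keyword in doc]
    if not with_keyword:
        return []
    linked = []
    for other in candidates:
        if other == keyword:
            continue
        # Share of the keyword's documents in which the other keyword also occurs
        share = sum(other in doc for doc in with_keyword) / len(with_keyword)
        if share > threshold:
            linked.append(other)
    return linked
```

For example, if a designer's name appears in three documents and a brand name accompanies it in two of them, the two keywords are linked at a threshold of 0.5 but not at a threshold of 0.8.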
[0108] The keyword database contains an initial set of keywords that are derived from a curated taxonomy associated with the vertical in question. As documents are processed, nouns (or other keywords, such as names) are automatically identified from the processed documents by means of a pre-defined grammar that can vary by vertical and language. These nouns are added to the keyword database when they are identified a plurality of times and in proximity to existing keywords.
[0109] These keywords are checked against the language fragments stored in document index database 110 to identify language fragments containing a match for the keywords, and this is used in order to generate a keyword report having an output such as shown in Figure 5.
[0110] Figure 5 illustrates the output of an analysis of keywords within a given period that have the most positive associated sentiment (in this case, the 20 most popular keywords in the vertical of womenswear for the topic of detail in the UK market for the period of 24 May - 1 October). A publication date is typically available from the document source. If the publication date is a date in the future then the date of indexing is applied as the publication date. In order to identify documents that may carry an incorrect publication date in the past, the publication date of a document being processed by the system may be compared to the publication date of the previous document from the same source. If the publication date of the document is later than the publication date of the earlier document then an error may be flagged for manual checking and correction.
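By way of illustration only, the substitution of the indexing date for a publication date lying in the future may be sketched as follows. The function name is hypothetical, and the comparison against the previous document from the same source is not shown.

```python
from datetime import date


def effective_publication_date(published, indexed):
    """Use the indexing date when the stated publication date lies in the future."""
    return indexed if published > indexed else published
```

For example, a document indexed on 1 October 2010 but stated to be published on 1 January 2011 would be assigned 1 October 2010 as its publication date.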
[0111] This sentiment score for each keyword is calculated by averaging the sum of the values applied to the relevant language fragments from the sentiment analysis process, taking into account the weighting for the source of each language fragment. The average sentiment for a given keyword may be calculated using the formula:
$$\bar{s} = \frac{C\,m + \sum_{i=1}^{n} w(i)\,s(i)}{C + n}$$
[0112] wherein n is the number of language fragments containing a match with the identified keyword; s(i) is the sentiment score associated with occurrence i of the keyword (for example, +1, -1 or 0 for positive, negative and neutral sentiment respectively); w(i) is the weighting value for the source of occurrence i of the keyword; C is a constant proportional to the size of the result set; and m is the prior mean centring the results around neutral.
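By way of illustration only, a weighted average of this kind may be sketched as a damped (Bayesian) average, in which the constant C and the prior mean m pull keywords with few matching fragments towards neutral. The exact form of the calculation, in particular the denominator C + n, is an assumption made for this sketch on the basis of the definitions of n, s(i), w(i), C and m, and the function name is hypothetical.

```python
def average_sentiment(fragments, C=5.0, m=0.0):
    """fragments: list of (s, w) pairs, the sentiment score and source weighting
    of each language fragment matching the keyword.

    Assumed form: (C * m + sum of w(i) * s(i)) / (C + n).
    """
    n = len(fragments)
    weighted_sum = sum(s * w for s, w in fragments)
    return (C * m + weighted_sum) / (C + n)
```

For example, four fragments with scores +1, +1, -1 and 0 and weightings 1.0, 0.5, 1.0 and 2.0 give, with C = 4 and m = 0, an average of 0.5 / 8 = 0.0625. With no matching fragments the result is simply the prior mean m.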
[0113] In the embodiment of Figure 1, this calculation is performed by a calculation module (not shown) which may be provided between the document index database 110 and the API 111.
[0114] For any given keyword, a chart illustrates the sentiment score associated with the keyword (shown in Figure 5 as "Love", "Hate" and "Neutral"). This chart is generated for any given keyword using the sum of the weighted sentiment values for each sentiment classification.
[0115] It is possible to generate similar output for any combination of a time period, a vertical (shown in Figure 5 as "Market Segments"), a market (shown in Figure 5 as "Regions") and, optionally, a topic within the selected vertical (shown in Figure 5 as "Details").
[0116] This keyword tool provides detailed analysis of current trends within any given market or vertical.
[0117] Any given keyword in the list may be selected to provide more information as illustrated in Figure 6, which shows a search for the keyword "sequins". The results are presented in the form of a bar chart showing the number of hits for the search term for a predefined time period (in this example, hits per week).
[0118] Also provided is the sentiment for each time period, the sentiment having been calculated as described above. The resultant line graph clearly shows variation in trends over time.
[0119] By selecting any given week on the chart it is possible to view the documents that have been identified in that query and that form the basis for the sentiment analysis.
[0120] The bar chart illustrates trends over time, and in particular serves to identify time periods during which interest in the searched term was particularly high.
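By way of illustration only, the grouping of keyword hits into weekly time periods for the bar chart and line graph may be sketched as follows. The use of ISO calendar weeks, the plain mean of weighted sentiment values (rather than the damped average described above) and the name `weekly_report` are assumptions made for this sketch.

```python
from collections import defaultdict
from datetime import date


def weekly_report(hits):
    """hits: iterable of (publication_date, sentiment, weight) for one keyword.

    Returns {(iso_year, iso_week): (number_of_hits, mean_weighted_sentiment)}.
    """
    buckets = defaultdict(list)
    for published, sentiment, weight in hits:
        iso_year, iso_week, _ = published.isocalendar()
        buckets[(iso_year, iso_week)].append(sentiment * weight)
    return {
        period: (len(values), sum(values) / len(values))
        for period, values in sorted(buckets.items())
    }
```

The number of hits per period corresponds to the height of each bar, and the mean sentiment per period to each point of the line graph.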
User query
[0121] In addition to generating reports based on keywords contained within the keyword database, it is also possible for a user to interrogate the classified document database and document index database using query tools 114, for example a search tool for searching any keyword. It will be appreciated that this may be any keyword, and may or may not be the same as a keyword in keyword database 112.
[0122] Keywords may be stemmed in the same way as language fragments described above. For example, a search for the keyword "sequins" may be stemmed to "sequin". However, it will be appreciated that an unstemmed keyword used in a query may be matched to a stemmed language fragment without first stemming the keyword. Thus, the process of matching a keyword to a language fragment in a document may comprise matching a stemmed keyword to a stemmed language fragment, or matching an unstemmed keyword to a stemmed language fragment.
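By way of illustration only, matching of a keyword against a stemmed language fragment may be sketched as follows. The suffix-stripping function `stem` is a deliberately crude, hypothetical stand-in for the stemmer of a natural language toolkit, and the representation of a fragment as a list of stems is an assumption made for this sketch.

```python
def stem(word):
    """Toy suffix stripper; a real system would use a toolkit stemmer instead."""
    word = word.lower()
    for suffix in ("ned", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def fragment_matches(keyword, stemmed_fragment, stem_keyword=True):
    """Match a keyword, stemmed or left unstemmed, against a list of stems."""
    key = stem(keyword) if stem_keyword else keyword.lower()
    return key in stemmed_fragment


# Stemmed form of the fragment "sparkle of sequins"
FRAGMENT = [stem(word) for word in "sparkle of sequins".split()]
```

Here both "sequins" and "sequinned" match the fragment once stemmed to "sequin", whereas the unstemmed keyword "sequins" does not match by exact comparison, which shows why stemming the keyword may be preferable.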
[0123] Moreover, although the embodiment described above relates to stemming of language fragments in documents, it will be appreciated that stemming of language fragments may be omitted, and unstemmed language fragments may be matched to stemmed or unstemmed keywords.
[0124] The output of this interrogation can be in the same form as described above with respect to Figure 6.
Customer API access
[0125] The API may also allow for access to the system to allow information stored in the system's databases to be automatically fed into applications such as demand analysis packages, sales forecasting or data warehousing tools, for example to control ordering or manufacture of goods.
[0126] The present invention can be used in any vertical and any market to identify and summarise trends and opinions within those verticals and markets, for example:
retail demand prediction;
culture and the arts, for example for determining opinion within a market for a particular style of art or uncovering art "movements";
banking, insurance and finance for evaluating opinions about banking news, investment sentiment and interest rates;
government for local or global opinion on policy decisions;
food and agriculture, e.g. for sentiment on genetically modified organisms, organic food, eating habits and sourcing of ingredients;
manufacturing to show consumer trends for design of products;
building and architecture for opinion on buildings and cities.
[0127] The hardware elements, operating systems and programming languages of systems according to the invention are conventional in nature, and it is presumed that those skilled in the art are adequately familiar therewith. Of course, the system functions may be implemented in a distributed fashion on a number of similar platforms, to distribute the processing load.
[0128] Hence, aspects of the invention outlined above may be embodied in programming. Program aspects of the technology may be thought of as "products" or "articles of manufacture" typically in the form of executable code and/or associated data that is carried on or embodied in a type of machine readable medium. "Storage" type media include any or all of the memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunications networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer of the network operator or carrier into the computer platform(s) that serve as the document processing system. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and other various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to tangible "storage" media, terms such as computer or machine "readable medium" refer to any medium that participates in providing instructions to a processor for execution.
[0129] Hence, a machine readable medium may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the document processing. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media can take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer can read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
[0130] Those skilled in the art will recognize that the present teachings are amenable to a variety of modifications and/or enhancements.
[0131] While the foregoing has described what are considered to be the best mode and/or other examples, it is understood that various modifications may be made therein and that the subject matter disclosed herein may be implemented in various forms and examples, and that the teachings may be applied in numerous applications, only some of which have been described herein. It is intended by the following claims to claim any and all applications, modifications and variations that fall within the true scope of the present teachings.

Claims
1. A document processing system comprising:
a document identification module configured to search document sources identified by a document source database;
a rating module configured to apply a rating value to language fragments contained in the identified documents; and
a calculation module configured to determine an average rating value for language fragments containing a match for a selected keyword.
2. A document processing system according to claim 1, wherein the rating module is configured to generate language fragments from text contained within the identified documents for storage in a database.
3. A document processing system according to claim 1 further comprising a document classification module configured to classify documents identified by the document identification module.
4. A document processing system according to claim 2 or 3 wherein each document is classified into one or both of a vertical corresponding to a field of the document, and a geographical market.
5. A document processing system according to claim 4 wherein each document is classified into a vertical and a topic within that vertical.
6. A document processing system according to any one of claims 3-5 wherein classification of the documents is determined using naive Bayesian analysis.
7. A document processing system according to any preceding claim wherein each document is assigned to a time period encompassing a determined publication date of the document and wherein the calculation module is configured to calculate the average rating for each time period.
8. A document processing system according to any preceding claim wherein the language fragments comprise words expressing opinion, and wherein the rating is determined by the expressed opinion.
9. A document processing system according to claim 8 wherein the ratings are determined by analysing each language fragment to determine an opinion expressed in the document selected from a group comprising positive opinion and negative opinion, and applying a rating according to the determined opinion.
10. A document processing system according to claim 9 wherein the group further comprises neutral opinion.
11. A document processing system according to claim 8, 9 or 10 wherein the average rating value relates to opinion of a physical object.
12. A document processing system according to any preceding claim wherein the language fragments are analysed using naive Bayesian analysis to determine the rating value.
13. A document processing system according to any preceding claim further comprising a source scoring module configured to apply a weighting to each rating value.
14. A document processing system according to claim 13 wherein the weighting value is determined by the source of the language fragment.
15. A document processing system according to claim 13 wherein the weighting value is determined, at least in part, by an external quality indicator.
16. A document processing system according to claim 15 wherein external quality indicators are selected from one or more of: a website ranking, and number of times a website has been bookmarked.
17. A document processing system according to claim 14, 15 or 16 wherein documents identified by the document identifier are stored in a document storage database and wherein the weighting value is determined, at least in part, by the number of times the source of the language fragment has been cited by documents stored in the document storage database.
18. A document processing system according to any preceding claim comprising a module for determining the language of the document.
19. A document processing system according to claim 14 wherein the language of the document is determined using an n-gram probabilistic approach.
20. A document processing system according to any preceding claim wherein the keyword is a predetermined keyword stored in a keyword database.
21. A document processing system according to claim 20 wherein the system is configured to identify the keyword associated with the highest average rating value.
22. A document processing system according to claim 21 wherein the system is configured to generate a list of keywords in order of the average rating value associated with each keyword.
23. A document processing system according to any preceding claim wherein the keyword is a keyword selected by a system user.
24. A document processing system according to any preceding claim, the system further comprising a display module for displaying the average rating value on a display screen.
25. A document processing system according to claims 7 and 24 wherein the display module is configured to display, for each time period, the number of language fragments containing the selected keyword.
26. A document processing system according to claims 7 and 24 or 25 wherein the display module is configured to display the average rating for each time period.
27. A document processing system according to claim 24 wherein the display module is configured to display the list according to claim 22.
28. A document processing system according to any preceding claim, the system further comprising an application programming interface for linking the system to a machine.
29. A document processing system according to claim 28 wherein the machine is an automated product ordering or manufacturing control system.
30. A rating module for use in a document processing system, the rating module being configured to:
receive documents identified by a document identification module;
generate language fragments from text contained within the identified documents and apply a rating value to language fragments contained in the identified documents; and
communicate with a document index database for storage of rated language fragments.
31. A method implemented in a document processing system comprising the steps of:
identifying a plurality of documents comprising text;
identifying language fragments in the text of the identified documents;
applying a rating to each language fragment; and
determining an average rating value for language fragments containing a selected keyword.
32. Computer program code which when run on a computer causes the computer to perform the method according to claim 31.
33. A carrier medium carrying computer readable code which when run on a computer causes the computer to perform the method according to claim 31.
34. An article of manufacture comprising:
a machine-readable storage medium, and
executable program instructions embodied in the machine readable storage medium that when executed by a programmable system causes the system to perform the steps recited in claim 1.
35. A computer program product comprising program code for performing the method of claim 31.
36. A computer readable medium recorded with computer readable code arranged to cause a computer to perform the method of claim 31.
37. Computer program code for performing the method of claim 31.
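By way of illustration only, the averaging step recited in claims 1 and 31, together with the per-period averaging of claim 7, the source weighting of claim 13 and the ordered keyword list of claim 22, might be sketched as follows. This is a minimal sketch and not the implementation described in the specification. The Fragment record, the function names, the case-insensitive substring match, the calendar-month time periods and the +1/-1 rating scale are all assumptions introduced for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class Fragment:
    """Hypothetical record for a rated language fragment."""
    text: str
    published: date
    rating: float        # assumed scale: +1 positive, 0 neutral, -1 negative
    weight: float = 1.0  # assumed source weighting (cf. claim 13)


def average_rating(fragments, keyword):
    """Weighted average rating of fragments containing the keyword.

    Matching is a simple case-insensitive substring test (an assumption).
    Returns None when no fragment matches.
    """
    kw = keyword.lower()
    matched = [f for f in fragments if kw in f.text.lower()]
    total_weight = sum(f.weight for f in matched)
    if total_weight == 0:
        return None
    return sum(f.rating * f.weight for f in matched) / total_weight


def average_rating_by_period(fragments, keyword):
    """Average rating per (year, month) period (cf. claim 7).

    Periods with no matching fragment are omitted.
    """
    buckets = defaultdict(list)
    for f in fragments:
        buckets[(f.published.year, f.published.month)].append(f)
    result = {}
    for period, group in sorted(buckets.items()):
        avg = average_rating(group, keyword)
        if avg is not None:
            result[period] = avg
    return result


def rank_keywords(fragments, keywords):
    """Keywords ordered by descending average rating (cf. claim 22)."""
    rated = [(kw, average_rating(fragments, kw)) for kw in keywords]
    rated = [(kw, avg) for kw, avg in rated if avg is not None]
    return sorted(rated, key=lambda pair: pair[1], reverse=True)
```

In this sketch a keyword with no matching fragments is left out of the ranking rather than given a rating of zero, so that an absence of opinion is not confused with a neutral opinion.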
PCT/GB2010/001975 2009-10-26 2010-10-25 Document processing system and method WO2011051656A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB0918741.0 2009-10-26
GBGB0918741.0A GB0918741D0 (en) 2009-10-26 2009-10-26 Document processing system and method

Publications (2)

Publication Number Publication Date
WO2011051656A2 true WO2011051656A2 (en) 2011-05-05
WO2011051656A9 WO2011051656A9 (en) 2012-03-08

Family

ID=41426712

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2010/001975 WO2011051656A2 (en) 2009-10-26 2010-10-25 Document processing system and method

Country Status (2)

Country Link
GB (1) GB0918741D0 (en)
WO (1) WO2011051656A2 (en)

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A. MCCALLUM, K. NIGAM, A COMPARISON OF EVENT MODELS FOR NAIVE BAYES TEXT CLASSIFICATION, 1998, pages 41 - 48, Retrieved from the Internet <URL:http://www.kamalnigam.com/papers/multinomial-aaaiws98.pdf>
W. B. CAVNAR, J. M. TRENKLE, N-GRAM-BASED TEXT CATEGORIZATION, 1994, pages 161 - 175, Retrieved from the Internet <URL:http://www.let.rug.nl/~vannoord/TextCat/textcat.pdf>
WILSON, E. B.: "Probable inference, the law of succession, and statistical inference", JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, vol. 22, 1927, pages 209 - 212

Also Published As

Publication number Publication date
WO2011051656A9 (en) 2012-03-08
GB0918741D0 (en) 2009-12-09

Similar Documents

Publication Publication Date Title
US11663254B2 (en) System and engine for seeded clustering of news events
US9535911B2 (en) Processing a content item with regard to an event
Scaffidi et al. Red Opal: product-feature scoring from reviews
JP5431727B2 (en) Relevance determination method, information collection method, object organization method, and search system
JP4637969B1 (en) Properly understand the intent of web pages and user preferences, and recommend the best information in real time
US10558666B2 (en) Systems and methods for the creation, update and use of models in finding and analyzing content
US20130110839A1 (en) Constructing an analysis of a document
JP2013168186A (en) Review processing method and system
JP2009521750A (en) Analyzing content to determine context and providing relevant content based on context
Vosecky et al. Searching for quality microblog posts: Filtering and ranking based on content analysis and implicit links
US9672269B2 (en) Method and system for automatically identifying related content to an electronic text
JP2008234090A (en) Latest popular information informing program, recording medium, device, and method
US20140006328A1 (en) Method or system for ranking related news predictions
EP2884451A1 (en) Product and content association
CN110175264A (en) Construction method, server and the computer readable storage medium of video user portrait
JP2011108053A (en) System for evaluating news article
CA2956627A1 (en) System and engine for seeded clustering of news events
US20110131213A1 (en) Apparatus and Method for Mining Comment Terms in Documents
CN116431895A (en) Personalized recommendation method and system for safety production knowledge
JP2016197332A (en) Information processing system, information processing method, and computer program
Vasconcelos et al. Popularity dynamics of foursquare micro-reviews
KR101987301B1 (en) Sensibility level yielding system through web data Analysis associated with a stock and a social data and Controlling Method for the Same
WO2011051656A2 (en) Document processing system and method
WO2015125088A1 (en) Document characterization method
Pisal et al. AskUs: An opinion search engine

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10824277

Country of ref document: EP

Kind code of ref document: A2