WO2014000764A1 - A system and method for automatic generation of a reference utility

Info

Publication number: WO2014000764A1
Application number: PCT/EP2012/062269
Authority: WO
Inventors: Walid Magdy
Original assignee: Qatar Foundation; Hoarton, Lloyd
Priority date: 2012-06-25
Filing date: 2012-06-25
Publication date: 2014-01-03

Abstract

A method for automatic generation of a reference utility, the method including the steps of: providing a pattern-matching data source for matching token patterns to question structures; receiving textual information of an information resource; tokenising the textual information to form a tokenised string comprising one or more tokens, the or each token being indicative of a subject matter category of at least part of the textual information; storing the tokenised string in a memory; identifying a question structure by comparing the tokenised string with one or more token patterns provided by the pattern-matching data source; forming a question and corresponding answer based on the identified question structure; and generating a reference utility comprising the formed question and corresponding answer, wherein the answer includes a reference to the textual information in the information resource. The method may further comprise assigning a confidence value to an answer of the reference utility, the confidence value being representative of a confidence that the answer is the correct answer to the corresponding question.

Description

Title: A System and Method for Automatic Generation of a Reference Utility

Background

The present invention relates to a system and method for automatic generation of a reference utility.

Sources of data, whether in printed format or digital format, may contain a vast amount of information. A traditional encyclopaedia may contain thousands of pages of information, and online resources such as collaborative encyclopaedias (e.g. Wikipedia), digital publication libraries (e.g. Google docs, arXiv.org, etc.) and online knowledge forums, provide large amounts of sometimes relatively categorised information.

It is known to generate indices for use for publications via an automated process, which tracks occurrences of specific terms and creates a list of keywords with references to where they might be found in the publication (i.e. an index). A drawback with merely providing an index to a user is that the user must understand some element of the information for which he is searching in order to access the correct term in the index. The user may then be presented with a large number of references which must be accessed in order to view the potentially relevant information within the publication so as to ascertain whether it is relevant or not.

It is also known to summarise information provided in a resource, so as to provide a brief overview of the information available in the resource. However, a user must review the summary, which may be several sentences or paragraphs in length, in order to ascertain whether the resource is relevant. The user is then faced with the problem of locating the relevant information within the resource.

There is, therefore, a desire to overcome one or more of the problems associated with the prior art and create, for example, a reference utility that provides a user or application with access to information relevant to a particular topic, such that a user can access information about a topic based on a very limited prior knowledge of the topic (without compromising the integrity of the data or the detail provided in the information resources).

Embodiments of the present invention seek to ameliorate one or more problems associated with the prior art.

Summary of the invention

An aspect of the present invention provides a method for automatic generation of a reference utility, the method including the steps of: providing a pattern-matching data source for matching token patterns to question structures; receiving textual information of an information resource; tokenising the textual information to form a tokenised string comprising one or more tokens, the or each token being indicative of a subject matter category of at least part of the textual information; storing the tokenised string in a memory; identifying a question structure by comparing the tokenised string with one or more token patterns provided by the pattern-matching data source; forming a question and corresponding answer based on the identified question structure; and generating a reference utility comprising the formed question and corresponding answer, wherein the answer includes a reference to the textual information in the information resource.

The method may further include: creating one or more additional tokenised strings from further textual information of the information resource, the or each additional tokenised string comprising one or more tokens; identifying one or more further question structures by comparing the or each additional tokenised string with one or more token patterns provided by the pattern-matching data source; forming one or more further questions and one or more corresponding further answers based on the or each identified question structure; and generating the reference utility such that the reference utility comprises the or each formed further question and corresponding further answer, wherein the or each further answer includes a reference to the further textual information in the information resource.

The method may further comprise assigning a confidence value to an answer of the reference utility, the confidence value being representative of a confidence that the answer is the correct answer to the corresponding question.

The confidence value may be determined by comparing two or more answers with each other and assigning a higher confidence value to an answer which concurs with at least one other answer, the two or more answers having respective corresponding questions which are similar or identical.

The step of receiving textual information from an information resource may comprise receiving textual information from an information resource located on a storage device that is locally-accessible or accessible over a local or wide area network connection.

The step of receiving textual information from an information resource may comprise receiving textual information from a plurality of information resources. The method may further include comparing first and second sets of questions and answers formed from textual information from respective first and second information resources, and configuring the output questions and answers based on the comparison. Comparing the first and second sets of questions and answers may include comparing answers to duplicate questions, and configuring the output may include indicating, for each answer corresponding to a duplicated question, a confidence rating associated with the respective answer based on the proportion of answers that are identical or very similar to that answer corresponding to the duplicated questions.

Configuring the output may comprise including a reference to a plurality of the information resources that include textual information from which the duplicated question was generated. Configuring the output may comprise presenting the answers to duplicated questions in order of decreasing confidence rating.

Configuring the output may comprise removing references to answers that have a confidence rating that is lower than a predetermined threshold.

Each formed answer may include a hyperlink reference to a section of the information resource comprising the textual information from which its corresponding question was generated. Tokenising the textual information may include comparing each of a plurality of words in the textual information to a set of known categories, and where the word matches a category, assigning a token representing that category to the word. The method may further include receiving a search term from a user, and using the reference utility to search for a question related to the search term.

Another aspect of the present invention provides a system for automatic generation of a reference utility, the system including: a computing device having a processor and a memory; and a storage device; the computing device being configured to perform the method. The system may further include a visual display for displaying an interface to a user and for receiving a search term from the user, such that the input of a search term by the user causes the computing device to output to the interface a formed question and answer corresponding to the search term.

Another aspect of the present invention provides a computer-readable medium storing instructions which when executed to run on a processor cause the processor to perform the steps according to the method.

Another aspect of the present invention may provide a reference utility and information resource, wherein the reference utility includes a question-and-answer index based on textual information of the information resource, the question-and-answer index including one or more questions and one or more respective corresponding answers, the or each answer including a reference to textual information in the information resource relevant to that answer.

The reference utility and information resource may be provided as part of an electronic book.

Brief description of the drawings

Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, of which:

Figure 1 is a schematic diagram of a system according to an embodiment of the invention;

Figure 2 is a flow chart representing an overview of the method according to an embodiment of the invention;

Figure 3 is a flow chart of a method according to an embodiment of the invention; and

Figure 4 is a flow chart of an exemplary method according to an embodiment.

Detailed Description

With reference to Figure 1 of the drawings, a system 100 for automatically generating a reference guide according to embodiments of the invention is shown. The system 100 comprises one or more computing devices 102, each including a processing arrangement 104 with associated memory 106, and at least one of the computing devices 102 may include one or more input devices (not shown).

The one or more computing devices 102 have access to at least one storage device 108 which may or may not be provided locally. For example, a storage device 108 may be made available over a local or wide area network, on a remote server (as in the case of information being hosted or made available online), or it may be an internal component of a computing device 102 such as a hard disk (or other computer readable medium), or may be a plug-and-play device, such as a USB storage device, or the like.

A visual display 110 may be communicatively coupled to the computing device 102 and may provide information to the user to enable documents to be viewed. The visual display 110 may display a user interface which allows a user to interact with and/or control software executing on the processing arrangement 104 of the computing device 102. The user interface may be displayed under the control of the computing device 102.

One or more information resources are provided on at least one storage device 108, either locally or over a network connection (a local area connection, or alternatively an online server accessible via an internet connection (or other wide area network)). The information resources typically comprise computer files including textual information, such as word processor documents or text files (e.g. files with .TXT, .DOC, .ODT file extensions), portable document format (PDF) documents, images comprising text (that are suitable for optical character recognition for extracting textual information from the documents), or web pages (typically in the formats denoted by the .HTML or .XHTML file extensions, for example), or the like. It should be understood that these examples of file-types are non-limiting, and that compatibility with other file-types is envisaged.

The term textual information refers to any information that includes text, comprising words and/or numbers, symbols, punctuation, characters, or the like. The textual information may include one or more graphemes. The term 'words' will be used herein, for the sake of simplicity, to refer to any component of textual information - including numbers, symbols and punctuation. A word may comprise one or more graphemes.

The textual information may be presented in one or more languages - e.g. Arabic, English, French, German, Russian, Cantonese, Japanese, and Spanish to name but a few.

With reference to Figure 2 of the drawings, an aspect of a method of an embodiment involves (in general terms) accessing a storage device 108 and extracting textual information from a text file (or other information resource), analysing the textual information to determine important aspects of the textual information, and outputting a reference utility that comprises one or more portions of the textual information (or information derived from one or more portions of the textual information) provided in the form of a question and associated answer - in other words, the reference utility may include a question-and-answer index. Each of the answers comprises a reference such as a file path, a hyperlink, a file address on a network, or any alternative link or reference to the original textual information in the context of the original information resource. By providing a link or reference to the textual information in its original context, substantially no detail or information is lost when presenting the answer to a user.

In this manner, embodiments of the invention may be used to simplify access to textual information in lengthy documents (or other information resources), in complicated and/or inaccessible information resources, and in information resources such as manuals (or on a larger scale, in a digital library or other collection of information resources) in which it is difficult to navigate to specific pieces of information efficiently. Thus, a question-and-answer index may be formed according to embodiments of the invention, providing links to one or more sections of the original information resource.

In addition to creating a list or index of questions and corresponding answers relating to an information resource, embodiments may provide a search interface (which may be displayed on the visual display 110 and which may be part of the user interface) to allow a user to enter terms or questions of interest, in order to search the available list of questions. At least part of the list or index of questions and answers may be presented to the user through the visual display 110 (e.g. through the user interface) - the at least part of the list or index may comprise search results.

Embodiments may provide a selection of one or more questions relating to a topic of interest identified by a user and display those one or more questions to the user through an interface (which may be part of the user interface) on the visual display 110. In embodiments, auxiliary software may use the question-and-answer index and provide, for example, a voice activated question-and-answer system, a search engine, or a document retrieval system. In this way, information content of an information resource may be tailored in a manner that seeks to allow succinct answers to be provided to a user without sacrificing the factual detail present in the original information resource. This may be particularly important when the information resource is a scientific journal or academic paper, for example. Past systems for summarising an information resource of this nature have suffered from oversimplification of the detail that is necessary to understand the subject-matter in question.

Embodiments may be implemented in relation to an information resource comprising an online manual, for example, to allow a user to access specific topics within the manual without having to navigate a complex conventional index system, and without needing knowledge of particular terminology related to the topic of interest. Access to the question-and-answer index, which may be generated for an online manual by methods according to embodiments of the present invention, may be provided through the user interface and displayed on the visual display 110.

Of course, without knowing a great deal about a subject before formulating a question, the user may not be in a position to know how to ask an appropriate question to generate the answer for which he is searching. Accordingly, the user may be presented with a user interface which allows the user to browse a list or index of questions, and select those that may be relevant to the topic he is researching. The user may choose to follow one or more links to the original information resource, provided in the answers of relevant questions, and in so doing, access other relevant information that is provided on the topic. In embodiments, information is extracted from a plurality of information resources, and a single list of questions and answers is compiled from the data extracted from those resources. In this way, a consensus between resources may be assessed to provide a "confidence rating" in the answers to those questions. For example, if resources do not agree about an answer to a question (i.e. one of the answers is incorrect or the answer is subjective), this may be indicated to a user, and links to alternative answers may be provided to allow the user to assess the information for himself.

If there is a wide divergence between answers provided in different resources, the question and answers may be assigned a low confidence rating, and/or may not be shown to the user to avoid providing misleading information.

In embodiments, the questions and answers may be 'crowd-sourced' from one or more information resources such as online internet forums or newsgroups, for example, which each have a plurality of authors. In this way, questions and answers may be extracted from information submitted by authors (e.g. internet users), and factual accuracy may be estimated by combining and comparing answers to questions extracted from the information. In this way information may be extracted from one such information resource (e.g. an online discussion forum) on a particular subject and each answer may be assigned a confidence rating based on its correlation with similar or identical questions and answers provided on the same, or another, information resource (e.g. a forum).

In embodiments, the information may be extracted from a single information resource (e.g. a single word processor document). The analysis of the resource aims to identify questions and associated answers that are relevant to the most important aspects of the information in the resource. This may be determined by the frequency of terms appearing in the resource, for example.

With reference to Figure 3 of the drawings, methods according to embodiments of the invention are described in more detail. As will be appreciated, the methods disclosed herein may be performed by a computing device which is communicatively coupled to the or each storage device 108. The computing device may be the computing device 102 or may be some other form of computing device. The methods may be embodied in a plurality of instructions stored on a computer readable medium which, when executed by a computing device, cause the computing device to implement the methods. The output of one or more of the methods disclosed herein (e.g. a question-and-answer index) may be stored on one or more of the one or more storage devices 108.

In accordance with an embodiment, an information resource 201 is analysed by a parsing engine comprising two general steps. The first step involves tokenising the input via a tokeniser 202 - for example, a Name Entity Tagger (NET) as depicted. The tokeniser 202 deconstructs a body of information in the information resource 201 into one or more strings of words (each word comprising one or more graphemes).

The tokeniser 202 identifies one or more particular terms appearing in the one or more strings, and 'labels' the or each term with a token belonging to one or more categories.

For example, terms seemingly defining a name of a person or thing, a time or date, a person's occupation, or a location, or the like, can all be labelled as belonging to a corresponding subject matter category of information (e.g. a name, time, occupation, location, and the like) and can be assigned an appropriately named token (i.e. tagged or labelled). This process of tokenising the information results in one or more strings (which may correspond to sentences or paragraphs from the information resource) in which topics or facts that are referred to in the information resource have been 'tagged' with a corresponding token. An example of 'Tagged text' output from the tokeniser 202 is shown in Figure 3. The example output includes strings containing tokens such as (person), (occupation), (location), (time), (date), (organisation) and (country).

The strings have had various terms (a 'term' may comprise one or more words) tagged to indicate that they belong to a particular category. Terms belonging to each category are matched, by the tokeniser 202, against a set of terms stored in a pattern-matching data source, such as a list, in a database, or any other form of data structure (which may be stored on one of the one or more storage devices or elsewhere).
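By way of illustration only, the tokenising step may be sketched as follows. This is a minimal sketch, assuming a small in-memory term list in place of the data source; the names GAZETTEER, YEAR and tokenise, and the simple four-digit year rule used for dates, are hypothetical and are not taken from the embodiments.

```python
import re

# Hypothetical term list standing in for the stored set of known terms.
GAZETTEER = {
    "Albert Einstein": "person",
    "a theoretical physicist": "occupation",
    "Germany": "location",
}
# Assumption: a four-digit year is treated as a (date) token.
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def tokenise(text):
    """Return (term, category) pairs in order of appearance in the text."""
    found = []
    for term, category in GAZETTEER.items():
        for match in re.finditer(re.escape(term), text):
            found.append((match.start(), term, category))
    for match in YEAR.finditer(text):
        found.append((match.start(), match.group(), "date"))
    # Sort by character offset so the tokens follow the sentence order.
    return [(term, category) for _, term, category in sorted(found)]


tagged = tokenise(
    "Albert Einstein was a theoretical physicist who was born in 1879 "
    "in Germany, and died in 1955."
)
```

A practical tokeniser would draw its terms from a far larger data source; the sketch only shows how matched terms may be paired with category tokens.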

In embodiments, terms (which may be a single grapheme) that are not recognised as belonging to a particular category, or recognised as a standard grammatical term, may be identified by searching using a search engine or the like, so as to determine whether the unrecognised term belongs to a category of interest.

Figure 4 shows an example of how a phrase (which is an example of a string) might be tokenised: "Albert Einstein" (person) was "a theoretical physicist" (occupation) who was born in "1879" (date) in "Germany" (location), and died in "1955" (date).

Once information has been tokenised, the second step of the parsing process involves extracting meaning from the tokenised information via pattern-matching 205. This step comprises recognising the structure of the strings that have been extracted from the information on the basis of the token placement within those strings. In embodiments, this step is performed by using finite-state automata to match the strings to regular expressions from a known list of patterns (which may be stored on one or more of the one or more storage devices 108 or elsewhere). In embodiments, pattern-matching 205 is performed using a look-up table to cross-reference known patterns against corresponding question structures. Of course, it should be understood that alternative methods of pattern-matching 205 are also contemplated.

In embodiments, the pattern-matching step 205 includes subdividing a string containing multiple tokens into a plurality of shorter strings. Alternatively or additionally, one or more strings may be created comprising a subset of the words in the original string, so as to isolate particular tokens from one another. For example, the tokenised string shown in Figure 3 may, in embodiments, be subdivided into strings including:

"Albert Einstein" (person) was "a theoretical physicist" (occupation); and "Albert Einstein" (person) was born in "1879" (date).

In this way individual facts may be extracted from a string (which may be a phrase or sentence) containing multiple facts. In embodiments, the step of subdividing a string into multiple strings does not occur literally - the multiple facts may be extracted simply through use of the automata (or other pattern-matching method 205). However, for the purpose of describing the present invention, a subdivision of strings is the simplest and most readily understandable method.
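The extraction of several facts from one tokenised string, without literally subdividing it, may be illustrated with regular expressions. This is a sketch only: the inline notation `<category:term>`, the list FACT_PATTERNS and the function extract_facts are assumptions made for the example and do not appear in the embodiments.

```python
import re

# Hypothetical inline notation for a tokenised string: <category:term>.
TAGGED = (
    "<person:Albert Einstein> was <occupation:a theoretical physicist> "
    "who was born in <date:1879> in <location:Germany>, and died in <date:1955>."
)

# Each pattern isolates one fact from the longer string.
FACT_PATTERNS = [
    ("occupation",
     re.compile(r"<person:(?P<person>[^>]+)> was <occupation:(?P<value>[^>]+)>")),
    ("birth date",
     re.compile(r"<person:(?P<person>[^>]+)>.*?born in <date:(?P<value>[^>]+)>")),
]


def extract_facts(tagged):
    """Return (person, fact name, value) triples found in a tagged string."""
    facts = []
    for name, pattern in FACT_PATTERNS:
        match = pattern.search(tagged)
        if match:
            facts.append((match.group("person"), name, match.group("value")))
    return facts


facts = extract_facts(TAGGED)
```

Each pattern skips over the intervening words and tokens, which has the same effect as forming the shorter strings described above.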

In embodiments, and as shown in Figure 4, a table of token patterns and questions is provided (which may be stored on one or more of the one or more storage devices 108 or elsewhere). A tokenised string is compared to known patterns in a table as part of a step to formulate one or more questions 206. For example, a tokenised string (or a part of a tokenised string) with the form (person) was (occupation), may produce a question with the form "What was the occupation of (person)?", and a corresponding answer "(occupation)". In this way, a tokenised string "Albert Einstein" (person) was "a theoretical physicist" (occupation), may result in the question "What was the occupation of Albert Einstein?" - and answer "a theoretical physicist".
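A look-up table of this kind may be sketched as a dictionary keyed by token pattern. The sketch below is illustrative and assumes that a tokenised string is held as a list of plain words and (term, category) pairs; QUESTION_TABLE and form_question are hypothetical names, and the wording of the second question is an assumption added for the example.

```python
# Token pattern -> (question template, category that supplies the answer).
QUESTION_TABLE = {
    "(person) was (occupation)":
        ("What was the occupation of {person}?", "occupation"),
    # Assumed wording; only the first entry is taken from the description.
    "(person) was born in (date)":
        ("When was {person} born?", "date"),
}


def form_question(tokenised):
    """tokenised: list of plain words (str) or (term, category) tuples."""
    # Reduce the string to its token pattern, e.g. "(person) was (occupation)".
    pattern = " ".join(
        f"({item[1]})" if isinstance(item, tuple) else item for item in tokenised
    )
    entry = QUESTION_TABLE.get(pattern)
    if entry is None:
        return None  # no known question structure for this pattern
    template, answer_category = entry
    slots = {cat: term for term, cat in (i for i in tokenised if isinstance(i, tuple))}
    return template.format(**slots), slots[answer_category]


question, answer = form_question(
    [("Albert Einstein", "person"), "was", ("a theoretical physicist", "occupation")]
)
```

Here `question` is "What was the occupation of Albert Einstein?" and `answer` is "a theoretical physicist", matching the example in the preceding paragraph.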

Once a question has been identified, the question and answer may be matched 207 together (which may include the re-extraction of the answer from the information resource) and a link to the location of the answer within the information resource generated. This may then be stored in a question-and-answer index 208. The index 208 may be stored in one or more of the data storage devices 108 or elsewhere.
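One possible shape for an entry of the index 208 is sketched below. It is an assumption-laden illustration: the IndexEntry class, the make_entry function and the `#char=` fragment used to point at a character offset are all hypothetical, and any file path, hyperlink or other reference could be used instead.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    question: str
    answer: str
    reference: str  # link back to the passage in the information resource


def make_entry(question, answer, resource_path, text):
    """Re-locate the answer in the source text and build a reference to it."""
    offset = text.find(answer)
    # Hypothetical reference format: path plus character offset of the answer.
    reference = f"{resource_path}#char={offset}" if offset >= 0 else resource_path
    return IndexEntry(question, answer, reference)


entry = make_entry(
    "What was the occupation of Albert Einstein?",
    "a theoretical physicist",
    "einstein.txt",
    "Albert Einstein was a theoretical physicist.",
)
```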

In this way, information is extracted from an information resource (such as a passage of text) in order to form a series of questions based on that information. Using this technique, any type of information resource (structured, semi-structured, or plain text, for example) may be transformed into a set of questions and answers, providing direct and specific pieces of information, and seeking to guarantee no loss of any fact from the original information resource.

An automaton may be used to evaluate a string by receiving each word or token from the string as an input, and changing state based on the type of word or token received. In this way, the state is updated as the string is read, until a sentence structure is recognised and a question is output accordingly.
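The state-updating behaviour may be illustrated with a small hand-written transition table. This is a sketch under stated assumptions: the state names, the TRANSITIONS and QUESTIONS tables and the recognise function are hypothetical, and the second question structure is an assumed example.

```python
# (current state, input word or token) -> next state
TRANSITIONS = {
    ("start", "(person)"): "person_seen",
    ("person_seen", "was"): "was_seen",
    ("was_seen", "(occupation)"): "occupation_fact",
    ("was_seen", "born"): "born_seen",
    ("born_seen", "in"): "born_in_seen",
    ("born_in_seen", "(date)"): "birth_fact",
}

# Accepting states and the question structure each one outputs.
QUESTIONS = {
    "occupation_fact": "What was the occupation of (person)?",
    "birth_fact": "When was (person) born?",  # assumed wording
}


def recognise(symbols):
    """Read words/tokens one at a time; return a question structure or None."""
    state = "start"
    for symbol in symbols:
        state = TRANSITIONS.get((state, symbol))
        if state is None:
            return None  # no transition: the structure is not recognised
    return QUESTIONS.get(state)


structure = recognise(["(person)", "was", "(occupation)"])
```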

The output questions and answers may be configured according to a comparison of questions and answers generated from textual data originating from different information resources. For example, duplicate questions may be identified between the sets of questions and answers for an information resource. If a question from a first set of questions is identical or very similar to a question from a second set, then duplicates may be removed or combined. Of course, it may be the case that although duplicate questions appear, the corresponding answers to those questions vary. In that case, the output may include all of the answers generated, or may include a selection of the answers based on the confidence rating associated with each. The answers may be output in an order determined by the confidence ratings, so that the answer associated with the highest confidence rating is presented first.
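The combination of duplicate questions and the ordering of their answers may be sketched as follows. The sketch assumes that the confidence rating of an answer is the proportion of answers to the same question that are identical to it, and that duplicates are detected by comparing lower-cased question text; the function merge_answers and its threshold parameter are hypothetical.

```python
from collections import Counter, defaultdict


def merge_answers(qa_sets, threshold=0.0):
    """qa_sets: iterable of lists of (question, answer, source) tuples."""
    by_question = defaultdict(list)
    for qa_set in qa_sets:
        for question, answer, source in qa_set:
            # Assumption: questions differing only in case/spacing are duplicates.
            by_question[question.strip().lower()].append((answer, source))

    merged = {}
    for question, answers in by_question.items():
        counts = Counter(answer for answer, _ in answers)
        ranked = []
        # most_common() yields answers in order of decreasing agreement.
        for answer, count in counts.most_common():
            confidence = count / len(answers)
            if confidence >= threshold:
                sources = [s for a, s in answers if a == answer]
                ranked.append((answer, confidence, sources))
        merged[question] = ranked
    return merged


result = merge_answers([
    [("When was Albert Einstein born?", "1879", "a.txt")],
    [("When was Albert Einstein born?", "1879", "b.txt")],
    [("when was Albert Einstein born?", "1878", "c.txt")],
])
```

Raising `threshold` removes low-confidence answers from the output, and each retained answer keeps references to every resource that supports it.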

In embodiments, each answer from the output list of questions and answers includes a link or reference to the original information resource. The link may be a hyperlink embedded in the answer. The link may direct a user to the particular sentence, or passage in the original information resource, from which the question was derived. A set of questions and answers formed in this way may be used as a 'Q-index' (i.e. a question-and-answer list forming an index at the end of a document), an instructor guide (e.g. for inclusion in a guide manual), or as a source for questions (for a competition, examination, or the like) on one or more particular information resources.

In embodiments a table of token patterns and question structures is created manually. A set of token patterns and corresponding question structures is chosen by an expert or program designer (e.g. a programmer, or an author, historian, or the like), according to the type of information present in the information resource. If a sufficiently large table of token patterns is provided, the table may, theoretically, be usable in relation to any information resource. Of course, it should be understood that different languages adopt different sentence structures and grammars, so in practical terms different tables may be required to recognise patterns in information resources written in different languages. Translation tools may be used to translate an information resource prior to application of the method of the present invention. In embodiments, a table of token patterns and questions may be generated automatically through a process of supervised machine learning. In this case, a training set of tokenised strings and corresponding question structures is used to train an algorithm (by constructing an appropriate finite-state automaton or a corresponding look-up table based on the outputs of the trained automaton, for example) to model the relationship between patterns of tokens in strings, and associated question structures. A pruning stage may be applied to the resulting table, to remove infrequent patterns so as to keep the size of the table relatively small (thereby improving the efficiency of the pattern-matching step).
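The pruning stage may be illustrated in isolation. The sketch below substitutes simple frequency counting for the supervised learning described above and is not a training algorithm; build_pattern_table, its min_count parameter and the example training pairs are assumptions made for the illustration.

```python
from collections import Counter


def build_pattern_table(training_pairs, min_count=2):
    """training_pairs: iterable of (token_pattern, question_structure) pairs."""
    pair_counts = Counter(training_pairs)

    # Total number of training examples seen for each token pattern.
    pattern_totals = Counter()
    for (pattern, _), n in pair_counts.items():
        pattern_totals[pattern] += n

    table = {}
    for (pattern, question), _ in pair_counts.most_common():
        # Pruning: drop patterns seen fewer than min_count times, and keep
        # only the most frequent question structure for each pattern.
        if pattern_totals[pattern] >= min_count and pattern not in table:
            table[pattern] = question
    return table


table = build_pattern_table([
    ("(person) was (occupation)", "What was the occupation of (person)?"),
    ("(person) was (occupation)", "What was the occupation of (person)?"),
    ("(person) was born in (date)", "When was (person) born?"),
])
```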

The questions and answers output by the system may be ordered or searchable by question type. For example, questions relating to what, where, when, who, and why (among others) may each be grouped together to simplify the searching procedure by which a user may locate a question.
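Grouping by question type may be sketched by inspecting the first word of each question. This is a minimal illustration; QUESTION_WORDS and group_by_type are hypothetical names, and questions that do not begin with one of the listed words are placed in an assumed "other" group.

```python
from collections import defaultdict

QUESTION_WORDS = ("what", "where", "when", "who", "why")


def group_by_type(questions):
    """Group questions by their leading question word."""
    groups = defaultdict(list)
    for question in questions:
        words = question.split()
        first = words[0].lower() if words else ""
        groups[first if first in QUESTION_WORDS else "other"].append(question)
    return dict(groups)


groups = group_by_type([
    "What was the occupation of Albert Einstein?",
    "When was Albert Einstein born?",
    "What is a reference utility?",
    "How is the index stored?",
])
```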

In embodiments, the tokeniser 202 includes a part of speech tagger (POS) instead of, or in addition to, the NET. A POS tagger may be used to indicate the presence of words from various parts of speech, such as common nouns, pronouns, adjectives and verbs, for example. Information regarding the grammatical significance of words present in a string may provide greater flexibility and expressive freedom when formulating questions and answers. It should be understood that other forms of text parser may be used.

In embodiments in which the information resource comprises pre-tokenised text, a tokeniser 202 is not required. It may be the case that an author of a document includes tokens in the original source text. For example, a webpage may include one or more styles or attributes that are used to tag particular information in the document. The document may then be parsed by applying the pattern-matching step on the basis of the tokens included in the original strings, rather than performing a separate step to insert tokens into the strings.
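Reading tokens that an author has already placed in a web page may be sketched with the Python standard library HTML parser. The `data-token` attribute on `span` elements is a hypothetical tagging convention chosen for the example, as are the names TokenCollector and read_pretokenised; the embodiments do not prescribe any particular attribute.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Collects (term, category) pairs from <span data-token="..."> elements."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._category = None  # category of the span currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._category = dict(attrs).get("data-token")

    def handle_endtag(self, tag):
        if tag == "span":
            self._category = None

    def handle_data(self, data):
        if self._category and data.strip():
            self.tokens.append((data.strip(), self._category))


def read_pretokenised(html):
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens


tokens = read_pretokenised(
    '<p><span data-token="person">Albert Einstein</span> was '
    '<span data-token="occupation">a theoretical physicist</span>.</p>'
)
```

The resulting pairs can be passed directly to the pattern-matching step, with no separate tokeniser involved.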

In embodiments, confidence in the information provided by an information resource may be assessed by comparing the answers to questions generated using the method of the present invention, to those generated in relation to similar information resources. Where multiple information resources have been used to generate questions and answers, a comparison of questions created for each resource may be carried out. This may be performed by pattern-matching tokens in the strings from which questions were generated, for example. In the case where the subject-matter is factual, rather than subjective opinion, and identical questions are produced, the answers from each resource should be identical. If this is not found to be the case, a comparison with further information resources may be carried out. In this way, a confidence in each answer may be obtained, to give a user a guide as to how reliable the resource is. In embodiments, an answer that disagrees with one or more other information resources may flag a resource (or the particular answer) as being unreliable, and the question and/or answer may be omitted from the list of questions and answers output by the system.

Systems and methods as described may be used by instructors teaching students, by enabling automatic generation of questions (for setting examinations, creating learning guides, or otherwise testing knowledge on a topic). The system may provide an automated method of producing FAQ sections for manuals, or for online tutorials, or the like.

As will be appreciated, the system 100 may also comprise the user interface that allows a user to input a question, or choose a question from a list of generated questions. The system may provide a searchable interface to allow a user to access questions on a topic, and follow links or references in the answers to those questions, to enable the user to locate and/or view the original information resource(s) from which the question and answer were generated.
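
One way such a searchable interface might look up questions is sketched below. It assumes a simple keyword-overlap ranking, which is an illustrative choice rather than the method of the specification; the `Entry` record, its `reference` field and the `search` function are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass
class Entry:
    question: str
    answer: str
    reference: str  # hypothetical locator, e.g. a link into the resource


def words(text):
    """Split text into a set of lower-case words."""
    return set(re.findall(r"\w+", text.lower()))


def search(index, term):
    """Return entries whose question shares words with the search term,
    ordered by the number of shared words (most first)."""
    wanted = words(term)
    hits = []
    for entry in index:
        overlap = len(wanted & words(entry.question))
        if overlap:
            hits.append((overlap, entry))
    # sort() is stable, so equally ranked entries keep their index order.
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return [entry for _, entry in hits]
```

Each returned entry carries its reference, so the interface can offer a link back to the part of the information resource from which the question and answer were generated.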

A question-and-answer index generated according to embodiments of the present invention may be part of a resource utility. The resource utility may include the user interface and one or more other programs and/or information stored on one or more of the storage devices 108.

As will be appreciated, a question-and-answer index may be generated according to methods disclosed herein before the actual information resource is accessed by a user. In other words, the question-and-answer index may be pre-prepared and may be supplied with the information resource. The information resource may be an electronic book and the question-and-answer index may be supplied with the electronic book. The electronic book (or other information resource) may be supplied by a first provider and the question-and-answer index may be supplied by a second provider distinct from the first provider.

As will be appreciated, a question-and-answer index may be generated as a result of a user accessing an information resource. The generation of the index may occur locally to the user or at a remote location (e.g. at the origin of the information resource).

In embodiments, a translation module is provided which is configured such that information resources in a plurality of languages can be used to generate a question-and-answer index in a single language.

When used in this specification and claims, the terms "comprises" and "comprising" and variations thereof mean that the specified features, steps or integers are included. The terms are not to be interpreted to exclude the presence of other features, steps or components. The features disclosed in the foregoing description, or the following claims, or the accompanying drawings, expressed in their specific forms or in terms of a means for performing the disclosed function, or a method or process for attaining the disclosed result, as appropriate, may, separately, or in any combination of such features, be utilised for realising the invention in diverse forms thereof.

Claims

1. A method for automatic generation of a reference utility, the method including the steps of:

providing a pattern-matching data source for matching token patterns to question structures;

receiving textual information of an information resource;

tokenising the textual information to form a tokenised string comprising one or more tokens, the or each token being indicative of a subject matter category of at least part of the textual information;

storing the tokenised string in a memory;

identifying a question structure by comparing the tokenised string with one or more token patterns provided by the pattern-matching data source;

forming a question and corresponding answer based on the identified question structure; and

generating a reference utility comprising the formed question and corresponding answer, wherein the answer includes a reference to the textual information in the information resource.

2. A method according to claim 1, further including:

creating one or more additional tokenised strings from further textual information of the information resource, the or each additional tokenised string comprising one or more tokens;

identifying one or more further question structures by comparing the or each additional tokenised string with one or more token patterns provided by the pattern-matching data source;

forming one or more further questions and one or more corresponding further answers based on the or each identified question structure; and

generating the reference utility such that the reference utility comprises the or each formed further question and corresponding further answer, wherein the or each further answer includes a reference to the further textual information in the information resource.

3. A method according to claim 1 or 2, further comprising assigning a confidence value to an answer of the reference utility, the confidence value being representative of a confidence that the answer is the correct answer to the corresponding question.

4. A method according to claim 3, wherein the confidence value is determined by comparing two or more answers with each other and assigning a higher confidence value to an answer which concurs with at least one other answer, the two or more answers having respective corresponding questions which are similar or identical.

5. A method according to any preceding claim, wherein the step of receiving textual information from an information resource comprises receiving textual information from an information resource located on a storage device that is locally-accessible or accessible over a local or wide area network connection.

6. A method according to any one of the preceding claims, wherein the step of receiving textual information from an information resource comprises receiving textual information from a plurality of information resources.

7. A method according to claim 6, further including comparing first and second sets of questions and answers formed from textual information from respective first and second information resources, and configuring the output questions and answers based on the comparison.

8. A method according to claim 7 wherein comparing the first and second sets of questions and answers includes comparing answers to duplicate questions, and configuring the output includes indicating, for each answer corresponding to a duplicated question, a confidence rating associated with the respective answer based on the proportion of answers that are identical or very similar to that answer corresponding to the duplicated questions.

9. A method according to claim 8, wherein configuring the output comprises including a reference to a plurality of the information resources that include textual information from which the duplicated question was generated.

10. A method according to claim 9, wherein configuring the output comprises presenting the answers to duplicated questions in order of decreasing confidence rating.

11. A method according to claim 9 or 10, wherein configuring the output comprises removing references to answers that have a confidence rating that is lower than a predetermined threshold.

12. A method according to any preceding claim, wherein each formed answer includes a hyperlink reference to a section of the information resource comprising the textual information from which its corresponding question was generated.

13. A method according to any preceding claim, wherein tokenising the textual information includes comparing each of a plurality of words in the textual information to a set of known categories, and where the word matches a category, assigning a token representing that category to the word.

14. A method according to any preceding claim, further including receiving a search term from a user, and using the reference utility to search for a question related to the search term.

15. A system for automatic generation of a reference utility, the system including:

a computing device having a processor and a memory; and

a storage device;

the computing device being configured to:

perform the method according to any preceding claim.

16. A system according to claim 15, further including a visual display for displaying an interface to a user, and to receive a search term from a user, such that the input of a search term by the user causes the computing device to output to the interface a formed question and answer corresponding to the search term.

17. A computer-readable medium storing instructions which when executed to run on a processor cause the processor to perform the steps according to the method of any one of claims 1 to 14.

18. A reference utility and information resource, wherein the reference utility includes a question-and-answer index based on textual information of the information resource, the question-and-answer index including one or more questions and one or more respective corresponding answers, the one or each answer including a reference to textual information in the information resource relevant to that answer.

19. A reference utility and information resource according to claim 18, wherein the reference utility and information resource are provided as part of an electronic book.