GB2391647A - Generating a List of Terms and a Thesaurus from Input Terms - Google Patents


Info

Publication number
GB2391647A
GB2391647A
Authority
GB
United Kingdom
Prior art keywords
list
terms
term
score
relatedness
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0218296A
Other versions
GB0218296D0 (en)
Inventor
Philip Glenny Edmonds
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sharp Corp
Original Assignee
Sharp Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sharp Corp filed Critical Sharp Corp
Priority to GB0218296A priority Critical patent/GB2391647A/en
Publication of GB0218296D0 publication Critical patent/GB0218296D0/en
Priority to JP2003277917A priority patent/JP2004070950A/en
Publication of GB2391647A publication Critical patent/GB2391647A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms

Abstract

A list of terms ranked in accordance with their relatedness to an input term is generated as follows. A list of terms related, for example by meaning, and given scores indicating their strength of relatedness is retrieved by accessing a first resource with the input term. At least one further list of scored terms is retrieved by accessing at least one further resource with the input term. One or more lists of unscored terms may be retrieved by accessing other resources with the input term. Terms which are unscored are given arbitrary scores. Each returned list is merged into a single output list and each term in the single list is then assigned a score, for example based on the weighted average of all the scores for the term. Certain terms may be filtered from the output list. A thesaurus may be generated by processing many terms in this manner.

Description

METHOD OF AND APPARATUS FOR GENERATING A LIST OF TERMS
AND A THESAURUS
The present invention relates to a method of and an apparatus for generating a list of terms. The present invention also relates to a thesaurus generated by such a method.
A thesaurus is often thought to be a dictionary of synonyms; however, in its most general sense (e.g., Roget's International Thesaurus), a thesaurus is actually a dictionary of related words or phrases. Words or phrases (referred to herein as terms) can be related by having a similar or different meaning (e.g., synonyms/antonyms), by being in various semantic relations (such as 'engine' is part of a 'car'), by being used in similar contexts, by co-occurrence with other terms, by pronunciation (as in rhymes), and so on.
Traditionally, thesauri have been built by hand by trained lexicographers, perhaps with the help of automated tools. However, over the last few years various methods have been proposed for automatically constructing thesauri. Each of these automatic methods focuses on one type of relation between words. The benefit of the automatic methods is that the individual relations are often scored by strength of relatedness.
Thus, for a given term, a scored and sorted list of terms related to it can be generated easily. US Patent 5,675,819 builds a thesaurus by first constructing a vector for every word type in a text. Each vector is in a multidimensional vector-space of the word types. The values of the vector on each dimension are frequency of co-occurrence with the word type of that dimension. Singular value decomposition (SVD) is used to reduce the dimensionality of the space. The similarity of two words is computed as the cosine of their two vectors. Embodiments are given for use in query expansion in IR (information retrieval). US Patent 6,098,033 determines similarity by looking at path patterns between the words in a semantic network. A training phase looks at path patterns between known synonyms in the network and 'learns' which are the most indicative paths for similarity of meaning. US 6,070,134 shows how to identify the salient paths.
US Patent 5,652,898 discloses a method of storing a network of words (nodes) and associations (edges) between words. Words have a degree of activity based on association with other words and a degree of likelihood based on expectation of occurrence. After setting an initial activity level based on occurrence and co-occurrence information, activation spreads between the nodes, adjusting all of the activation states until a state of equilibrium is reached.
US Patent 5,926,811 discloses a method of query expansion that uses a purpose-built "statistical thesaurus". The thesaurus contains a record for each term that consists of other terms related to it by co-occurrence in a particular document collection. The related terms are grouped into five sets according to how strongly related they are. Pre-existing methods are used for determining strength of co-occurrence. Each document collection has its own thesaurus. The technique includes a method for efficiently building a ranked list of expansion terms from the various thesauri using a parallelisable algorithm. The method relies on the fact that the thesauri are each built using the same techniques for determining relatedness.
Grefenstette, Gregory. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Press, Boston, MA and Lin, Dekang. 1998. Automatic Retrieval and Clustering of Similar Words. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics (COLING-ACL '98), pp. 768-773 also compute term similarity based on similarity of the other terms that co-occur with the two terms in text.
Resnik, Philip. 1999. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. In Journal of Artificial Intelligence Research, vol. 11, pp. 95-130 discloses a method which computes term similarity using path distance in a semantic network combined with other corpus-based metrics.
According to a first aspect of the invention, there is provided a method of generating a list of terms from an input term, comprising: (a) retrieving, from a first resource containing terms scored by strength of relatedness to each other, a first list of terms scored by strength of relatedness to the input term; (b) retrieving, from at least one further resource, at least one further list of terms, to each term of which is assigned a score; (c) forming an output list of terms comprising at least some of the terms of the first list; and (d) assigning to each term in the output list a composite score as a function of the or each score of the term in the first and further lists.
The step (b) may comprise retrieving, from at least one second resource containing terms scored by strength of relatedness to each other, at least one second list of terms scored by strength of relatedness to the input term, and the step (c) may comprise merging the at least one second list with the first list.
The step (b) may comprise retrieving, from at least one third resource containing unscored terms, at least one third list of terms related to the input term and assigning a predetermined arbitrary score to each term of the at least one third list, and the step (c) may comprise merging the at least one third list with the first list.
The step (b) may comprise retrieving, from at least one fourth resource containing terms scored by a factor other than relatedness to each other, at least one fourth list of terms.
Relatedness may be relatedness by meaning.
The function may be a weighted average.
The scores from the first and, when present, the second and fourth lists may be normalised to a common range before the composite scores are assigned. Each normalised score may be calculated as:
(score - min) / (max - min)
where max and min are the highest and lowest scores, respectively, in the list containing the term. As an alternative, each normalised score may be calculated using the same expression, where max and min are the upper and lower limits, respectively, of the range of scores of the list containing the term. The predetermined arbitrary score may be the upper limit of the common range.
The method may comprise removing from the output list terms which are present in at least one filter list. The at least one filter list may contain at least one antonym of the at least one input term.
The first and, when present, the second and third resources may be indexed by terms.
The first and, when present, the second and third resources may comprise at least one thesaurus. The first and, when present, the second and third resources may comprise at least one of a co-occurrence list, a similarity list, a semantic network, a word association list and a dictionary. The method may comprise repeating the steps (a) to (d) for each of a plurality of input terms. The method may comprise the further step of merging the output lists to form a composite output list. The method may comprise the further step of summing the composite scores of any term occurring in more than one of the output lists. The method may comprise generating a thesaurus.
According to a second aspect of the invention, there is provided a thesaurus generated by a method according to the first aspect of the invention.
According to a third aspect of the invention, there is provided a computer program for programming a computer to perform a method according to the first aspect of the invention.
According to a fourth aspect of the invention, there is provided a storage medium containing a program according to the third aspect of the invention.
According to a fifth aspect of the invention, there is provided a computer programmed by a program according to the third aspect of the invention.
According to a sixth aspect of the invention, there is provided an apparatus for performing a method according to the first aspect of the invention.
It is thus possible to provide a technique which generates lists of related terms from various sources with consistent scoring of the terms. Resources which are not in themselves scored can nevertheless be used. The scores of terms occurring in more than one of the lists can be combined in a useful and meaningful way. It is also possible to make use of scores based on lists of terms scored by measures such as frequency of usage or level of concreteness to provide better quality scores for general relatedness.
It is also possible to ignore terms from relatively long lists, for example containing thousands of terms. In such lists, many of the terms with scores below a threshold dependent on the individual type of list are not related in any apparent or useful way. It is possible to make use only of the most strongly related terms with less related terms being ignored as errors or a form of noise.
The invention will be further described by way of example, with reference to the accompanying drawing, which is a schematic flow diagram of a method constituting an embodiment of the invention.
A method is provided for generating a list of terms related to an input term in which each related term is assigned a numerical strength of relatedness to the input term. By way of example only, "relatedness" in the following description is assumed to be relatedness by meaning. However, other types of relatedness may alternatively or additionally be used, such as relatedness by pronunciation (e.g. rhymes) and relatedness by association (e.g. that "turkey" is something which is eaten at "Christmas").
The method generates the list by combining (or merging) any number of other existing lists of at least three types:
Scored relatedness lists: Terms that are related by meaning to the input term, with scores indicating strength of relation.
Unscored relatedness lists: Terms that are related by meaning to the input term, without scores.
Term lists: Terms that are not necessarily related to the input term by meaning, but can influence the final list of related terms, with scores of some kind.
A strength of relatedness for each term in the merged list is computed as the weighted average of the term's scores in the individual lists.
An optional final step removes all terms from the merged list that occur in any number of filter lists. For example, a filter list might contain antonyms of the input word.
These antonyms would then be filtered out of the merged list.
Scored relatedness lists include, but are not limited to:
Terms that co-occur with the input word in a large corpus of natural language. Mutual information, t-scores, or other significance scores can be assigned.
Terms that are used in similar contexts in a large corpus. Many similarity metrics are available in the literature to assign strength of similarity.
Terms that are related to the input word in a pre-built semantic network. A resource such as WordNet can be used. Relations include synonyms,
hypernyms, etc. Scores might be computed by, for instance, path distance within the semantic network.
Terms from a psychologically-based word association thesaurus.
Unscored relatedness lists include, but are not limited to:
Terms that are related to the input word in a pre-built semantic network. A resource such as WordNet can be used. Relations include synonyms, hypernyms, etc.
Terms that are synonyms or antonyms according to a thesaurus.
Terms that occur in the dictionary entry of a word, whether in the definition, in exemplar sentences, or in other fields. Scores might not be available.
Term lists (non-related) include, but are not limited to:
Terms ranked by various factors such as concreteness, familiarity, etc.
Terms ranked by their frequency of usage in a particular corpus of texts.
This method may also be used to take several terms as input and output a list of terms related to all of the input terms. It could do this by generating separate lists for each of the input terms and then joining the lists into a single list.
A more detailed example of such a technique is illustrated in the accompanying drawing. In a step 1, an input term is supplied. A term may be a word, a phrase, a set of words or the like.
Following supply of the input term, the drawing illustrates three branches being performed simultaneously. However, in practice, these branches may be performed one after the other.
In a step 2, an electronically readable resource is accessed by the input term so as to obtain scored lists of words which are related to the input term. The resources may be in any appropriate form of memory which is readable by a computer and a typical example is a CD ROM. Alternatively, a computer may access a remote resource, for
example via the internet, in order to obtain the scored lists. Such resources include interfaces which receive the input term and which index the resources or search them in any suitable way so as to return the scored lists. The terms in the lists may be related to the input term in different ways but, in general, the returned lists of terms are related by meaning and examples are given hereinbefore.
In a step 3, the scores of the terms in the returned lists are examined and words having a score below a threshold are removed from the lists. The scores are then normalised to a range from 0.0 to 1.0 by applying any suitable normalising function. Different functions may be used for different relatedness lists. For example, a normalised score for each term of each list may be obtained by calculating:
(score - min) / (max - min)
where min and max may be the minimum and maximum scores in the list or the theoretical minimum and maximum values for the list.
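By way of illustration only, the thresholding and normalisation of the step 3 might be sketched as follows; the function name and the threshold value are assumptions chosen for the example, not part of the disclosure.

```python
def threshold_and_normalise(scored_terms, threshold=0.05):
    """Drop weakly related terms, then min-max normalise the survivors
    to the range 0.0 to 1.0 (step 3).

    scored_terms: dict mapping each term to its raw relatedness score.
    threshold: illustrative cut-off; as noted above, a real value would
    depend on the individual type of list.
    """
    kept = {t: s for t, s in scored_terms.items() if s >= threshold}
    if not kept:
        return {}
    lo, hi = min(kept.values()), max(kept.values())
    if hi == lo:
        # Degenerate list with identical scores: map to the top of the range.
        return {t: 1.0 for t in kept}
    return {t: (s - lo) / (hi - lo) for t, s in kept.items()}
```

The theoretical minimum and maximum of the resource could be substituted for `lo` and `hi` to realise the alternative normalisation described above.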
In a step 4, other electronically readable resources are accessed or interrogated by the input term to return lists of unscored words related to the input term. The step 4 is effectively similar to the step 2 except that the lists of related terms in the resources are not scored. A subsequent step 5 allocates a score of 1.0 to all of the terms in the lists returned in the step 4.
In a step 6, further electronically readable resources are accessed so as to return scored wordlists. These resources are not interrogated by or on the basis of the input term.
Instead, all of the terms in these resources are returned or made available. Also, the scores of the terms in these resources are not related to each other but are related to some factor associated with the individual terms. For example, each score may depend only on frequency of usage of the terms, concreteness of meaning, formality or some other stylistic factor.
In a step 7, the scores are normalised, for example as described hereinbefore with respect to the step 3. Because the wordlists returned in the step 6 are independent of the input term, the steps 6 and 7 need only be performed once and the results of these steps are then available for any number of supplied input terms.
The steps 2 to 7 result in the scores of all of the returned terms being in a common range. In a step 8, the scored lists returned by the steps 2 to 5 are merged into a single list, for example by using a set union operation such that, if a term occurs in any of the lists, it is inserted in the merged list. If a term occurs in more than one of the lists, it is inserted only once in the merged list. Alternatively, a set intersection operation may be used such that a term appears in the merged list only if it is present in all of the returned lists. A step 9 is then performed so as to apply one or more filter lists to the merged list. In particular, if any term in the filter lists occurs in the merged list, the term is deleted from the merged list.
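By way of illustration only, the set-union merge of the step 8 and the filtering of the step 9 might be sketched as follows; the names are illustrative assumptions, and the result collects each term's scores ready for the averaging of the step 10.

```python
def merge_and_filter(lists, filter_terms=frozenset()):
    """Merge several normalised term lists by set union (step 8) and
    remove any term present in a filter list (step 9).

    lists: iterable of dicts mapping term -> normalised score.
    Returns a dict mapping each surviving term to the list of scores it
    received from the contributing lists.
    """
    merged = {}
    for scored in lists:
        for term, score in scored.items():
            merged.setdefault(term, []).append(score)
    # Filtering: delete, for example, antonyms of the input term.
    for term in filter_terms:
        merged.pop(term, None)
    return merged
```

A set intersection variant would instead keep only those terms appearing in every contributing list.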
In a step 10, scores for the terms in the merged list are computed as a weighted average.
In particular, for each term, its score for the merged list is computed by multiplying the score of the word in each list by a weighting coefficient and adding all of these products together. If a word does not appear in any list, it is assigned (implicitly or explicitly) a score of 0 for that list. The weights used in the averaging computation may be assigned following experimentation or by any other technique appropriate to provide each term with a score indicating its strength of relatedness to the input word. Finally, the merged list with the composite scores is output in a step 11 with optional sorting of the terms in accordance with the composite scores.
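The weighted averaging of the step 10 might be illustrated as follows, by way of example only; the weights shown in the usage are arbitrary placeholders, and a term absent from a list implicitly contributes a score of 0 for that list.

```python
def composite_scores(lists, weights):
    """Compute the composite score of each term as a weighted average
    over all contributing lists (step 10).

    lists: sequence of dicts mapping term -> normalised score.
    weights: one positive weighting coefficient per list.
    """
    terms = set().union(*lists) if lists else set()
    result = {}
    for term in terms:
        # A term missing from a list contributes 0 for that list.
        total = sum(w * scored.get(term, 0.0)
                    for scored, w in zip(lists, weights))
        result[term] = total / sum(weights)
    return result
```

Sorting `result.items()` by score then yields the ranked output list of the step 11.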
As described above, this embodiment of the invention is used as a component that generates sorted and scored lists of terms related to an input term. Such a technique may also be used to generate an entire thesaurus. Such a thesaurus could be given a user interface so that people can browse it, or a machine interface so that it can be used as a component in other applications.
This technique may also be repeated for a plurality of input terms and the resulting merged lists may be combined into a single list. The scores of any term in more than one of the merged lists may be summed to give the score for the term in the combined single list.
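By way of illustration, the combination over several input terms might look like the sketch below, assuming a hypothetical helper that yields the merged scored list for one input term (the helper and names are assumptions, not from the disclosure).

```python
def related_to_all(input_terms, related_terms):
    """Generate one combined list for several input terms, summing the
    composite score of any term appearing in more than one of the
    per-term merged lists.

    related_terms: callable mapping an input term to a dict of
    term -> composite score (a stand-in for the full method above).
    """
    combined = {}
    for t in input_terms:
        for term, score in related_terms(t).items():
            combined[term] = combined.get(term, 0.0) + score
    return combined
```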
This technique has applications in query expansion for information retrieval, particularly in multimedia search engines and in enhanced and multimedia short message services (EMS and MMS). The technique may provide a core component for query expansion in a search engine. For instance, starting from a short query, this method can be used to expand the query with terms associated with the query words. The expanded query will be more apt to match relevant documents in a collection. In fact, it is especially useful for short text queries for images in an image collection in which the annotations or captions are often short and basic. By expanding the initial query, one is more likely to find annotations that match the expanded query. This method is better than using any one particular type of relatedness because it draws in more related words. The amount that each type of relation contributes can be adjusted by changing the weights. Query expansion techniques are disclosed, for example, in our copending patent application bearing reference number P52092GB.
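An illustrative sketch of such query expansion follows; the cut-off k and the helper supplying related terms are assumptions chosen for the example.

```python
def expand_query(query_words, related_terms, k=3):
    """Expand a short query by appending, for each query word, its k
    most strongly related terms, ranked by composite score.

    related_terms: callable mapping a word to a dict of
    term -> composite score (a stand-in for the method above).
    """
    expanded = list(query_words)
    for word in query_words:
        ranked = sorted(related_terms(word).items(),
                        key=lambda pair: pair[1], reverse=True)
        # Append the top-k related terms, skipping duplicates.
        expanded.extend(term for term, _ in ranked[:k]
                        if term not in expanded)
    return expanded
```

The expanded word list would then be submitted to the retrieval engine in place of the original short query.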
Other applications of word relatedness are in the area of natural language processing.
Knowledge of word relationships can be used in at least: 1) detecting the semantic cohesion of texts in order to segment the texts automatically into meaningful units; 2) building semantic representations of texts; and 3) text summarisation.

Claims (24)

CLAIMS:
  1. A method of generating a list of terms from an input term, comprising the steps of: (a) retrieving, from a first resource containing terms scored by strength of relatedness to each other, a first list of terms scored by strength of relatedness to the input term; (b) retrieving from at least one further resource at least one further list of terms, to each term of which is assigned a score; (c) forming an output list of terms comprising at least some of the terms of the first list; and (d) assigning to each term in the output list a composite score as a function of the or each score of the term in the first and further lists.
  2. A method as claimed in claim 1, in which the step (b) comprises retrieving, from at least one second resource containing terms scored by strength of relatedness to each other, at least one second list of terms scored by strength of relatedness to the input term, and the step (c) comprises merging the at least one second list with the first list.
  3. A method as claimed in claim 1 or 2, in which the step (b) comprises retrieving, from at least one third resource containing unscored terms, at least one third list of terms related to the input term and assigning a predetermined arbitrary score to each term of the at least one third list, and the step (c) comprises merging the at least one third list with the first list.
  4. A method as claimed in any one of the preceding claims, in which the step (b) comprises retrieving, from at least one fourth resource containing terms scored by a factor other than relatedness to each other, at least one fourth list of terms.
  5. A method as claimed in any one of the preceding claims, in which relatedness is relatedness by meaning.
  6. A method as claimed in any one of the preceding claims, in which the function is a weighted average.
  7. A method as claimed in any one of the preceding claims, in which the scores from the first and, when present, the second and fourth lists are normalised to a common range before the composite scores are assigned.
  8. A method as claimed in claim 7, in which each normalised score is calculated as: (score - min) / (max - min), where max and min are the highest and lowest scores, respectively, in the list containing the term.
  9. A method as claimed in claim 7, in which each normalised score is calculated as: (score - min) / (max - min), where max and min are the upper and lower limits, respectively, of the range of scores of the list containing the term.
  10. A method as claimed in any one of claims 7 to 8 when dependent on claim 3, in which the predetermined arbitrary score is the upper limit of the common range.
  11. A method as claimed in any one of the preceding claims, comprising removing from the output list terms which are present in at least one filter list.
  12. A method as claimed in claim 11, in which the at least one filter list contains at least one antonym of the at least one input term.
  13. A method as claimed in any one of the preceding claims, in which the first and, when present, the second and third resources are indexed by terms.
  14. A method as claimed in claim 13, in which the first and, when present, the second and third resources comprise at least one thesaurus.
  15. A method as claimed in any one of the preceding claims, in which the first and, when present, the second and third resources comprise at least one of a co-occurrence list, a similarity list, a semantic network, a word association list and a dictionary.
  16. A method as claimed in any one of the preceding claims, comprising repeating the steps (a) to (d) for each of a plurality of input terms.
  17. A method as claimed in claim 16, comprising the further step of merging the output lists to form a composite output list.
  18. A method as claimed in claim 17, comprising the further step of summing the composite scores of any term occurring in more than one of the output lists.
  19. A method as claimed in claim 16, comprising generating a thesaurus.
  20. A thesaurus generated by a method as claimed in claim 19.
  21. A computer program for programming a computer to perform a method as claimed in any one of claims 1 to 19.
  22. A storage medium containing a program as claimed in claim 21.
  23. A computer programmed by a program as claimed in claim 21.
  24. An apparatus for performing a method as claimed in any one of claims 1 to 19.
GB0218296A 2002-08-07 2002-08-07 Generating a List of Terms and a Thesaurus from Input Terms Withdrawn GB2391647A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB0218296A GB2391647A (en) 2002-08-07 2002-08-07 Generating a List of Terms and a Thesaurus from Input Terms
JP2003277917A JP2004070950A (en) 2002-08-07 2003-07-22 Method and apparatus creating word list and thesaurus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0218296A GB2391647A (en) 2002-08-07 2002-08-07 Generating a List of Terms and a Thesaurus from Input Terms

Publications (2)

Publication Number Publication Date
GB0218296D0 GB0218296D0 (en) 2002-09-11
GB2391647A true GB2391647A (en) 2004-02-11

Family

ID=9941866

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0218296A Withdrawn GB2391647A (en) 2002-08-07 2002-08-07 Generating a List of Terms and a Thesaurus from Input Terms

Country Status (2)

Country Link
JP (1) JP2004070950A (en)
GB (1) GB2391647A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0271664A2 (en) * 1986-12-16 1988-06-22 International Business Machines Corporation A morphological/phonetic method for ranking word similarities
US5218536A (en) * 1988-05-25 1993-06-08 Franklin Electronic Publishers, Incorporated Electronic spelling machine having ordered candidate words
US5469355A (en) * 1992-11-24 1995-11-21 Fujitsu Limited Near-synonym generating method
US5850561A (en) * 1994-09-23 1998-12-15 Lucent Technologies Inc. Glossary construction tool
EP1154358A2 (en) * 2000-05-12 2001-11-14 Applied Psychology Research Limited Automatic text classification system


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1880323A2 (en) * 2005-05-11 2008-01-23 W.W. Grainger, Inc. System and method for providing a response to a search query
EP1880323A4 (en) * 2005-05-11 2010-09-29 Ww Grainger Inc System and method for providing a response to a search query
US8051067B2 (en) 2005-05-11 2011-11-01 W.W. Grainger, Inc. System and method for providing a response to a search query
US8364661B2 (en) 2005-05-11 2013-01-29 W.W. Grainger, Inc. System and method for providing a response to a search query

Also Published As

Publication number Publication date
JP2004070950A (en) 2004-03-04
GB0218296D0 (en) 2002-09-11


Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)