WO1997049045A1

WO1997049045A1 - Apparatus and method for generating optimal search queries

Info

Publication number: WO1997049045A1
Application number: PCT/IB1997/000743
Authority: WO
Inventors: Scott Deerwester
Original assignee: Kdl Technologies Limited
Priority date: 1996-06-21
Filing date: 1997-06-19
Publication date: 1997-12-24
Also published as: ZA975476B; AU3044397A; EP0978058A1

Abstract

A computer creates a data structure that reveals the relationship among words contained within a sample of documents. The computer uses that data structure, which includes a similarity matrix, to form a search query for identifying documents whose subject matter relates to that of a document containing information of interest.

Description

Title of the Invention

APPARATUS AND METHOD FOR GENERATING OPTIMAL SEARCH QUERIES

Background Art

The present invention relates to text searching of large databases. More particularly, the present invention relates to apparatus and methods for generating optimal queries for such searching.

Computer networks allow users to access enormous volumes of data stored in electronic form. A public network, such as the Internet, links together not only individual computer systems, but also local computer networks, thereby creating a larger network that provides access to a massive amount of information. The volume of information that a user can access makes managing and identifying specific information of interest very difficult.

One technique to identify information of interest is to use search queries. A search query contains key terms reflecting specific words or the types of information the user is interested in accessing. A search engine identifies and retrieves information matching the search query.

The success of this technique, however, is no better than the words in the search query. On the one hand, if a user selects terms that do not distinguish information of interest from other documents, the user may receive irrelevant information. On the other hand, if a user omits words having the same meaning as a term in the search query, the user may never find information of interest.

Disclosure of the Summary

Accordingly, the present invention is directed to an apparatus and method that improves the likelihood of finding all desired information and decreases the likelihood of finding irrelevant information.

In one aspect, the invention includes a method of generating a search query to identify one of a set of documents whose subject matter relates to that of a search document containing information of interest. The method comprises the computer-executable steps of selecting a sample of documents to represent the set of documents and creating a data structure revealing relationships among words within the sample of documents. The method also includes the step of generating a search query for searching the set of documents based upon the relationship of the search document to the data structure.

In another aspect, the invention includes an apparatus for generating a search query to identify one of a set of documents whose subject matter relates to that of a search document containing information of interest. The apparatus comprises means for selecting a sample of documents to represent the set of documents; means for creating a data structure revealing relationships among words within the sample of documents; and means for generating a search query for searching the set of documents based upon the relationship of the search document to the data structure.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are intended to provide further explanation of the invention as claimed.

Brief Description of the Drawings

The accompanying drawings provide a further understanding of the invention. They illustrate a preferred implementation of the invention and together with the description, explain the principles of the invention.

Fig. 1 is a flow chart illustrating an overview of the preferred implementation for generating optimal search queries;

Fig. 2 is a block diagram of an exemplary computer for carrying out the methods of the invention;

Fig. 3 is a flow chart illustrating the preferred implementation for creating the term-pair data structure according to step 100;

Figs. 4A-4B show similarity matrix M and index, respectively, formed by the method represented by the flow chart shown in Fig. 3;

Fig. 5 is a representation of the preferred implementation for determining a set of descriptive words Wd associated with each descriptive word, according to step 310;

Fig. 6 is a representation of the preferred implementation for creating matrix 510 shown in Fig. 5;

Fig. 7 is a flow chart illustrating the preferred implementation for generating a search query using the data structure according to step 110.

Best Mode for Carrying Out the Invention

Overview

The invention serves to identify information whose subject matter is the same or similar to that of information of interest. The information may be contained in discrete elements, or "documents." The term "document" refers to any files or portions of such files containing electronically-stored data, including text files and Web pages. The invention takes advantage of knowledge discovery techniques described in U.S. patent application Serial No. entitled "Method and System for Revealing Information Structures in Collections of Data Items" by Scott Deerwester, filed on June 21, 1996, the content of which is hereby incorporated by reference.

In the preferred implementation, the invention facilitates the retrieval of documents containing text whose content relates to text contained in an accessed document, or "search document." The invention extracts information from the search document and applies the extracted information to a data structure to generate a search query containing terms. The retrieved documents are preferably accessible by a network, such as the Internet, and may be stored in databases. For example, a legal database accessible on the Internet could store documents containing data that represents court opinions. In that case, the invention could create a search query for identifying documents within that legal database related to a court opinion of interest.

A broad overview of the preferred implementation of the invention is shown in the flow chart of Fig. 1. First, a term-pair data structure is created (step 100). The data structure reveals the likelihood that, given a sample of documents, different pairs of words occur in any single document. Once the data structure has been created, it can be used to generate a search query to identify documents related to a search document (step 110). This data structure can be continually reused to generate different search queries corresponding to different search documents.

Hardware Implementation

Fig. 2 shows computer 200 for carrying out the methods of the invention. Specifically, computer 200 preferably creates the term-pair data structure (step 100 of Fig. 1), as described in greater detail in connection with Figs. 3-6, and generates a search query (step 110 of Fig. 1), as described in greater detail in connection with Fig. 7. In an alternative implementation, one computer could be used to create the data structure and another to generate search queries.

Computer 200 is preferably a general-purpose personal computer programmed to perform the functions described herein and, in particular, those described in connection with Figs. 1 and 3-7. As shown in Fig. 2, computer 200 comprises processor 210, input device 220, random access memory (RAM) 230, output device 240, storage device 250, and a network link 260. Processor 210 is connected to each of the other elements of computer 200.

Processor 210 may be any commercially-available processor capable of running software for performing the functions described above. Preferably, processor 210 is a high-speed processor, such as an Intel Pentium™ processor. Input device 220 may be a keyboard, mouse, or any other device for inputting data from the user to computer 200.

RAM 230 stores program instructions to be executed by processor 210. Output device 240, which may be a display or printing device, outputs data from computer 200 to the user. Storage device 250 stores program code representing the functions described in connection with Figs. 1 and 3-7, which is loaded into RAM 230 for execution by processor 210. Storage device 250 also stores the term-pair data structure used to generate the optimal search queries, namely a similarity matrix and an index, as described below. Storage device 250 includes floppy disks, hard disks, optical disks, CD ROMs, tapes, other magnetic media, programmable semiconductor devices, hard-wired circuitry, and any other medium for storing data. In addition, storage device 250 may also store program code representing a search engine receiving the optimal search query as input. The search engine may be any conventional program capable of searching for documents on a network based upon a search query.

Network link 260 connects computer 200 to a network, such as the Internet. Through link 260, computer 200 can access documents on the network. Link 260 includes optical fiber, coaxial cable, telephone wire, wireless transmission, and any other connection for linking a computer to a network.

Although computer 200 is described above as a programmed general-purpose computer, in alternative implementations, any machine capable of performing the functions of the methods can be used to carry out the methods of the invention.

Create Data Structure

In a preferred implementation, the data structure created in step 100 and used in step 110 includes a similarity matrix M and an index. The similarity matrix M contains data revealing the likelihood that different pairs of words occur within a single document belonging to a sample of documents. The index stores the addresses of the word pairs stored in the matrix. The index can be used to retrieve data from the similarity matrix more efficiently. The similarity matrix and the index will be described in greater detail below in connection with Figs. 4A and 4B.

Fig. 3 is a flow chart illustrating a series of steps, in accordance with a preferred implementation, for creating the data structure used to generate search queries. The steps shown in Fig. 3 can be carried out by a programmed general-purpose personal computer, such as computer 200 shown in Fig. 2, or any other suitable machine capable of performing the functions described herein.

Computer 200 accesses a sample of documents containing a representative number of words (step 300). For example, computer 200 can access a set of documents available on the Internet relating to a variety of different subject matters. These documents should contain a sufficient number of different words to yield a similarity matrix robust enough to meet the requirements of searching.

Computer 200 identifies from the document sample a set D of "descriptive" words d to serve as a basis for distinguishing the documents (step 300). In a preferred implementation, the test for "descriptiveness" is frequency. Therefore, set D preferably includes those words occurring in the documents more than a minimum number of times, Tmin, and less than a maximum number, Tmax, since words that appear too few or too many times do not distinguish documents sufficiently.
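The frequency test for set D can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the function name `select_descriptive_words`, the lowercase whitespace tokenization, and the sample values of Tmin and Tmax are all assumptions made for the example.

```python
from collections import Counter


def select_descriptive_words(documents, t_min, t_max):
    """Return the words occurring more than t_min and fewer than t_max times.

    Tokenization by lowercasing and splitting on whitespace is an assumption;
    the text does not specify how words are extracted.
    """
    counts = Counter(word for doc in documents for word in doc.lower().split())
    return {word for word, n in counts.items() if t_min < n < t_max}


# Hypothetical three-document sample.
docs = [
    "the court held the contract void",
    "the court found the contract valid",
    "the appeal concerned the patent",
]
descriptive = select_descriptive_words(docs, t_min=1, t_max=5)
print(sorted(descriptive))
```

In this sample, "the" occurs six times and is excluded as too frequent, while words occurring only once are excluded as too rare.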

After selecting a set D of "descriptive" words, computer 200 determines, for each "descriptive" word d contained within set D, a subset Wd of set D that is closely associated with word d (step 310). In a preferred implementation, "closely associated" means, for each word d in set D, those descriptive words having a discrimination value exceeding a certain threshold in the set of documents that contain word d, the threshold being determined empirically according to such factors as the speed of computer 200 and the size of the database being searched. Identifying a set Wd for each "descriptive" word d reveals inherent relationships that exist among the descriptive words and the documents within the sample. When computer 200 determines set Wd for each word d, it preferably uses cross-correlation analyses on the words in set D, as described in U.S. patent application Serial No. , previously incorporated by reference.

Fig. 5 illustrates a representation of this analysis, which is performed for each "descriptive" word d. As shown in Fig. 5, this analysis includes vectors 500, 520, 530, 550, and 560 and maps 510 and 540. Query vector 500 represents a "descriptive" word d; in other words, its only non-zero value corresponds to the word d. Map 510 represents a "descriptive words"-to-"words" map containing values representing the degree to which descriptive words are related to words, whether descriptive or nondescriptive, occurring in the sample of documents, based on the occurrence of the words in the documents containing the descriptive words. For example, pairs of words occurring together frequently in documents will correspond to higher values, while pairs of words occurring together infrequently will correspond to lower values.

Maps 510 and 540 are formed using discrimination analysis, a representation of which is shown in Fig. 6. As seen in Fig. 6, this analysis includes vectors 600, 620, 640, 650, and 660 and maps 610 and 630. Query vector 600 represents a single word occurring in the documents of the sample. Map 610 is a word-to-document map of entries representing the number of times each word appears in the corresponding documents of the sample and can be formed by counting the occurrences of words in the documents of the sample. Computer 200 "composes" vector 600 with map 610 to form result vector 620 indicating the documents containing the word from vector 600. The composing function is described in U.S. Patent Application No. , previously incorporated by reference.
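The maps and the composing step can be sketched as follows. The composing function itself is defined only in the incorporated application, so the sketch assumes it behaves like a sparse vector-matrix product; that assumption, the nested-dictionary representation, and the names `build_word_to_document_map`, `invert_map`, and `compose` are illustrative only.

```python
from collections import Counter, defaultdict


def build_word_to_document_map(documents):
    """Map 610: word -> {document number: occurrence count}."""
    word_to_doc = defaultdict(dict)
    for doc_id, text in enumerate(documents):
        for word, n in Counter(text.lower().split()).items():
            word_to_doc[word][doc_id] = n
    return dict(word_to_doc)


def invert_map(mapping):
    """Swap rows and columns, e.g. to obtain document-to-word map 630."""
    inverse = defaultdict(dict)
    for row, columns in mapping.items():
        for column, value in columns.items():
            inverse[column][row] = value
    return dict(inverse)


def compose(vector, mapping):
    """Assumed composition: multiply a sparse vector by a sparse map."""
    result = defaultdict(float)
    for key, weight in vector.items():
        for column, value in mapping.get(key, {}).items():
            result[column] += weight * value
    return dict(result)


# Hypothetical sample of three documents.
docs = ["court contract court", "contract patent", "patent appeal"]
map_610 = build_word_to_document_map(docs)
map_630 = invert_map(map_610)

# Query vector 600 has a single non-zero entry for the word "contract".
vector_620 = compose({"contract": 1.0}, map_610)
print(vector_620)
```

Under these assumptions, the result vector has non-zero entries only for documents 0 and 1, the two documents that contain "contract".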

Computer 200 then composes vector 620 with document-to-word map 630, the inverse of map 610, to form result vector 640 representing the frequency with which words occur in documents that contain the word of vector 600. Computer 200 normalizes vector 640 using profile vector 650 to form discrimination vector 660. Discrimination vector 660 contains discrimination values for respective words as they appear in the documents of vector 620. Performing discrimination analysis for each word yields a set of discrimination vectors forming the respective rows of map 510.
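One way to picture the discrimination analysis of Fig. 6 is the sketch below. The text does not define profile vector 650 or the normalization, so the sketch assumes the profile holds each word's total frequency over the whole sample and that normalizing means element-wise division. It also treats vector 620 as a simple membership indicator. The function name `discrimination_vector` and the sample counts are hypothetical.

```python
def discrimination_vector(word, doc_word_counts):
    """Return assumed discrimination values for words co-occurring with `word`.

    doc_word_counts is a list with one {word: count} dictionary per document.
    """
    # Vector 620: the documents that contain the query word.
    containing = [counts for counts in doc_word_counts if word in counts]

    # Vector 640: word frequencies within those documents.
    local = {}
    for counts in containing:
        for w, n in counts.items():
            local[w] = local.get(w, 0) + n

    # Profile vector 650 (assumption): word frequencies over the whole sample.
    profile = {}
    for counts in doc_word_counts:
        for w, n in counts.items():
            profile[w] = profile.get(w, 0) + n

    # Vector 660 (assumption): local frequency divided by overall frequency.
    return {w: n / profile[w] for w, n in local.items()}


sample = [
    {"court": 2, "contract": 1},
    {"contract": 1, "patent": 1},
    {"patent": 1, "appeal": 1},
]
vector_660 = discrimination_vector("contract", sample)
print(vector_660)
```

With this normalization, a word scores 1.0 when all of its occurrences fall in documents containing the query word, and lower values as its occurrences spread to other documents.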

As represented by the downward-curved arrow shown in Fig. 5, computer 200 composes vector 500 with map 510 to produce result vector 520, containing values representing the relative frequency with which words occur in documents that contain the descriptive word of vector 500. Computer 200 retains the k1 words having the highest relative frequency to form query vector 530.

As shown by the curved arrow pointing to the right, computer 200 composes vector 530 with map 540, which is the inverse of map 510, to form result vector 550. Vector 550 contains values representing the degree to which descriptive words are associated with the k1 words associated with the descriptive word of vector 500. Computer 200 selects the k2 descriptive words in vector 550 having the highest discrimination values to form vector 560 and set Wd. Both k1 and k2 can be chosen empirically to provide a sufficient degree of discrimination.
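The two-stage analysis of Fig. 5 can be sketched as follows. As before, composition is assumed to be a sparse vector-matrix product, since the actual composing function is defined only in the incorporated application. The names `related_descriptive_words` and `top_k` and the values in `map_510` are hypothetical.

```python
import heapq
from collections import defaultdict


def compose(vector, mapping):
    """Assumed composition: multiply a sparse vector by a sparse map."""
    result = defaultdict(float)
    for key, weight in vector.items():
        for column, value in mapping.get(key, {}).items():
            result[column] += weight * value
    return dict(result)


def invert_map(mapping):
    """Swap rows and columns of a nested-dictionary map."""
    inverse = defaultdict(dict)
    for row, columns in mapping.items():
        for column, value in columns.items():
            inverse[column][row] = value
    return dict(inverse)


def top_k(vector, k):
    """Keep the k entries with the highest values."""
    return dict(heapq.nlargest(k, vector.items(), key=lambda item: item[1]))


def related_descriptive_words(d, map_510, k1, k2):
    """Return vector 560, whose keys form set Wd for descriptive word d."""
    map_540 = invert_map(map_510)
    vector_520 = compose({d: 1.0}, map_510)    # words related to d
    vector_530 = top_k(vector_520, k1)         # keep the k1 strongest words
    vector_550 = compose(vector_530, map_540)  # back to descriptive words
    return top_k(vector_550, k2)               # keep the k2 strongest


# Hypothetical map 510: descriptive word -> {word: discrimination value}.
map_510 = {
    "court": {"court": 1.0, "judge": 0.9, "appeal": 0.6},
    "contract": {"contract": 1.0, "breach": 0.8, "judge": 0.2},
    "tribunal": {"tribunal": 1.0, "judge": 0.7, "appeal": 0.5},
}
vector_560 = related_descriptive_words("court", map_510, k1=2, k2=2)
print(sorted(vector_560))
```

In this sample, "tribunal" enters set Wd for "court" because both are strongly tied to "judge", even though the two words never need to appear in the same row.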

Using the sets D and Wd of descriptive words, computer 200 constructs a similarity matrix M (step 320). Matrix M, shown in Fig. 4A, contains data about pairs of words, each pair including words d1 from set D and words d2 from corresponding set Wd whose similarity measures exceed a predetermined threshold. A similarity measure indicates the degree to which d1 and d2 are similar, based on the common occurrence of the words that best discriminate documents that contain words d1 and d2, respectively.

In the preferred implementation, the similarity measure, cos (d1, d2), results from the following operation:

cos (d1, d2) = sqr (sum (d1 * d2)) / sqr (sqr (sum (d1 * d1)) * sqr (sum (d2 * d2)))

which is performed with respect to map 510, where "sqr" is the square root function, "(d1 * d2)" represents the products of corresponding values in map 510 for d1 and d2, and "sum" is the summation of those products. Cos (d1, d2) equals 1 when words d1 and d2 have the highest degree of similarity because the words that distinguish the documents in which each appears are the same. Cos (d1, d2) equals 0 when words d1 and d2 have no similarity because the words that distinguish the documents in which each appears are completely different. Preferably the similarity matrix only includes pairs with similarity measures exceeding a threshold, such as 0.25.
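The similarity measure and the thresholding step can be sketched as follows. The sketch assumes the measure is read as sqr(sum(d1 * d2)) divided by sqr(sqr(sum(d1 * d1)) * sqr(sum(d2 * d2))), which is the square root of the conventional cosine and has the stated end points of 1 and 0. That reading, the function names, and the sample rows of map 510 are assumptions.

```python
import math


def similarity(row_1, row_2):
    """Assumed similarity of two rows of map 510 (each a {word: value} dict)."""
    shared = sum(value * row_2[word] for word, value in row_1.items() if word in row_2)
    norm_1 = math.sqrt(sum(value * value for value in row_1.values()))
    norm_2 = math.sqrt(sum(value * value for value in row_2.values()))
    if shared <= 0 or norm_1 == 0 or norm_2 == 0:
        return 0.0  # no discriminating words in common
    return math.sqrt(shared) / math.sqrt(norm_1 * norm_2)


def build_similarity_matrix(map_510, related_words, threshold=0.25):
    """Keep only the pairs (d1, d2) whose similarity exceeds the threshold."""
    matrix = {}
    for d1, w_d in related_words.items():
        for d2 in w_d:
            value = similarity(map_510[d1], map_510[d2])
            if value > threshold:
                matrix[(d1, d2)] = value
    return matrix


# Hypothetical rows of map 510 and a hypothetical set Wd for "court".
map_510 = {
    "court": {"judge": 3.0, "appeal": 4.0},
    "tribunal": {"judge": 3.0, "appeal": 4.0},
    "contract": {"breach": 1.0},
}
related_words = {"court": ["tribunal", "contract"]}
matrix_m = build_similarity_matrix(map_510, related_words)
print(matrix_m)
```

Here "court" and "tribunal" share identical discriminating words and score 1, while "court" and "contract" share none, score 0, and are left out of the matrix.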

To access data efficiently from the similarity matrix, computer 200 preferably sorts the entries (e.g., pairs of d1 and d2 having similarity measures exceeding a threshold) according to the frequency of word d2. To do so, computer 200 preferably records the frequency of each word d2 added to the similarity matrix and sorts the entries in the matrix from the most frequently occurring word d2 to the least frequently occurring. Computer 200 then assigns to each word d2 an integer identifier that increases in value as the frequency of word d2 decreases.
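The sorting and numbering step can be sketched as follows. The sketch assumes that the "frequency" of a word d2 is the number of matrix entries in which it appears as d2; the text could also be read as referring to its frequency in the document sample. The function name `sort_and_number` and the sample entries are hypothetical.

```python
from collections import Counter


def sort_and_number(entries):
    """Sort (d1, d2, similarity) entries by d2 frequency and number each d2.

    Identifier 0 goes to the most frequent d2; larger identifiers go to
    progressively rarer words.
    """
    frequency = Counter(d2 for _, d2, _ in entries)
    identifiers = {
        word: number for number, (word, _) in enumerate(frequency.most_common())
    }
    ordered = sorted(entries, key=lambda entry: identifiers[entry[1]])
    return ordered, identifiers


entries = [
    ("court", "judge", 0.9),
    ("contract", "breach", 0.8),
    ("tribunal", "judge", 0.7),
    ("appeal", "judge", 0.6),
    ("damages", "breach", 0.5),
    ("patent", "claim", 0.4),
]
ordered, identifiers = sort_and_number(entries)
print(identifiers)
```

Giving the smallest identifiers to the most frequent words means the values stored most often are also the smallest, which tends to help the later compression step.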

After constructing similarity matrix M, computer 200 compresses it and generates an index (step 330). The similarity matrix can be compressed using any conventional data compression technique for conserving the memory in which the similarity matrix is stored. In the preferred implementation, the similarity matrix is compressed using blockwise prefix omitted encoding, as described in U.S. patent application Serial No. , previously incorporated by reference.

The index, shown in Fig. 4B, preferably comprises a list of entries, each including a word d1, an identifier representing a word d2, and the corresponding address in the similarity matrix. This index permits efficient access of data from the similarity matrix.
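The index layout can be sketched as follows. The sketch assumes the matrix is held as an uncompressed list and that an "address" is simply a position in that list; in the preferred implementation the matrix is compressed and the address would refer to a location in the compressed form. The names `build_index` and `entries_for` and the sample data are hypothetical.

```python
def build_index(sorted_entries, identifiers):
    """Return index entries of the form (d1, identifier of d2, address)."""
    return [
        (d1, identifiers[d2], address)
        for address, (d1, d2, _) in enumerate(sorted_entries)
    ]


def entries_for(word, index, sorted_entries):
    """Use the index to fetch the matrix entries whose d1 is `word`."""
    return [sorted_entries[address] for d1, _, address in index if d1 == word]


# Hypothetical matrix entries (d1, d2, similarity) and d2 identifiers.
sorted_entries = [
    ("court", "judge", 0.9),
    ("tribunal", "judge", 0.7),
    ("contract", "breach", 0.8),
]
identifiers = {"judge": 0, "breach": 1}

index = build_index(sorted_entries, identifiers)
print(index)
print(entries_for("court", index, sorted_entries))
```

A production index would more likely be keyed by d1 for constant-time lookup; the linear scan here only keeps the sketch short.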

In one embodiment, the sample of documents serving as a basis for matrix M and the index relates to a variety of different topics and subject matters. As such, matrix M and the index can be used, as described below, to generate search queries for documents of that variety of topics and subject matters. In alternative embodiments, the sample of documents serving as the basis for matrix M and the index relates to a specific topic (e.g., medical records), in which case the search query generated from such a matrix M would be useful in obtaining documents of that topic (e.g., medical records). In addition, matrix M and the index, once generated, can be used multiple times to generate search queries, as discussed below. Preferably, however, matrix M and the index are regenerated periodically from an updated sample of documents representative of a current state of the documents to be searched.

Generate Query Using Data Structure

Once the data structure in Figs. 4A and 4B is created, it can be used to generate search queries for identifying documents whose subject matter is the same or similar to that of a "search" document. The generated search query preferably includes one or more search terms that serve as the input to a search engine that searches databases or storage devices, such as ones accessible on the Internet, for documents containing those terms. The search can employ any commercially-available search engine suitable for performing those functions. Preferably, the search engine performs relevance ranking, so as to provide those documents matching the search query in order from the documents containing the greatest number of terms in the search query to those containing the least.

Fig. 7 is a flow chart illustrating the preferred implementation for generating search queries using the data structure. Like the steps shown in Fig. 3, the steps shown in Fig. 7 can also be carried out by a programmed general-purpose computer similar to computer 200 shown in Fig. 2, or any other machine capable of performing the functions described herein.

As shown in Fig. 7, computer 200 accesses a "search" document containing information of interest available on, for example, the Internet (step 210). In a preferred implementation, a user selects a document or a portion of a document through a user interface provided by computer 200. The user can request additional documents similar in content to the one he is accessing, which becomes the "search" document.

Computer 200 extracts words from the search document (step 700), evaluates each extracted word (steps 705, 735, and 745), and generates the search query based upon this evaluation. Specifically, computer 200 determines whether each extracted word is a "stopword" (step 710) and whether it is in the index (step 715). A "stopword" is a word that is particularly poor at distinguishing documents. Stopwords necessarily include words such as "the" and "an", which occur in virtually every document, as well as words that serve as function words in a particular document collection, such as the words "information" and "address" in Web documents.

If the extracted word is not a "stopword" and is not in the index, then computer 200 adds the extracted word to a secondary query term list (step 720). Words that are in the index are added to a primary query term list (step 725). As described below, the words in the primary and secondary query term lists will be added to other words from an auxiliary query term list to form the search query.
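The sorting of extracted words into the primary and secondary lists can be sketched as follows. The function name `classify_words`, the removal of duplicates, and the sample stopword and index sets are assumptions made for the example.

```python
def classify_words(extracted_words, stopwords, index_words):
    """Split extracted words into primary and secondary query term lists."""
    primary, secondary = [], []
    for word in extracted_words:
        if word in stopwords:
            continue  # stopwords contribute nothing to the query
        if word in index_words:
            if word not in primary:
                primary.append(word)    # word is known to the similarity matrix
        elif word not in secondary:
            secondary.append(word)      # not a stopword, but absent from the index
    return primary, secondary


# Hypothetical stopword list, index vocabulary, and search document.
stopwords = {"the", "an", "of", "information"}
index_words = {"court", "contract", "appeal"}
extracted = "the court upheld the contract of the claimant".split()

primary, secondary = classify_words(extracted, stopwords, index_words)
print(primary, secondary)
```

The secondary list preserves terms that the sample of documents never covered, such as rare names, so they can still appear in the query.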

If the extracted word is not a "stopword" and is in the index, that is, if the word is a primary query term, computer 200 retrieves from the similarity matrix the words associated with the extracted word and their respective scores (step 730). Each score is calculated by multiplying the similarity measure retrieved from the data structure by a normalized frequency, which is computed according to the formula (1 + log2 (freq)), where freq is the number of times that the word appears in the document being processed. Computer 200 places the retrieved words into an auxiliary query list (step 730). For retrieved words that are associated with more than one primary query term, the final score is computed by adding all of the individual scores associated with the individual primary query terms. Once computer 200 has evaluated all of the extracted words, it selects up to a set number of words from the auxiliary query list having scores exceeding a threshold and adds those selected words to the words in the primary query list to form the search query (step 740).
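The scoring of auxiliary terms can be sketched as follows. The sketch assumes that "freq" is the number of times the primary query term occurs in the search document, and that words already in the primary list are not repeated as auxiliary terms. The function names, the nested-dictionary form of the similarity matrix, and the sample values are hypothetical.

```python
import math
from collections import Counter


def score_auxiliary_terms(document_words, similarity_matrix):
    """Sum similarity * (1 + log2(freq)) over every primary term of the document."""
    counts = Counter(document_words)
    scores = {}
    for term, freq in counts.items():
        weight = 1 + math.log2(freq)  # normalized frequency of the primary term
        for related, measure in similarity_matrix.get(term, {}).items():
            scores[related] = scores.get(related, 0.0) + measure * weight
    return scores


def form_query(primary, scores, threshold, max_auxiliary):
    """Add up to max_auxiliary words scoring above the threshold to the primary terms."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    auxiliary = [
        word for word, score in ranked if score > threshold and word not in primary
    ]
    return primary + auxiliary[:max_auxiliary]


# Hypothetical similarity data: primary term -> {associated word: similarity}.
similarity_matrix = {
    "court": {"judge": 0.5, "tribunal": 0.4},
    "appeal": {"judge": 0.3},
}
document_words = ["court", "court", "appeal", "the"]

scores = score_auxiliary_terms(document_words, similarity_matrix)
query = form_query(["court", "appeal"], scores, threshold=1.0, max_auxiliary=5)
print(scores, query)
```

In this sample, "judge" is associated with both primary terms, so its two individual scores are added and it clears the threshold, while "tribunal" does not.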

Finally, computer 200 constructs a query that has a number of primary, secondary and auxiliary query terms, where each of these numbers may be set by the user (step 740). In this way, the search query includes words that are highly associated with words in the document of interest as well as words that are not in the similarity matrix. As described above, computer 200 can pass the search query to a search engine, preferably one capable of relevance ranking, to identify documents matching the query, for example documents accessible on the Internet. In the preferred implementation, the search query comprises a list of words selected using the above-described techniques. In alternative implementations, Boolean connectors may be included to define relationships among those words.

Conclusion

By utilizing knowledge discovery techniques, relationships among words in a sample of documents can be realized. Those relationships can then be used to form search queries for a search document. The search query can then be passed to a search engine, preferably one having relevance ranking, to retrieve documents matching the search query. It will be apparent to those skilled in the art that various modifications and variations can be made in the method and apparatus of the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention cover the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents.

Claims

1. A method of generating a search query to identify one of a set of documents whose subject matter relates to that of a search document containing information of interest, comprising the computer-executable steps of: selecting a sample of documents to represent the set of documents; creating a data structure revealing relationships among words within the sample of documents; and generating a search query for searching the set of documents based upon the relationship of the search document to the data structure.

2. The method according to claim 1, wherein the step of creating a data structure comprises the substeps of: selecting, in accordance with predetermined selective criteria, a set D of descriptive words that occur within the sample of documents; determining, for each descriptive word d in set D, a subset Wd of set D representing the words related to the descriptive word d in accordance with predetermined selective criteria; and constructing as the data structure a similarity matrix containing similarity values for pairs of descriptive words for each word d in set D and corresponding set Wd.

3. The method according to claim 2, wherein the substep of selecting a set D comprises the substep of selecting words occurring within the sample of documents more than a minimum threshold and less than a maximum threshold.

4. The method according to claim 2, wherein the substep of determining the subset Wd, for each word d of set D, comprises the substeps of: composing a first query vector representing word d with a first map to form a first result vector representing the frequencies of words occurring in documents that contain d; forming a second query vector containing words from the result vector having frequencies exceeding a threshold value k1; composing the second query vector with a second map to form a second result vector containing values representing the degree that words in set D relate to words of the second query vector; and forming set Wd from the words contained in the second result vector corresponding to values exceeding a threshold value k2.

5. The method according to claim 4, wherein the substep of constructing a similarity matrix comprises the substeps of: determining a similarity measure for each pair of words from set D and set Wd; and adding to the similarity matrix those pairs of words having similarity measures exceeding a threshold.

6. The method according to claim 5, wherein the substep of determining a similarity measure comprises the substep of performing the following equation: cos (d1, d2) = sqr (sum (d1 * d2)) / sqr (sqr (sum (d1 * d1)) * sqr (sum (d2 * d2))) with respect to the first map, where d1 is a word from set D, d2 is a word from set Wd, sqr is the square root function, (d1 * d2) represents products of corresponding values in the first map for d1 and d2, and sum is the summation of the products.

7. The method according to claim 2, further comprising the substeps of: compressing the similarity matrix; and generating an index containing words and their addresses in the similarity matrix.

8. The method according to claim 1, wherein the generating step comprises the substeps of: extracting words from the search document; evaluating the extracted words based upon the data structure; and forming a search query based upon the evaluation.

9. The method according to claim 8, wherein the substep of evaluating the extracted words includes the substeps of determining that extracted words occurring within the sample of documents more than a predetermined threshold are stopwords; and determining whether the extracted words are contained within the data structure.

10. The method according to claim 9, wherein the substep of evaluating the extracted words further includes the substeps of: obtaining words from the data structure related to the extracted words contained in the data structure; adding the extracted words contained in the data structure to a primary query term list; adding the determined words to an auxiliary query term list; and adding the extracted words that are not contained in the data structure and that are not stopwords to a secondary query term list.

11. The method according to claim 10, wherein the step of forming a search query includes the substep of forming a search query based upon words contained in the primary query term list, secondary query term list, and auxiliary query term list satisfying a predetermined selective criteria.

12. The method according to claim 1, further comprising the step of providing the generated search query to a search engine.

13. The method according to claim 12, further comprising the step of searching for documents matching the search query using the search engine.

14. The method according to claim 1, wherein documents comprise at least one of text files and Web pages.

15. A method of creating a data structure comprising the steps of: selecting, in accordance with predetermined selective criteria, a set D of descriptive words that occur within a sample of documents; determining, for each descriptive word d in set D, a subset Wd of set D representing the words related to the descriptive word d in accordance with predetermined selective criteria; and constructing as the data structure a similarity matrix containing similarity values for pairs of descriptive words for each word d in set D and corresponding set Wd.

16. The method according to claim 15, wherein the step of selecting a set D comprises the substep of selecting words occurring within the sample of documents more than a minimum threshold and less than a maximum threshold.

17. The method according to claim 15, wherein the step of determining the subset Wd, for each word d of set D, comprises the substeps of: composing a first query vector representing word d with a first map to form a first result vector representing the frequencies of words occurring in documents that contain d; forming a second query vector containing words from the result vector having frequencies exceeding a threshold value k1; composing the second query vector with a second map to form a second result vector containing values representing the degree that words in set D relate to words of the second query vector; and forming set Wd from the words contained in the second result vector corresponding to values exceeding a threshold value k2.

18. The method according to claim 17, wherein the step of constructing a similarity matrix comprises the substeps of: determining a similarity measure for each pair of words from set D and set Wd; and adding to the similarity matrix those pairs of words having similarity measures exceeding a threshold.

19. The method according to claim 18, wherein the substep of determining a similarity measure comprises the substep of performing the following equation: cos (d1, d2) = sqr (sum (d1 * d2)) / sqr (sqr (sum (d1 * d1)) * sqr (sum (d2 * d2))) with respect to the first map, where d1 is a word from set D, d2 is a word from set Wd, sqr is the square root function, (d1 * d2) represents products of corresponding values in the first map for d1 and d2, and sum is the summation of the products.

20 The method according to claim 15, further comprising the steps of: compressing the similarity matrix; and generating an index containing words and their addresses in the similarity matrix

21. A method of generating a search query corresponding to a search document based upon a data structure, comprising the steps of extracting words from the search document, evaluating the extracted words based upon the data structure, and forming a search query based upon the evaluation

22 The method according to claim 21 , wherein the step of evaluating the extracted words includes the substeps of determining that extracted words occurring within the sample of documents more than a predetermined threshold are stopwords: and determining whether the extracted words are contained within the data structure.

23. The method according to claim 22, wherein the step of evaluating the extracted words further includes the steps of: obtaining words from the data structure related to the extracted words contained in the data structure; adding the extracted words contained in the data structure to a primary query term list; adding the determined words to an auxiliary query term list; and adding the extracted words that are not contained in the data structure and that are not stopwords to a secondary query term list.

24. The method according to claim 23, wherein the step of forming a search query includes the substep of forming a search query based upon words contained in the primary query term list, secondary query term list, and auxiliary query term list satisfying a predetermined selective criteria.
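The evaluation steps of claims 21 to 24 can be sketched as below. The layout of the data structure (a dictionary from a word to its related words), the function names, and the `max_auxiliary` cut-off standing in for the unspecified "predetermined selective criteria" are all assumptions for the sake of the example.

```python
def build_query_terms(doc_words, similarity, sample_freq, stop_threshold):
    """Sort extracted words into primary, secondary and auxiliary lists.

    similarity: word -> {related word: similarity value} (hypothetical layout)
    sample_freq: word -> occurrences within the sample of documents
    """
    primary, secondary, auxiliary = [], [], []
    for word in dict.fromkeys(doc_words):  # unique words, original order
        if word in similarity:
            primary.append(word)
            for related in similarity[word]:
                if related not in auxiliary:
                    auxiliary.append(related)
        elif sample_freq.get(word, 0) <= stop_threshold:
            # Not in the data structure and not a stopword.
            secondary.append(word)
    return primary, secondary, auxiliary


def form_query(primary, secondary, auxiliary, max_auxiliary=5):
    """Join the lists into a query; the cut-off is an assumed criterion."""
    return " ".join(primary + secondary + auxiliary[:max_auxiliary])
```

For example, with the word "the" occurring far above the stopword threshold, a document containing "the cat zebra" would yield "cat" as a primary term, "zebra" as a secondary term, and the words related to "cat" as auxiliary terms.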

25. An apparatus for generating a search query to identify one of a set of documents whose subject matter relates to that of a search document containing information of interest, comprising: means for selecting a sample of documents to represent the set of documents; means for creating a data structure revealing relationships among words within the sample of documents; and means for generating a search query for searching the set of documents based upon the relationship of the search document to the data structure.

26. The apparatus according to claim 25, wherein the means for creating a data structure comprises: means for selecting, in accordance with predetermined selective criteria, a set D of descriptive words that occur within the sample of documents; means for determining, for each descriptive word d in set D, a subset Wd of set D representing the words related to the descriptive word d in accordance with predetermined selective criteria; and means for constructing as the data structure a similarity matrix containing similarity values for pairs of descriptive words for each word d in set D and corresponding set Wd.

27. The apparatus according to claim 26, wherein the means for selecting a set D comprises: means for selecting words occurring within the sample of documents more than a minimum threshold and less than a maximum threshold.
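The threshold-based selection of set D can be illustrated with a short sketch. It assumes that "occurring" means the total number of occurrences across the sample; the function name `select_descriptive_words` and its parameters are hypothetical.

```python
from collections import Counter


def select_descriptive_words(docs, min_count, max_count):
    """Return words occurring more than min_count and fewer than max_count times.

    docs: iterable of documents, each an iterable of words (assumed layout)
    """
    freq = Counter(word for doc in docs for word in doc)
    return {word for word, n in freq.items() if min_count < n < max_count}
```

The upper bound filters out very common words that do not discriminate between documents, while the lower bound drops words too rare to yield reliable relationships.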

28. The apparatus according to claim 26, wherein the means for determining the subset Wd, for each word d of set D, comprises: means for composing a first query vector representing word d with a first map to form a first result vector representing the frequencies of words occurring in documents that contain d; means for forming a second query vector containing words from the result vector having frequencies exceeding a threshold value k1; means for composing the second query vector with a second map to form a second result vector containing values representing the degree that words in set D relate to words of the second query vector; and means for forming set Wd from the words contained in the second result vector corresponding to values exceeding a threshold value k2.

29. The apparatus according to claim 28, wherein the means for constructing a similarity matrix comprises: means for determining a similarity measure for each pair of words from set D and set Wd; and means for adding to the similarity matrix those pairs of words having similarity measures exceeding a threshold.

30. The apparatus according to claim 29, wherein the means for determining a similarity measure comprises means for performing the following equation: cos (d1, d2) = sqr (sum (d1 * d2)) / sqr (sqr (sum (d1 * d2)) * (sqr (sum (d1 * d2)))) with respect to the first map, where d1 is a word from set D, d2 is a word from set Wd, sqr is the square root function, (d1 * d2) represents products of corresponding values in the first map for d1 and d2, and sum is the summation of the products.

31. The apparatus according to claim 26, further comprising: means for compressing the similarity matrix; and means for generating an index containing words and their addresses in the similarity matrix.

32. The apparatus according to claim 25, wherein the means for generating comprises: means for extracting words from the search document; means for evaluating the extracted words based upon the data structure; and means for forming a search query based upon the evaluation.

33. The apparatus according to claim 32, wherein the means for evaluating the extracted words includes: means for determining that extracted words occurring within the sample of documents more than a predetermined threshold are stopwords; and means for determining whether the extracted words are contained within the data structure.

34. The apparatus according to claim 33, wherein the means for evaluating the extracted words includes: means for obtaining words from the data structure related to the extracted words contained in the data structure; means for adding the extracted words contained in the data structure to a primary query term list; means for adding the determined words to an auxiliary query term list; and means for adding the extracted words that are not contained in the data structure and that are not stopwords to a secondary query term list.

35. The apparatus according to claim 34, wherein the means for forming a search query includes means for forming a search query based upon words contained in the primary query term list, secondary query term list, and auxiliary query term list satisfying a predetermined selective criteria.

36. The apparatus according to claim 35, further comprising means for providing the generated search query to a search engine.

37. The apparatus according to claim 36, further comprising means for searching for documents matching the search query using the search engine.

38. The apparatus according to claim 25, wherein documents comprise at least one of text files and Web pages.

39. An apparatus for creating a data structure comprising: means for selecting, in accordance with predetermined selective criteria, a set D of descriptive words that occur within a sample of documents; means for determining, for each descriptive word d in set D, a subset Wd of set D representing the words related to the descriptive word d in accordance with predetermined selective criteria; and means for constructing as the data structure a similarity matrix containing similarity values for pairs of descriptive words for each word d in set D and corresponding set Wd.

40. The apparatus according to claim 39, wherein the means for selecting a set D comprises: means for selecting words occurring within the sample of documents more than a minimum threshold and less than a maximum threshold.

41. The apparatus according to claim 39, wherein the means for determining the subset Wd, for each word d of set D, comprises: means for composing a first query vector representing word d with a first map to form a first result vector representing the frequencies of words occurring in documents that contain d; means for forming a second query vector containing words from the result vector having frequencies exceeding a threshold value k1; means for composing the second query vector with a second map to form a second result vector containing values representing the degree that words in set D relate to words of the second query vector; and means for forming set Wd from the words contained in the second result vector corresponding to values exceeding a threshold value k2.

42. The apparatus according to claim 41, wherein the means for constructing a similarity matrix comprises: means for determining a similarity measure for each pair of words from set D and set Wd; and means for adding to the similarity matrix those pairs of words having similarity measures exceeding a threshold.

43. The apparatus according to claim 42, wherein the means for determining a similarity measure comprises means for performing the following equation: cos (d1, d2) = sqr (sum (d1 * d2)) / sqr (sqr (sum (d1 * d2)) * (sqr (sum (d1 * d2)))) with respect to the first map, where d1 is a word from set D, d2 is a word from set Wd, sqr is the square root function, (d1 * d2) represents products of corresponding values in the first map for d1 and d2, and sum is the summation of the products.

44. The apparatus according to claim 39, further comprising: means for compressing the similarity matrix; and means for generating an index containing words and their addresses in the similarity matrix.

45. An apparatus for generating a search query corresponding to a search document based upon a data structure, comprising: means for extracting words from the search document; means for evaluating the extracted words based upon the data structure; and means for forming a search query based upon the evaluation.

46. The apparatus according to claim 45, wherein the means for evaluating the extracted words includes: means for determining that extracted words occurring within the sample of documents more than a predetermined threshold are stopwords; and means for determining whether the extracted words are contained within the data structure.

47. The apparatus according to claim 46, wherein the means for evaluating the extracted words further includes: means for obtaining words from the data structure related to the extracted words contained in the data structure; means for adding the extracted words contained in the data structure to a primary query term list; means for adding the determined words to an auxiliary query term list; and means for adding the extracted words that are not contained in the data structure and that are not stopwords to a secondary query term list.

48. The apparatus according to claim 47, wherein the means for forming a search query includes means for forming a search query based upon words contained in the primary query term list, secondary query term list, and auxiliary query term list satisfying a predetermined selective criteria.

49. An article of manufacture for causing a computer to generate a search query to identify one of a set of documents whose subject matter relates to that of a search document containing information of interest, comprising: means for causing a computer to select a sample of documents to represent the set of documents; means for causing a computer to generate a data structure revealing relationships among words within the sample of documents; and means for causing a computer to generate a search query for searching the set of documents based upon the relationship of the search document to the data structure.

50. The article of manufacture according to claim 49, wherein the means for causing a computer to create a data structure comprises: means for causing a computer to select, in accordance with predetermined selective criteria, a set D of descriptive words that occur within the sample of documents; means for causing a computer to determine, for each descriptive word d in set D, a subset Wd of set D representing the words related to the descriptive word d in accordance with predetermined selective criteria; and means for causing a computer to construct as the data structure a similarity matrix containing similarity values for pairs of descriptive words for each word d in set D and corresponding set Wd.

51. The article of manufacture according to claim 50, wherein the means for causing a computer to determine the subset Wd, for each word d of set D, comprises: means for causing a computer to compose a first query vector representing word d with a first map to form a first result vector representing the frequencies of words occurring in documents that contain d; means for causing a computer to form a second query vector containing words from the result vector having frequencies exceeding a threshold value k1; means for causing a computer to compose the second query vector with a second map to form a second result vector containing values representing the degree that words in set D relate to words of the second query vector; and means for causing a computer to form a set Wd from the words contained in the second result vector corresponding to values exceeding a threshold value k2.

52. The article of manufacture according to claim 51, wherein the means for causing a computer to construct a similarity matrix comprises: means for causing a computer to determine a similarity measure for each pair of words from set D and set Wd; and means for causing a computer to add to the similarity matrix those pairs of words having similarity measures exceeding a threshold.

53. The article of manufacture according to claim 52, wherein the means for causing a computer to determine a similarity measure comprises means for causing a computer to perform the following equation: cos (d1, d2) = sqr (sum (d1 * d2)) / sqr (sqr (sum (d1 * d2)) * (sqr (sum (d1 * d2)))) with respect to the first map, where d1 is a word from set D, d2 is a word from set Wd, sqr is the square root function, (d1 * d2) represents products of corresponding values in the first map for d1 and d2, and sum is the summation of the products.

54. The article of manufacture according to claim 49, wherein the means for causing a computer to generate comprises: means for causing a computer to extract words from the search document; means for causing a computer to evaluate extracted words based upon the data structure; and means for causing a computer to form a search query based upon the evaluation.

55. An article of manufacture for causing a computer to create a data structure comprising: means for causing a computer to select, in accordance with predetermined selective criteria, a set D of descriptive words that occur within a sample of documents; means for causing a computer to determine, for each descriptive word d in set D, a subset Wd of set D representing the words related to the descriptive word d in accordance with predetermined selective criteria; and means for causing a computer to construct as the data structure a similarity matrix containing similarity values for pairs of descriptive words for each word d in set D and corresponding set Wd.

56. The article of manufacture according to claim 55, wherein the means for causing a computer to select a set D comprises: means for causing a computer to select words occurring within the sample of documents more than a minimum threshold and less than a maximum threshold.

57. The article of manufacture according to claim 55, wherein the means for causing a computer to determine the subset Wd, for each word d of set D, comprises: means for causing a computer to compose a first query vector representing word d with a first map to form a first result vector representing the frequencies of words occurring in documents that contain d; means for causing a computer to form a second query vector containing words from the result vector having frequencies exceeding a threshold value k1; means for causing a computer to compose the second query vector with a second map to form a second result vector containing values representing the degree that words in set D relate to words of the second query vector; and means for causing a computer to form set Wd from the words contained in the second result vector corresponding to values exceeding a threshold value k2.

58. The article of manufacture according to claim 57, wherein the means for causing a computer to construct a similarity matrix comprises: means for causing a computer to determine a similarity measure for each pair of words from set D and set Wd; and means for causing a computer to add to the similarity matrix those pairs of words having similarity measures exceeding a threshold.

59. The article of manufacture according to claim 58, wherein the means for causing a computer to determine a similarity measure comprises means for causing a computer to perform the following equation: cos (d1, d2) = sqr (sum (d1 * d2)) / sqr (sqr (sum (d1 * d2)) * (sqr (sum (d1 * d2)))) with respect to the first map, where d1 is a word from set D, d2 is a word from set Wd, sqr is the square root function, (d1 * d2) represents products of corresponding values in the first map for d1 and d2, and sum is the summation of the products.

60. An article of manufacture for causing a computer to generate a search query corresponding to a search document based upon a data structure, comprising: means for causing a computer to extract words from the search document; means for causing a computer to evaluate the extracted words based upon the data structure; and means for causing a computer to form a search query based upon the evaluation.

61. The article of manufacture according to claim 60, wherein the means for causing a computer to evaluate the extracted words includes: means for causing a computer to determine that extracted words occurring within the sample of documents more than a predetermined threshold are stopwords; and means for causing a computer to determine whether the extracted words are contained within the data structure.

62. The article of manufacture according to claim 61, wherein the means for causing a computer to evaluate the extracted words further includes: means for causing a computer to obtain words from the data structure related to the extracted words contained in the data structure; means for causing a computer to add the extracted words contained in the data structure to a primary query term list; means for causing a computer to add the determined words to an auxiliary query term list; and means for causing a computer to add the extracted words that are not contained in the data structure and that are not stopwords to a secondary query term list.

63. The article of manufacture according to claim 62, wherein the means for causing a computer to form a search query includes means for causing a computer to form a search query based upon words contained in the primary query term list, secondary query term list, and auxiliary query term list satisfying a predetermined selective criteria.

64. An article of manufacture storing a data structure formed by the computer-executable method comprising the steps of: selecting, in accordance with predetermined selective criteria, a set D of descriptive words that occur within a sample of documents; determining, for each descriptive word d in set D, a subset Wd of set D representing the words related to the descriptive word d in accordance with predetermined selective criteria; and constructing as the data structure a similarity matrix containing similarity values for pairs of descriptive words for each word d in set D and corresponding set Wd.

65. The article of manufacture according to claim 64, wherein the step of determining the subset Wd, for each word d of set D, comprises the substeps of: composing a first query vector representing word d with a first map to form a first result vector representing the frequencies of words occurring in documents that contain d; forming a second query vector containing words from the result vector having frequencies exceeding a threshold value k1; composing the second query vector with a second map to form a second result vector containing values representing the degree that words in set D relate to words of the second query vector; and forming set Wd from the words contained in the second result vector corresponding to values exceeding a threshold value k2.

66. The article of manufacture according to claim 65, wherein the step of constructing a similarity matrix comprises the substeps of: determining a similarity measure for each pair of words from set D and set Wd; and adding to the similarity matrix those pairs of words having similarity measures exceeding a threshold.

67. The article of manufacture according to claim 66, wherein the substep of determining a similarity measure comprises the substep of performing the following equation: cos (d1, d2) = sqr (sum (d1 * d2)) / sqr (sqr (sum (d1 * d2)) * (sqr (sum (d1 * d2)))) with respect to the first map, where d1 is a word from set D, d2 is a word from set Wd, sqr is the square root function, (d1 * d2) represents products of corresponding values in the first map for d1 and d2, and sum is the summation of the products.

68. The article of manufacture according to claim 64, wherein the method further comprises the steps of: compressing the similarity matrix; and generating an index containing words and their addresses in the similarity matrix.