WO1997049045A1 - Apparatus and method for generating optimal search queries - Google Patents
Apparatus and method for generating optimal search queries
- Publication number
- WO1997049045A1 (PCT/IB1997/000743)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- words
- computer
- data structure
- word
- causing
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
Definitions
- the present invention relates to text searching of large databases. More particularly, the present invention relates to apparatus and methods for generating optimal queries for such searching
- a public network such as the Internet, links together not only individual computer systems, but also local computer networks, thereby creating a larger network that provides access to a massive amount of information
- the volume of information that a user can access makes managing and identifying specific
- a search query contains key terms reflecting specific words or the types of information the user is interested in accessing.
- a search engine identifies and retrieves
- On the other hand, if a user omits words having the same meaning as a term in the search query, the user may never find information of interest.
- the present invention is directed to an apparatus and method that improves the likelihood of finding all desired information and decreases the likelihood of finding irrelevant information.
- the invention includes a method of generating a search query to identify one of a set of documents whose subject matter relates to that of a search document containing information of interest
- the method comprises the computer-executable steps of selecting a sample of documents to represent the set of documents and creating a data structure revealing relationships among words within the sample of documents
- the method also includes the step of generating a search query for searching the set of documents based upon the relationship of the search document to the data structure.
- the invention includes an apparatus for generating a search query to identify one of a set of documents whose subject matter relates to that of a search document containing information of interest.
- the apparatus comprises means for selecting a sample of documents to represent the set of documents; means for creating a data structure revealing relationships among words within the sample of documents; and means for generating a search query for searching the set of documents based upon the relationship of the search document to the data structure
- Fig. 1 is a flow chart illustrating an overview of the preferred implementation for generating optimal search queries
- Fig. 2 is a block diagram of an exemplary computer for carrying out the methods of the invention
- Fig. 3 is a flow chart illustrating the preferred implementation for creating the term-pair data structure according to step 100;
- Figs. 4A-4B show similarity matrix M and index, respectively, formed by the method represented by the flow chart shown in Fig. 3;
- Fig. 5 is a representation of the preferred implementation for determining a set of descriptive words Wd associated with each descriptive word, according to step 310;
- Fig. 6 is a representation of the preferred implementation for creating matrix 510 shown in Fig. 5;
- Fig. 7 is a flow chart illustrating the preferred implementation for generating a search query using the data structure according to step 110.
- Best Mode for Carrying Out the Invention: Overview
- the invention serves to identify information whose subject matter is the same or similar to that of information of interest.
- the information may be contained in discrete elements, or "documents."
- the term "document" refers to any files or portions of such files containing electronically-stored data, including text files and Web pages.
- the invention takes advantage of knowledge discovery techniques described in U.S. patent application Serial No. entitled "Method and System for Revealing
- the invention facilitates the retrieval of documents containing text whose content relates to text contained in an accessed document, or "search document."
- the invention extracts information from the search document and applies the extracted information to a data structure to generate a search query containing terms.
- the retrieved documents are preferably accessible by a network, such as the Internet, and may be stored in databases.
- a legal database accessible on the Internet could store documents containing data that represents court opinions.
- the invention could create a search query for identifying documents within that legal database related to a court opinion of interest.
- a term-pair data structure is created (step 100).
- the data structure reveals the likelihood that, given a sample of documents, different pairs of words occur in any single document.
- Fig. 2 shows computer 200 for carrying out the methods of the invention.
- computer 200 preferably creates the term-pair data structure (step 100 of Fig. 1), as described in greater detail in connection with Figs. 3-6, and generates a search query (step 110 of Fig. 1), as described in greater detail in connection with Fig. 7.
- one computer could be used to create the data structure and another to generate search queries.
- Computer 200 is preferably a general-purpose personal computer programmed to perform the functions described herein and, in particular, those described in connection with Figs. 1 and 3-7. As shown in Fig. 2, computer 200 comprises processor 210, input device 220, random access memory (RAM) 230, output device 240, storage device 250, and a network link 260. Processor 210 is connected to each of the other elements of computer 200.
- Processor 210 may be any commercially-available processor capable of running software for performing the functions described above.
- processor 210 is a high-speed processor, such as an Intel Pentium™ processor.
- Input device 220 may be a keyboard, mouse, or any other device for inputting data from the user to computer 200.
- RAM 230 stores program instructions to be executed by processor 210.
- Output device 240, which may be a display or printing device, outputs data from computer 200 to the user.
- Storage device 250 stores program code representing the functions described in connection with Figs. 1 and 3-7, which is loaded into RAM 230 for execution by processor 210.
- Storage device 250 also stores the term-pair data structure used to generate the optimal search queries, namely a similarity matrix and an index, as described below.
- Storage device 250 includes floppy disks, hard disks, optical disks, CD ROMs, tapes, other magnetic media, programmable semiconductor devices, hard-wired circuitry, and any other medium for storing data.
- storage device 250 may also store program code representing a search engine receiving the optimal search query as input.
- the search engine may be any conventional program capable of searching for documents on a network based upon a search query.
- Network link 260 connects computer 200 to a network, such as the Internet. Through link 260, computer 200 can access documents on the network.
- Link 260 includes optical fiber, coaxial cable, telephone wire, wireless transmission, and any other connection for linking a computer to a network.
- although computer 200 is described above as a programmed general-purpose computer, in alternative implementations, any machine capable of performing the functions of the methods can be used to carry out the methods of the invention.
- the data structure created in step 100 and used in step 110 includes a similarity matrix M and an index.
- the similarity matrix M contains data revealing the likelihood that different pairs of words occur within a single document belonging to a sample of documents.
- the index stores the addresses of the word pairs stored in the matrix.
- the index can be used to retrieve data from the similarity matrix more efficiently.
- the similarity matrix and the index will be described in greater detail below in connection with Figs. 4A and 4B.
- Fig. 3 is a flow chart illustrating a series of steps, in accordance with a preferred implementation, for creating the data structure used to generate search queries.
- the steps shown in Fig. 3 can be carried out by a programmed general-purpose personal computer, such as computer 200 shown in Fig. 2, or any other suitable machine capable of performing the functions described herein.
- Computer 200 accesses a sample of documents containing a representative number of words (step 300). For example, computer 200 can access a set of documents available on the Internet relating to a variety of different subject matters. These documents should contain a sufficient number of different words to yield a similarity matrix robust enough to meet the requirements of searching.
- Computer 200 identifies from the document sample a set D of "descriptive" words d to serve as a basis for distinguishing the documents (step 300).
- the test for "descriptiveness" is frequency. Therefore, set D preferably includes those words occurring in the documents more than a minimum number of times, Tmin, and less than a maximum number, Tmax, since words that appear too few or too many times do not distinguish documents sufficiently.
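The frequency test described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `descriptive_words`, the regular-expression tokenizer, and the choice to count total occurrences across the sample (rather than the number of documents containing a word) are assumptions.

```python
import re
from collections import Counter


def descriptive_words(documents, t_min, t_max):
    """Return set D: words occurring more than t_min and fewer than t_max times.

    Assumption: frequency means total occurrences across the whole sample.
    """
    counts = Counter()
    for text in documents:
        # Hypothetical tokenizer: lowercase alphabetic runs only.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return {word for word, n in counts.items() if t_min < n < t_max}


sample = [
    "the court ruled on the patent appeal",
    "the patent office granted the claim",
    "the appeal court reversed the ruling",
]
D = descriptive_words(sample, t_min=1, t_max=5)
print(sorted(D))
```

In this toy sample, "the" is excluded for occurring too often and words seen only once are excluded for occurring too rarely.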
- computer 200 determines a subset Wd of set D that is closely associated with each "descriptive" word d contained within set D (step 310).
- "closely associated" means, for each word d in set D, those descriptive words in subset Wd having a discrimination value exceeding a certain threshold in the set of documents that contain word d, the threshold being determined empirically according to such factors as the speed of computer 200 and the size of the database being searched. Identifying a set Wd for each "descriptive" word d reveals inherent relationships that exist among the descriptive words and the documents within the sample.
- Fig. 5 illustrates a representation of this analysis, which is performed for each "discrimination" word d.
- this analysis includes vectors 500, 520, 530, 550, and 560 and maps 510 and 540.
- Query vector 500 represents a "discrimination" word d. In other words, the only non-zero value corresponds to the word d.
- Map 510 represents a "descriptive words"-to-"words" map containing values representing the degree to which descriptive words are related to words, whether descriptive or nondescriptive, occurring in the sample of documents, based on the occurrence of the words in the documents containing the descriptive words. For example, pairs of words occurring together frequently in documents will correspond to higher values, while pairs of words occurring together infrequently will correspond to lower values.
- Maps 510 and 540 are formed using discrimination analysis, a representation of which is shown in Fig. 6. As seen in Fig. 6, this analysis includes vectors 600, 620, 640, 650, and 660 and maps 610 and 630.
- Query vector 600 represents a single word occurring in the documents of the sample.
- Map 610 is a word-to-document map of entries representing the number of each word appearing in the corresponding documents of the sample and can be formed by counting the occurrences of words in the documents of the sample.
- Computer 200 "composes" vector 600 with map 610 to form result vector 620 indicating the documents containing the word from vector 600.
- Computer 200 then composes vector 620 with document-to-word map 630, the inverse of map 610, to form result vector 640 representing the frequency that words occur in documents that contain the word of vector 600.
- Computer 200 normalizes vector 640 using profile vector 650 to form discrimination vector 660.
- Discrimination vector 660 contains discrimination values for respective words as they appear in the documents of vector 620. Performing discrimination analysis for each word yields a set of discrimination vectors forming the respective rows of map 510.
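The composition steps of Fig. 6 can be illustrated with a small sketch. The text does not define profile vector 650 or the normalization, so the version below assumes the profile is each word's total count across the sample and that normalizing means dividing by it. It also treats vector 620 as a plain set of documents, ignoring how often the query word occurs in each. The function names are hypothetical.

```python
from collections import Counter


def word_to_doc_map(docs):
    """Map 610: word -> {document index: occurrence count}."""
    mapping = {}
    for i, words in enumerate(docs):
        for word, n in Counter(words).items():
            mapping.setdefault(word, {})[i] = n
    return mapping


def discrimination_vector(word, docs):
    """Sketch of vectors 620-660 for one query word (vector 600)."""
    w2d = word_to_doc_map(docs)
    containing = w2d.get(word, {})      # vector 620: documents containing the word
    co_counts = Counter()               # vector 640: word counts in those documents
    for i in containing:
        co_counts.update(docs[i])
    # Assumed profile vector 650: total count of each word in the whole sample.
    profile = Counter(w for words in docs for w in words)
    # Assumed normalization: share of each word's occurrences that fall in
    # documents containing the query word (vector 660).
    return {w: co_counts[w] / profile[w] for w in co_counts}


docs = [["patent", "court", "appeal"], ["patent", "claim"], ["court", "ruling"]]
vec = discrimination_vector("patent", docs)
print(vec)
```

Under these assumptions a value near 1 means a word appears almost exclusively alongside the query word, while a lower value means it is spread across other documents as well.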
- computer 200 composes vector 500 with map 510 to produce result vector 520, containing values representing the relative frequency with which words occur in documents that contain the descriptive word of vector 500.
- Computer 200 retains the k1 number of words having the highest relative frequency to form query vector 530.
- Computer 200 composes vector 530 with map 540, which is the inverse of map 510, to form result vector 550.
- Vector 550 contains values representing the degree to which descriptive words are associated to the k1 number of words associated with the descriptive word of vector 500.
- Computer 200 selects the k2 number of descriptive words in vector 550 having the highest discrimination values to form vector 560 and set Wd. Both k1 and k2 can be chosen empirically to provide a sufficient degree of discrimination.
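The two-stage selection with k1 and k2 might look like the sketch below. It assumes map 510 is stored as a dictionary of rows (descriptive word to word values), that map 540 is the same data read column-wise, and that "composing" a vector with a map means a weighted sum. None of these representation details are specified in the text, and the names are illustrative.

```python
import heapq


def top_k(vector, k):
    """Keep the k entries of a {word: value} vector with the highest values."""
    return dict(heapq.nlargest(k, vector.items(), key=lambda kv: kv[1]))


def associated_set(d, map_510, k1, k2):
    """Sketch of vectors 520-560: derive set Wd for descriptive word d."""
    vector_520 = map_510[d]                 # row of map 510 selected by vector 500
    vector_530 = top_k(vector_520, k1)      # the k1 most strongly related words
    vector_550 = {}
    for word, weight in vector_530.items():
        # Map 540 (assumed): map 510 read column-wise, word -> descriptive words.
        for d2, row in map_510.items():
            if word in row:
                vector_550[d2] = vector_550.get(d2, 0.0) + weight * row[word]
    return set(top_k(vector_550, k2))       # vector 560 / set Wd


map_510 = {
    "patent": {"patent": 1.0, "claim": 1.0, "court": 0.5},
    "court": {"court": 1.0, "ruling": 1.0, "patent": 0.5},
    "claim": {"claim": 1.0, "patent": 0.5},
}
Wd = associated_set("patent", map_510, k1=2, k2=2)
print(sorted(Wd))
```

Whether Wd should contain d itself is not stated in the text; this sketch leaves it in.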
- Matrix M contains data about pairs of words, each pair including words d1 from set D and words d2 from corresponding set Wd whose similarity measures exceed a predetermined threshold.
- a similarity measure indicates the degree to which d1 and d2 are similar, based on the common occurrence of the words that best discriminate documents that contain words d1 and d2, respectively.
- $\cos(d_1, d_2) = \dfrac{\sum (d_1 \cdot d_2)}{\sqrt{\sum (d_1 \cdot d_1)}\,\sqrt{\sum (d_2 \cdot d_2)}}$
- cos(d1, d2) equals 1 when words d1 and d2 have the highest degree of similarity because words that distinguish documents in which each appear are the same.
- cos(d1, d2) equals 0 when words d1 and d2 have no similarity because the words that distinguish documents in which each appear are completely different.
- the similarity matrix only includes pairs with similarity measures exceeding a threshold, such as 0.25.
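As a rough illustration of the similarity measure and the threshold, the sketch below computes a standard cosine between two sparse vectors and keeps only pairs above 0.25. Representing each word by a dictionary of discrimination values, and the names `cosine` and `similarity_entries`, are assumptions made for the example.

```python
import math


def cosine(v1, v2):
    """Standard cosine similarity between two sparse {word: value} vectors."""
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def similarity_entries(vectors, associated, threshold=0.25):
    """Keep pairs (d1, d2), d2 drawn from Wd of d1, whose cosine exceeds the threshold."""
    entries = {}
    for d1, wd in associated.items():
        for d2 in wd:
            sim = cosine(vectors[d1], vectors[d2])
            if sim > threshold:
                entries[(d1, d2)] = sim
    return entries


vectors = {
    "patent": {"claim": 1.0, "court": 1.0},
    "appeal": {"court": 1.0},
    "recipe": {"flour": 1.0},
}
associated = {"patent": {"appeal", "recipe"}}
entries = similarity_entries(vectors, associated)
print(entries)
```

Here "recipe" shares no discriminating words with "patent", so its cosine is 0 and the pair is dropped.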
- to access data efficiently from the similarity matrix, computer 200 preferably sorts the entries (e.g., pairs of d1 and d2 having similarity measures exceeding a threshold) according to the frequency of word d2.
- Computer 200 preferably records the frequency of each word d2 added to the similarity matrix and sorts the entries in the matrix by frequency of word d2 (i.e., from word d2 occurring the most frequently to the least frequently).
- Computer 200 then assigns to each word d2 an integer identifier that increases in value as the frequency of word d2 decreases.
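The identifier assignment can be sketched as below. The alphabetical tie-break and the function name are assumptions; the text only requires that identifiers grow as the frequency of d2 falls.

```python
from collections import Counter


def assign_identifiers(pairs):
    """Give each d2 an integer id: 0 for the most frequent d2, larger ids for rarer ones."""
    freq = Counter(d2 for _, d2 in pairs)
    # Sort by descending frequency; ties broken alphabetically (an assumption).
    ordered = sorted(freq, key=lambda w: (-freq[w], w))
    return {word: i for i, word in enumerate(ordered)}


pairs = [("a", "x"), ("b", "x"), ("c", "y"), ("a", "z"), ("b", "z"), ("c", "z")]
ids = assign_identifiers(pairs)
print(ids)
```

One plausible motivation, not stated in the text, is that the most common words receive the smallest integers, which tend to compress well.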
- the similarity matrix can be compressed using any conventional data compression technique to conserve the memory in which the similarity matrix is stored.
- the similarity matrix is compressed using blockwise prefix omitted encoding, as described in U.S. patent application Serial No , previously incorporated by reference.
- the index, shown in Fig. 4B, preferably comprises a list of entries, each including a word d1, an identifier representing a word d2, and the corresponding address in the similarity matrix. This index permits efficient access of data from the similarity matrix.
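A toy version of the matrix-plus-index layout is sketched below. Storing the matrix as a flat list and using list positions as "addresses" are simplifying assumptions, and compression is omitted entirely.

```python
def build_matrix_and_index(entries, ids):
    """entries: {(d1, d2): similarity}; ids: {d2: integer identifier}.

    Returns a flat list of similarity values and an index mapping
    (d1, id of d2) -> position in that list.
    """
    ordered = sorted(entries.items(), key=lambda kv: (ids[kv[0][1]], kv[0][0]))
    matrix = [sim for _, sim in ordered]
    index = {(d1, ids[d2]): addr for addr, ((d1, d2), _) in enumerate(ordered)}
    return matrix, index


def lookup(d1, d2, matrix, index, ids):
    """Fetch the similarity of (d1, d2) through the index, or None if absent."""
    addr = index.get((d1, ids.get(d2)))
    return None if addr is None else matrix[addr]


entries = {("patent", "claim"): 0.8, ("court", "appeal"): 0.6, ("patent", "appeal"): 0.7}
ids = {"appeal": 0, "claim": 1}
matrix, index = build_matrix_and_index(entries, ids)
print(lookup("patent", "claim", matrix, index, ids))
```

The lookup never scans the matrix; it goes straight from the word pair to an address, which is the efficiency the index is meant to provide.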
- the sample of documents serving as a basis for matrix M and the index relate to a variety of different topics and subject matters.
- matrix M and the index can be used, as described below, to generate search queries for documents of that variety of topics and subject matters.
- the sample of documents serving as the basis for matrix M and the index relate to a specific topic (e.g., medical records), in which case the search query generated from such a matrix M would be useful in obtaining documents of that topic (e.g., medical records).
- matrix M and the index once generated, can be used multiple times to generate search queries, as discussed below.
- matrix M and the index are regenerated periodically from an updated sample of documents representative of a current state of the documents to be searched.
- the generated search query preferably includes one or more search terms that serve as the input to a search engine that searches databases or storage devices, such as ones accessible on the Internet, for documents containing those terms.
- the search can employ any commercially-available search engine suitable for performing those functions.
- the search engine performs relevance ranking, so as to provide those documents matching the search query in the order of documents containing the greatest number of terms in the search query to the least.
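The relevance ranking described here, ordering by the number of query terms a document contains, can be illustrated with a short sketch. Whitespace tokenization, the document names, and the alphabetical tie-break are assumptions for the example, not features of any particular search engine.

```python
def rank_by_matches(query_terms, documents):
    """Order matching documents from most to fewest distinct query terms contained."""
    terms = set(query_terms)
    scored = []
    for name, text in documents.items():
        matches = len(terms & set(text.lower().split()))
        if matches > 0:  # documents matching no term are not returned
            scored.append((matches, name))
    return [name for matches, name in sorted(scored, key=lambda p: (-p[0], p[1]))]


documents = {
    "a": "patent appeal court",
    "b": "patent law",
    "c": "cooking recipes",
}
ranking = rank_by_matches(["patent", "court", "appeal"], documents)
print(ranking)
```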
- Fig. 7 is a flow chart illustrating the preferred implementation for generating search queries using the data structure. Like the steps shown in Fig. 3, the steps shown in Fig. 7 can also be carried out by a programmed general-purpose computer similar to computer 200 shown in Fig. 2, or any other machine capable of performing the functions described herein.
- computer 200 accesses a "search" document containing information of interest available on, for example, the Internet (step 210).
- a user selects a document or a portion of a document through a user interface provided by computer 200. The user can request additional documents similar in content to the one he is accessing, which becomes the "search" document.
- Computer 200 extracts words from the search document (step 700) and evaluates each extracted word (steps 705, 735, and 745) and generates the search query based upon this evaluation. Specifically, computer 200 determines whether each extracted word is a "stopword" (step 710) and whether it is in the index (step 715).
- a "stopword" refers to a word that is particularly poor at distinguishing documents. These necessarily include words such as "the" and "an", which occur in virtually every document, as well as words that serve as function words in a particular document collection, such as the words "information" and "address" in Web documents.
- computer 200 adds the extracted word to a secondary query term list (step 720). Words that are in the index are added to a primary query term list (step 725). As described below, the words in the primary and secondary query term lists will be added to other words from an auxiliary query term list to form the search query.
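The word-by-word evaluation in steps 710 to 725 can be sketched as follows. Discarding stopwords outright and removing duplicates are assumptions, since the excerpt does not say how either case is handled. The function name is illustrative.

```python
def classify_words(words, stopwords, indexed_words):
    """Split extracted words into primary (in the index) and secondary (not in it)."""
    primary, secondary = [], []
    for word in words:
        if word in stopwords:           # step 710 (assumed: stopwords are dropped)
            continue
        target = primary if word in indexed_words else secondary  # step 715
        if word not in target:          # keep each term once, in document order
            target.append(word)
    return primary, secondary


words = ["the", "patent", "appeal", "zymurgy", "patent"]
primary, secondary = classify_words(words, {"the"}, {"patent", "appeal"})
print(primary, secondary)
```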
- computer 200 retrieves from the similarity matrix the words associated with the extracted word and their respective scores (step 730), calculated by multiplying the similarity measure retrieved from the data structure with a normalized frequency, which is computed according to the formula (1 + log2 (freq)), where freq is the number of times that the word appears in the document being processed.
- Computer 200 places the extracted words into an auxiliary query list (step 730). For extracted words that are associated with more than one primary query term, the final score is computed by adding all of the individual scores associated with the individual primary query terms.
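The scoring rule above, similarity multiplied by (1 + log2(freq)) and summed across primary terms, can be worked through in a few lines. The sketch assumes freq is the count of the primary term in the search document, and that the similarity data is available as a nested dictionary. Both are interpretations rather than details given in the text.

```python
import math
from collections import Counter


def auxiliary_scores(primary_terms, doc_words, similar):
    """similar: {primary term: {associated word: similarity measure}}."""
    freq = Counter(doc_words)
    scores = {}
    for term in primary_terms:
        if freq[term] == 0:
            continue  # a primary term should occur in the document; skip otherwise
        weight = 1 + math.log2(freq[term])      # normalized frequency
        for word, sim in similar.get(term, {}).items():
            # Words tied to several primary terms accumulate their scores.
            scores[word] = scores.get(word, 0.0) + sim * weight
    return scores


doc_words = ["patent"] * 4 + ["court"]
similar = {
    "patent": {"claim": 0.5, "invention": 0.25},
    "court": {"claim": 0.5},
}
scores = auxiliary_scores(["patent", "court"], doc_words, similar)
print(scores)
```

With "patent" occurring four times the weight is 1 + log2(4) = 3, so "claim" scores 0.5 × 3 from "patent" plus 0.5 × 1 from "court", giving 2.0.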
- computer 200 constructs a query that has a number of primary, secondary and auxiliary query terms, where each of these numbers may be set by the user (step 740).
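Query assembly with user-set counts might be sketched as below. Taking primary and secondary terms in document order, taking auxiliary terms by descending score, and dropping repeated terms are all assumptions. The text only says that the number of terms from each list may be set by the user.

```python
def build_query(primary, secondary, aux_scores, n_primary, n_secondary, n_auxiliary):
    """Combine up to n terms from each list into one de-duplicated term list."""
    aux_ranked = sorted(aux_scores, key=lambda w: (-aux_scores[w], w))
    candidates = primary[:n_primary] + secondary[:n_secondary] + aux_ranked[:n_auxiliary]
    query, seen = [], set()
    for term in candidates:
        if term not in seen:
            seen.add(term)
            query.append(term)
    return query


query = build_query(
    primary=["patent", "appeal"],
    secondary=["zymurgy"],
    aux_scores={"claim": 2.0, "invention": 0.75, "patent": 0.1},
    n_primary=2,
    n_secondary=1,
    n_auxiliary=2,
)
print(query)
```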
- the search query includes words that are highly associated with words in the document of interest as well as words that are not in the similarity matrix.
- computer 200 can pass the search query to a search engine, preferably one capable of relevance ranking, to identify documents matching the query, for example, documents accessible on the Internet.
- the search query comprises a list of words selected using the above-described techniques.
- Boolean connectors may be included to define relationships among those words.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU30443/97A AU3044397A (en) | 1996-06-21 | 1997-06-19 | Apparatus and method for generating optimal search queries |
EP97925219A EP0978058A1 (en) | 1996-06-21 | 1997-06-19 | Apparatus and method for generating optimal search queries |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US66754496A | 1996-06-21 | 1996-06-21 | |
US08/667,544 | 1996-06-21 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1997049045A1 (en) | 1997-12-24 |
Family
ID=24678643
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IB1997/000743 WO1997049045A1 (en) | 1996-06-21 | 1997-06-19 | Apparatus and method for generating optimal search queries |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP0978058A1 (en) |
AU (1) | AU3044397A (en) |
WO (1) | WO1997049045A1 (en) |
ZA (1) | ZA975476B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0687987A1 (en) * | 1994-06-16 | 1995-12-20 | Xerox Corporation | A method and apparatus for retrieving relevant documents from a corpus of documents |
-
1997
- 1997-06-19 WO PCT/IB1997/000743 patent/WO1997049045A1/en not_active Application Discontinuation
- 1997-06-19 AU AU30443/97A patent/AU3044397A/en not_active Abandoned
- 1997-06-19 EP EP97925219A patent/EP0978058A1/en not_active Withdrawn
- 1997-06-20 ZA ZA9705476A patent/ZA975476B/en unknown
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111797216A (en) * | 2020-06-28 | 2020-10-20 | 北京百度网讯科技有限公司 | Retrieval item rewriting method, device, equipment and storage medium |
CN111797216B (en) * | 2020-06-28 | 2024-04-05 | 北京百度网讯科技有限公司 | Search term rewriting method, apparatus, device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
ZA975476B (en) | 1998-02-20 |
AU3044397A (en) | 1998-01-07 |
EP0978058A1 (en) | 2000-02-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP4908214B2 (en) | Systems and methods for providing search query refinement. | |
US7243092B2 (en) | Taxonomy generation for electronic documents | |
Gravano et al. | Gloss: text-source discovery over the internet | |
US7233943B2 (en) | Clustering hypertext with applications to WEB searching | |
Agichtein et al. | Learning search engine specific query transformations for question answering | |
US5819258A (en) | Method and apparatus for automatically generating hierarchical categories from large document collections | |
US6654742B1 (en) | Method and system for document collection final search result by arithmetical operations between search results sorted by multiple ranking metrics | |
AU2004276906B2 (en) | Computer aided document retrieval | |
US5987457A (en) | Query refinement method for searching documents | |
US6321220B1 (en) | Method and apparatus for preventing topic drift in queries in hyperlinked environments | |
US6286018B1 (en) | Method and apparatus for finding a set of documents relevant to a focus set using citation analysis and spreading activation techniques | |
US6182091B1 (en) | Method and apparatus for finding related documents in a collection of linked documents using a bibliographic coupling link analysis | |
US7603370B2 (en) | Method for duplicate detection and suppression | |
US8332439B2 (en) | Automatically generating a hierarchy of terms | |
CN101446940B (en) | Method and device of automatically generating a summary for document set | |
US20060200461A1 (en) | Process for identifying weighted contextural relationships between unrelated documents | |
US20080086453A1 (en) | Method and apparatus for correlating the results of a computer network text search with relevant multimedia files | |
US20020174095A1 (en) | Very-large-scale automatic categorizer for web content | |
CN102880623B (en) | Personage's searching method of the same name and system | |
US7340460B1 (en) | Vector analysis of histograms for units of a concept network in search query processing | |
JP2001519952A (en) | Data summarization device | |
Croft et al. | Implementing ranking strategies using text signatures | |
Pôssas et al. | Set-based vector model: An efficient approach for correlation-based ranking | |
Koll | WEIRD: An approach to concept-based information retrieval | |
James et al. | A survey on information retrieval models, techniques and applications |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE GH HU IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG UZ VN YU ZW AM AZ BY KG KZ MD RU TJ TM |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): GH KE LS MW SD SZ UG ZW AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT |
|
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
WWE | Wipo information: entry into national phase |
Ref document number: 1997925219 Country of ref document: EP |
|
NENP | Non-entry into the national phase |
Ref country code: JP Ref document number: 98502603 Format of ref document f/p: F |
|
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
NENP | Non-entry into the national phase |
Ref country code: CA |
|
WWW | Wipo information: withdrawn in national office |
Ref document number: 1997925219 Country of ref document: EP |
|
WWP | Wipo information: published in national office |
Ref document number: 1997925219 Country of ref document: EP |