CA2423476A1

CA2423476A1 - Extended functionality for an inverse inference engine based web search

Info

Publication number: CA2423476A1
Application number: CA002423476A
Authority: CA
Inventors: Giovanni B. Marchisio
Original assignee: Individual
Current assignee: VCVC III LLC
Priority date: 2000-09-25
Filing date: 2001-09-25
Publication date: 2002-04-04
Anticipated expiration: 2021-09-25
Also published as: EP1323067A1; EP1323067A4; CA2423476C; WO2002027536A1; AU2001296304A1

Abstract

An extension of an inverse inference search engine (Fig. 1) provides cross language document retrieval, in which the information matrix (52) used as input to the inverse inference engine is organized into rows of blocks (58) corresponding to languages within a predetermined set of natural languages. The information matrix (52) is organized into two column-wise partitions (60). The first partition consists of blocks of entries representing fully translated documents, while the second partition is a matrix of blocks of entries representing documents for which translations are not available in all of the predetermined languages.
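The block layout described in the abstract can be illustrated with a minimal Python sketch. The function name `build_information_matrix`, its argument names, and the whitespace tokenization are assumptions made for illustration only and do not appear in the source. Rows are grouped by language, fully translated documents come first, and partially translated documents follow with zero-filled blocks for any missing language.

```python
from collections import Counter


def build_information_matrix(vocab_by_lang, translated_docs, partial_docs):
    """Assemble a term-by-document count matrix with language row blocks.

    vocab_by_lang:   {"en": ["cat", ...], "fr": ["chat", ...]} (hypothetical input)
    translated_docs: list of {lang: text}, one text for every language
    partial_docs:    list of {lang: text}, some languages may be missing
    Returns (row_labels, matrix, n_translated), where the first
    n_translated columns form the first column-wise partition.
    """
    languages = list(vocab_by_lang)
    for doc in translated_docs:
        if set(doc) != set(languages):
            raise ValueError("a fully translated document needs every language")

    # One row per (language, term); rows of the same language stay contiguous.
    row_labels = [(lang, term) for lang in languages for term in vocab_by_lang[lang]]

    columns = []
    for doc in translated_docs + partial_docs:
        counts = {lang: Counter(text.lower().split()) for lang, text in doc.items()}
        # A language absent from the document contributes an all-zero block.
        columns.append([counts.get(lang, Counter())[term] for lang, term in row_labels])

    matrix = [list(row) for row in zip(*columns)]  # terms as rows, documents as columns
    return row_labels, matrix, len(translated_docs)


vocab = {"en": ["cat", "dog"], "fr": ["chat", "chien"]}
translated = [{"en": "cat dog", "fr": "chat chien"}]
partial = [{"en": "cat cat"}, {"fr": "chien"}]
row_labels, matrix, n_translated = build_information_matrix(vocab, translated, partial)
```

In this toy example the first column is the single translated document, and the remaining two columns are monolingual documents whose other-language blocks are zero.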

Claims

1. An information retrieval method comprising the steps of:
generating a term-document matrix to represent electronic information files stored in a computer system, each element in said term-document matrix indicating a number of occurrences of a term within a respective one of said electronic information files, wherein said term-document matrix includes a first partition, said first partition including entries representing at least a first version and a second version of at least one reference document within said electronic information files, wherein said first version of said reference document is in a first natural language and said second version of said reference document is a translation of said first version of said reference document into a second natural language, and wherein said term-document matrix further includes a second partition, elements in said second partition representing at least one target document within said electronic information files, wherein said target document is in one of the set of natural languages consisting of said first natural language and said second natural language;
generating, responsive to said term-document matrix, a term-spread matrix, wherein said term spread matrix is a weighted autocorrelation of said term-document matrix, said term-spread matrix indicating an amount of variation in term usage in the information files and, also, the extent to which terms are correlated;
receiving a user query from a user, said user query consisting of at least one term;
in response to said user query, generating a user query vector, wherein said user query vector has as many elements as the rows of the term-spread matrix;
generating, responsive to said user query vector, an error-covariance matrix, wherein said error-covariance matrix reflects an expected degree of uncertainty in the initial choice of keywords of said user;
formulating, responsive to said term-spread matrix, error-covariance matrix, and user query vector, a constrained optimization problem, wherein the choice of a lambda value equal to a Lagrange multiplier value in said constrained optimization problem determines the extent of a trade-off between a degree of fit and the stability of all solutions to said constrained optimization problem;
generating, responsive to said constrained optimization problem, a solution vector including a plurality of document weights, each one of said plurality of document weights corresponding to one of said target documents, wherein each of said document weights reflects a degree of correlation between said user query and the corresponding one of said target documents; and
providing an information response to said user reflecting said document weights, wherein at least one of said document weights is positive and at least one of said document weights is negative, wherein said positive document weights represent the relevance of selected ones of said target documents in said first natural language to said user query, and wherein absolute values of said negative document weights represent the relevance of selected ones of said target documents in said second natural language to said user query.
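The sequence of steps in claim 1 can be sketched in Python under several loudly stated assumptions that are not taken from the source: the term-spread matrix is computed with unit weights as D·Dᵀ, the error-covariance matrix defaults to the identity, and the constrained optimization problem is replaced by a Tikhonov-style regularized least-squares problem whose solution is x = Dᵀ(D·Dᵀ + λC)⁻¹q. The function names are hypothetical, and the sketch is not presented as the claimed formulation.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def document_weights(term_doc, query, lam=0.1, error_cov=None):
    """Return one weight per document column of term_doc (terms x documents).

    Assumed model: weights = D^T (D D^T + lam * C)^-1 q, with lam > 0.
    """
    n_terms = len(term_doc)
    spread = matmul(term_doc, transpose(term_doc))  # term-spread, unit weights (assumption)
    if error_cov is None:
        # Identity error covariance: equal uncertainty on every query term (assumption).
        error_cov = [[1.0 if i == j else 0.0 for j in range(n_terms)]
                     for i in range(n_terms)]
    system = [[spread[i][j] + lam * error_cov[i][j] for j in range(n_terms)]
              for i in range(n_terms)]
    y = solve(system, query)
    return [sum(row[j] * y[i] for i, row in enumerate(term_doc))
            for j in range(len(term_doc[0]))]


# Rows: cat, dog, chat, chien. Columns: two translated reference documents
# (English and French blocks both filled), then an English and a French target.
D = [
    [1, 0, 2, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 2],
    [0, 1, 0, 0],
]
query = [1, 0, 0, 0]  # the single English term "cat"
weights = document_weights(D, query, lam=0.1)
```

In this toy setup the English target document receives the largest positive weight, the French target document receives a smaller negative weight, and the unrelated reference document stays at zero. A larger `lam` damps all weights toward zero, which is the fit-versus-stability trade-off in this assumed model.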

2. The method of claim 1, wherein said providing said information response further comprises organizing display objects representing said target documents associated with said document weights according to the sign of each of said document weights, whereby said documents in said first natural language are displayed in proximity to each other and documents in said second natural language are displayed in proximity to each other.

3. The method of claim 2, wherein said providing said information response further comprises organizing said display objects representing documents associated with said document weights according to the absolute value of each of said document weights, such that said display objects are displayed in decreasing absolute value of associated document weight.
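The ordering described in claims 2 and 3 can be sketched as a two-level sort: group by the sign of the weight, then order each group by decreasing absolute value. The function name `order_for_display` and the document identifiers are hypothetical, and placing zero weights in the positive group is an assumption the claims do not address.

```python
def order_for_display(weights_by_doc):
    """Return document ids grouped by weight sign, largest magnitude first.

    weights_by_doc: {doc_id: weight}. Non-negative weights form the first
    group and negative weights the second (zero handling is an assumption).
    """
    def sort_key(item):
        _, weight = item
        return (weight < 0, -abs(weight))

    return [doc_id for doc_id, _ in sorted(weights_by_doc.items(), key=sort_key)]


example_weights = {"en_1": 0.41, "fr_1": -0.08, "en_2": 0.16, "fr_2": -0.30}
display_order = order_for_display(example_weights)
```

Here the two positively weighted documents are listed together ahead of the two negatively weighted ones, each group ranked by magnitude.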

4. The method of claim 1, wherein said step of generating said term-document matrix includes generating elements in said matrix reflecting the number of occurrences of each one of said terms in each one of said information files.

5. The method of claim 1, wherein rows of said term-document matrix are each associated with a respective term, and wherein a first set of said rows are associated with terms in said first natural language, and a second set of said rows are associated with terms in said second natural language.

6. The method of claim 5, wherein said first partition includes entries representing at least a first version and a second version of said at least one reference document, wherein said first version of said reference document is in said first natural language, and said second version of said reference document is a translation of said first version of said reference document into said second natural language.

7. The method of claim 1, wherein said second version of said reference document is another document that is topically related to said first version of said reference document.

8. The method of claim 1, wherein said term-document matrix is one of a plurality of term document matrices, each of said plurality of term document matrices associated with a translation from a source language to a target foreign language, and wherein said first natural language comprises said source language and said second natural language comprises said target natural language.

9. An information retrieval method comprising the steps of:
generating a term-document matrix to represent electronic information files stored in a computer system, each element in said term-document matrix indicating a number of occurrences of a term within a respective one of said electronic information files, wherein said term-document matrix includes a first partition, said first partition including entries representing at least one reference document within said electronic information files, wherein said reference document is predetermined to contain reliable information, and wherein said term-document matrix further includes a second partition, elements in said second partition representing a plurality of search documents within said electronic information files, wherein said search documents are predetermined to contain insufficient information for establishing semantic links;
generating, responsive to said term-document matrix, a term-spread matrix, wherein said term spread matrix is a weighted autocorrelation of said term-document matrix, said term-spread matrix indicating an amount of variation in term usage in the information files and, also, the extent to which terms are correlated;
receiving a user query from a user, said user query consisting of at least one term;
in response to said user query, generating a user query vector, wherein said user query vector has as many elements as the rows of the term-spread matrix;
generating, responsive to said user query vector, an error-covariance matrix, wherein said error-covariance matrix reflects an expected degree of uncertainty in the initial choice of keywords of said user;
formulating, responsive to said term-spread matrix, error-covariance matrix, and user query vector, a constrained optimization problem, wherein the choice of a lambda value equal to a Lagrange multiplier value in said constrained optimization problem determines the extent of a trade-off between a degree of fit and the stability of all solutions to said constrained optimization problem;
generating, responsive to said constrained optimization problem, a solution vector including a plurality of document weights, each one of said plurality of document weights corresponding to one of said plurality of search documents, wherein each of said document weights reflects a degree of correlation between said user query and the corresponding one of said plurality of search documents; and
providing an information response to said user reflecting said document weights.

10. The method of claim 9, further comprising periodically accumulating information from multiple sources, and adding said information to said search documents.

11. The method of claim 8, wherein said reference document comprises an encyclopedia.

12. The method of claim 8, wherein said reference document comprises a collection of news reports.