JP2001511565A

JP2001511565A - Text input processing system using natural language processing techniques

Info

Publication number: JP2001511565A
Application number: JP2000504528A
Authority: JP
Inventors: Simon H. Corston; William B. Dolan; Lucy H. Vanderwende; Lisa Braden-Harder
Original assignee: Microsoft Corp
Current assignee: Microsoft Corp
Priority date: 1997-07-22
Filing date: 1998-07-17
Publication date: 2001-08-14
Anticipated expiration: 2018-07-17
Also published as: EP0996899B1; CN1165858C; EP0996899A1; JP4892130B2; EP0996899B8; JP4738523B2; JP2001511564A; US5933822A; CN1302412A; US6901399B1; JP2010079915A; WO1999005618A1

Abstract

(57) [Abstract] A system (1480) filters documents in a document set retrieved from a document store in response to a query. The system (1480) obtains (1830) a first set of logical forms based on a selected one of the query and the documents in the document set. The system (1480) obtains a second set of logical forms based on another of the query and the documents in the document set. The system (1480) then uses natural language processing techniques to modify (1832, 1834) the first set of logical forms, obtaining a modified set of logical forms. The system (1480) filters (1836) the documents in the document set based on a predetermined relationship between the modified set of logical forms and the second set of logical forms.

Description

DETAILED DESCRIPTION OF THE INVENTION

BACKGROUND OF THE INVENTION

[0001] The present invention deals with the processing of text inputs. More specifically, the present invention relates to the use of natural language processing techniques to determine the similarity of text inputs. The invention is useful in a wide variety of applications, such as information retrieval, machine translation, natural language understanding, and document similarity/clustering. However, for purposes of illustration only, the invention is described primarily in the context of information retrieval.

[0002] In general, information retrieval is the process by which a user finds and retrieves information relevant to that user from a large store of information. In performing information retrieval, it is important to retrieve all of the information the user needs (that is, it is important to be complete), and at the same time it is important to limit the irrelevant information retrieved for the user (that is, it is important to be selective). These dimensions are often referred to in terms of recall (completeness) and precision (selectivity). In many information retrieval systems, it is important to perform well across both the recall and the precision dimensions.
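The two dimensions named above, completeness and selectivity, are conventionally measured as recall and precision. The following is a minimal sketch of the standard definitions, not something taken from the patent text; the document identifiers are hypothetical and serve only as an illustration.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for a single query."""
    hits = len(retrieved & relevant)  # relevant documents that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0  # selectivity
    recall = hits / len(relevant) if relevant else 0.0       # completeness
    return precision, recall


# Hypothetical document ids, for illustration only.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
precision, recall = precision_recall(retrieved, relevant)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were retrieved, so a system can score well on one dimension while doing poorly on the other.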

[0003] Some current retrieval systems can query and search vast amounts of information. For example, some information retrieval systems are set up to search for information on the Internet, on digital video discs, and in other general computer databases. Information retrieval systems are typically embodied as, for example, Internet search engines and library catalog search engines. In addition, certain information retrieval mechanisms are provided within the operating systems of conventional desktop computers. For example, some operating systems provide a tool that lets a user search all files in a given database or on a computer system based on certain terms entered by the user.

[0004] Many information retrieval techniques are known. In such techniques, the user input query is typically presented either as an explicit user-generated query or as an implicit query, such as when a user requests documents similar to a set of existing documents. Typical information retrieval systems search documents in a large data store at either the single-word level or the term level. Each document is assigned a relevance (similarity) score, and the information retrieval system presents a subset of the searched documents to the user, typically the subset whose relevance scores exceed a given threshold.

[0005] Conventional statistical search engines tend to be imprecise because they assume that words are independent variables, that is, that the words in any passage of text occur independently of one another. Independence in this context means that the presence of other words in a document is assumed to have no effect on the conditional probability of any given word appearing in that document. In other words, a document is treated as merely an unstructured collection of words, or simply a "bag of words". As is readily appreciated, this assumption is grossly wrong for any language. English, like other languages, has a rich and complex grammar and lexico-semantic structure. The meanings of words often vary widely with the particular linguistic context in which they are used, and in every case the context determines the given meaning of a word and what word or words can appear next. Thus, the words that appear in a passage of text are not simply independent of each other; rather, they are highly interdependent. Keyword-based search engines completely ignore this fine-grained linguistic structure. For example, consider an illustrative query expressed in natural language: "How many hearts does an octopus have?" A statistical search engine operating on the content words "heart" and "octopus", or their morphological stems, is likely to return or direct the user to a stored document containing a recipe whose ingredients include "artichoke hearts, squid, onions and octopus", and which therefore contains those content words. Because both content words, "octopus" and "heart", are matched, the engine would determine, based on statistical measures including, for example, proximity and logical operators, that this document is a strong match. In reality, the document is completely irrelevant to the query.
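The octopus example can be reproduced with a few lines of code. The following is only an illustrative sketch of a bag-of-words matcher, not the behavior of any particular search engine; the stopword list, the helper name content_words and the exact recipe wording are assumptions made for the illustration, and no stemming is applied.

```python
import re


def content_words(text: str, stopwords: frozenset[str]) -> set[str]:
    """Lower-case the text, split it into words, and drop stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords}


# Illustrative stopword list (an assumption, not from the source).
STOPWORDS = frozenset({"how", "many", "does", "an", "have", "and", "the", "a"})

query = "How many hearts does an octopus have?"
recipe = "Ingredients: artichoke hearts, squid, onions and octopus"

query_terms = content_words(query, STOPWORDS)
recipe_terms = content_words(recipe, STOPWORDS)

# Every content word of the query also occurs in the recipe, so a matcher
# that ignores word order and structure sees a complete match.
overlap = query_terms & recipe_terms
is_full_match = overlap == query_terms
```

Because the matcher only sees an unordered set of words, it cannot tell that "hearts" in the recipe belongs to "artichoke" rather than to "octopus".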

[0006] The art teaches various approaches for extracting the elements of syntactic phrases as head-modifier pairs with unlabeled relations. These elements are then indexed as terms (usually without internal structure) in a conventional statistical vector-space model.

[0007] One example of such an approach is taught in J. L. Fagan, "Experiments in Automatic Phrase Indexing for Document Retrieval: A Comparison of Syntactic and Non-Syntactic Methods", Ph.D. thesis, Cornell University, 1988, pp. i-261. Specifically, this approach uses natural language processing to analyze English sentences and extract syntactic phrasal constituents. These phrasal constituents are treated as terms and indexed using a statistical vector-space model. During retrieval, the user enters a natural language query. Under this approach, the query is subjected to natural language processing to extract syntactic phrasal constituents similar to those stored in the index. An attempt is then made to match the syntactic phrasal constituents from the query against those stored in the index. The author compares this purely syntactic approach with a statistical approach that uses probabilistic methods to identify the elements within syntactic phrases. The author concludes that natural language processing yields no significant improvement over the probabilistic approach, and that the small precision gains natural language processing sometimes produces do not justify its substantial processing cost.

[0008] Another such syntax-based approach is described, in the context of using natural language processing to select appropriate terms for inclusion in a search query, in T. Strzalkowski, "Natural Language Information Retrieval: TIPSTER-2 Final Report", Proceedings of Advances in Text Processing: Tipster Program Phase 2, DARPA, May 6-8, 1996, Tysons Corner, Virginia, pp. 143-148 (hereinafter the "DARPA paper"), and in T. Strzalkowski, "Natural Language Information Retrieval", Information Processing and Management, Vol. 31, No. 3, 1995, pp. 397-417. Although this approach offers theoretical promise, the author concludes on pages 147-8 of the DARPA paper that, because of the sophisticated processing required to implement the underlying natural language techniques, the approach is currently impractical:

[0009] "... it is important to keep in mind that the NLP (natural language processing) techniques that meet our requirements (or at least are believed to come close to them) are still quite unsophisticated in their ability to handle natural language text. In particular, advanced processing involving conceptual structuring, logical forms and the like remains computationally out of reach. It may be assumed that these advanced techniques will prove even more effective, since they address the problem of representation-level limits. However, the experimental evidence is sparse and necessarily limited to rather small-scale tests."

Yet another approach of this kind is described in B. Katz, "Annotating the World Wide Web using Natural Language", Conference Proceedings of RIAO 97, Computer-Assisted Information Searching in Internet, McGill University, Quebec, Canada, June 25-27, 1997, Vol. 1, pp. 136-155 (hereinafter the "Katz publication"). As described in the Katz publication, subject-verb-object expressions are created while preserving the Internet structure, so that minor syntactic alternations can be accommodated during retrieval.

[0010] These syntactic approaches yielded little improvement. Because there was no prospect of improving on the natural language processing systems then available, the field shifted from attempts to directly improve the precision and recall of the initial results of a query toward improvements in the user interface. Specifically, these were methods of refining a query through interaction with the user, such as having the user "pick the closest one" in response to the search results, and methods of visualizing query results, including displaying the results in suitable clusters.

[0011] While these improvements are useful in their own right, the precision gains they can achieve remain disappointingly low, and are clearly insufficient to eliminate the user frustration inherent in keyword searching. That is, users are still required to manually sift through relatively large document sets in which relevant responses are only sparsely present.

SUMMARY OF THE INVENTION

According to one representative embodiment, the present invention provides a method and apparatus for determining similarity between text inputs. A first set of logical forms is obtained for a first text input, and a second set of logical forms is obtained for a second text input. The first and second sets of logical forms are compared, and the similarity between the first and second text inputs is determined based on the comparison.

[0012] Broadly speaking, this processing involves generating, comparing and optionally weighting matching logical forms associated with the first and second text inputs, respectively. A logical form is a directed graph in which the words representing text of any arbitrary size are linked by labeled relations. That is, a logical form portrays the structural relationships (that is, syntactic and semantic relationships), particularly argument and/or adjunct relationships, between the important words in an input string. This portrayal can take various specific forms, such as a logical form graph or any subgraph thereof. The latter includes, for example, a list of logical form triples, each triple illustratively being of the form "word_relation_word". Any of these forms can be used with the present invention.
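As a concrete picture of this representation, the sketch below models a logical form as a small directed graph and flattens it into triples. It is only one possible encoding under stated assumptions: the names Triple and graph_to_triples are hypothetical, the semicolon-separated notation follows the triples shown later in this description, and the graph is written by hand where a real system would obtain it from a parser.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One labeled edge of a logical form graph: head --relation--> dependent."""
    head: str
    relation: str
    dependent: str

    def __str__(self) -> str:
        return f"{self.head};{self.relation};{self.dependent}"


def graph_to_triples(graph: dict[str, list[tuple[str, str]]]) -> set[Triple]:
    """Flatten a directed graph {head: [(relation, dependent), ...]} into triples."""
    return {Triple(head, rel, dep) for head, edges in graph.items() for rel, dep in edges}


# Hand-written logical form for a sentence about spiders eating victims.
logical_form = {"eat": [("Dsub", "spider"), ("Dobj", "victim")]}
triples = graph_to_triples(logical_form)
```

Storing triples in a set makes the later comparison between two text inputs a plain set intersection.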

[0013] According to one aspect of the present invention, each text input is subjected to natural language processing, illustratively morphological, syntactic and semantic processing, to ultimately produce an appropriate logical form for each sentence in each text input. The set of logical forms for the first text input is then compared with the set of logical forms associated with the second text input to ascertain matches between logical forms.

[0014] Similarity, as used herein, means a measure of how close two text inputs are with respect to either their semantic and syntactic structure or their lexical meaning, or both.

[0015] According to one illustrative application, an information retrieval system is based in part on natural language processing. Semantic information is used to capture more information about the documents being searched, the query, or both, in order to achieve higher performance or precision. In general, such a system uses natural language processing techniques to attempt to match the semantic content of a first text input (such as a query) against that of a second text input (such as a document being searched). Such a system represents a significant advance in the art, particularly with respect to achieving improved precision in the information retrieval process.

[0016] Specifically, the input query is converted into one or more logical forms, and the documents retrieved by the search engine are also converted into logical forms. The logical forms for the query are compared with the logical forms for the documents. If a logical form of a document exactly matches a logical form corresponding to the query, the document is ranked and presented to the user.
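The exact-match step can be sketched as a set intersection over triples. This is a minimal illustration rather than the patented implementation: the helper names parse and keep_matching and the two sample documents are hypothetical, while the query triples are the ones given for the spider query elsewhere in this description.

```python
def parse(line: str) -> tuple[str, str, str]:
    """Parse a 'head;relation;dependent' string into a 3-tuple."""
    head, relation, dependent = line.split(";")
    return head, relation, dependent


def keep_matching(query_triples: set, docs: dict[str, set]) -> list[str]:
    """Keep only documents that share at least one exact triple with the query."""
    return [doc_id for doc_id, triples in docs.items() if query_triples & triples]


query_triples = {parse("eat;Dsub;spider"), parse("eat;Dobj;victim")}

# Hypothetical documents: both mention "eat" and "spider", but only in
# document "a" is the spider the deep subject of "eat".
docs = {
    "a": {parse("eat;Dsub;spider"), parse("eat;Dobj;fly")},
    "b": {parse("eat;Dsub;bird"), parse("eat;Dobj;spider")},
}
kept = keep_matching(query_triples, docs)
```

Document "b" contains the same keywords as the query but is dropped, because the relation between the words differs.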

[0017] According to another aspect of the present invention, the strictness associated with the matching process described above is relaxed by using paraphrased logical forms. For example, in information retrieval applications, it may be necessary to relax the strictness of the filtering process to prevent relevant documents from being discarded. Documents that the query (or keyword search) correctly included in the retrieved set are sometimes discarded erroneously. This can occur when the keywords from the query are present in a document, but not in the exact syntactic/semantic relationships required by the logical forms generated for the query. A document erroneously discarded in this way is illustrated by the following example. Although the example discusses logical form triples, note that other subgraphs of the logical form can be used as well. Assume the query is:

How do spiders eat their victims?

The logical form triples generated for this query are:

eat;Dsub;spider
eat;Dobj;victim

[0018] A relevant document may contain the sentence "Many spiders consume their victims ...". The logical form triples generated for this sentence are:

consume;Dsub;spider
consume;Dobj;victim

Because none of the logical form triples corresponding to this document exactly matches any of the logical form triples corresponding to the query, the document is discarded, even though it may be highly relevant.
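The spider example can be sketched in code to show how paraphrased triples recover the discarded document. This is only an illustration under loud assumptions: the source does not say at this point how paraphrases are derived, so the SYNONYMS table and the paraphrase function are hypothetical, and only the head word of each triple is varied.

```python
# Hypothetical synonym table; a real system would need a lexical resource.
SYNONYMS = {"eat": {"consume", "devour"}}


def paraphrase(triples: set) -> set:
    """Add, for each triple, variants whose head word is replaced by a synonym."""
    expanded = set(triples)
    for head, relation, dependent in triples:
        for alternative in SYNONYMS.get(head, ()):
            expanded.add((alternative, relation, dependent))
    return expanded


query_triples = {("eat", "Dsub", "spider"), ("eat", "Dobj", "victim")}
doc_triples = {("consume", "Dsub", "spider"), ("consume", "Dobj", "victim")}

exact_matches = query_triples & doc_triples
relaxed_matches = paraphrase(query_triples) & doc_triples
```

With exact matching the intersection is empty and the document would be discarded; after expansion both of its triples match.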

[0019] In addition, it may be necessary to discard irrelevant documents that would otherwise be presented to the user. For example, certain classes of logical forms may appear with high frequency in the documents of the large data store being searched. Such logical forms may also be commonly present in queries, regardless of the subject matter of the query. For example, assume the query is:

Tell me about dogs.

[0020] One logical form triple generated for this query is:

tell;Dobj;me

[0021] This triple can of course also appear in many documents that have nothing to do with dogs. Such irrelevant documents would therefore be presented to the user.

Thus, according to one aspect of the present invention, one or both sets of logical forms (for one or both of the text inputs) are modified by paraphrasing the set of logical forms or by suppressing certain logical forms. One or both of the modified sets of logical forms are then used in the matching process.
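Suppression can be sketched as removing triples that occur in too many documents to discriminate between them. The source speaks only of suppressing certain classes of logical forms; using a document-frequency threshold to pick them is an assumption of this sketch, as are the function names, the tiny corpus, and the second query triple ("tell", "about", "dog"), which does not appear in the source.

```python
from collections import Counter


def high_frequency_triples(corpus: list[set], max_fraction: float) -> set:
    """Return triples that occur in more than max_fraction of the documents."""
    # Each document is a set, so every triple is counted once per document.
    counts = Counter(t for triples in corpus for t in triples)
    limit = max_fraction * len(corpus)
    return {t for t, n in counts.items() if n > limit}


def suppress(triples: set, blocked: set) -> set:
    """Drop the blocked triples from a set of logical form triples."""
    return triples - blocked


# Hypothetical corpus of per-document triple sets.
corpus = [
    {("tell", "Dobj", "me"), ("bark", "Dsub", "dog")},
    {("tell", "Dobj", "me"), ("rise", "Dsub", "price")},
    {("tell", "Dobj", "me"), ("sing", "Dsub", "bird")},
    {("chase", "Dsub", "dog"), ("chase", "Dobj", "cat")},
]
blocked = high_frequency_triples(corpus, max_fraction=0.5)

query_triples = {("tell", "Dobj", "me"), ("tell", "about", "dog")}
filtered_query = suppress(query_triples, blocked)
```

After suppression, only the triple that actually concerns dogs remains available for matching.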

[0022] In an illustrative information retrieval system, the system filters documents in a document set retrieved from a document store in response to a query. The system obtains a first set of logical forms based on a selected one of the query and the documents in the document set. The system obtains a second set of logical forms based on another of the query and the documents in the document set. The system then uses natural language processing techniques to modify the first set of logical forms, obtaining a modified set of logical forms. The system filters the documents in the document set based on a predetermined relationship between the modified set of logical forms and the second set of logical forms.

[0023] According to one aspect of the present invention, natural language processing techniques are used to obtain a first set of paraphrased logical forms indicative of paraphrases of the first set of logical forms. According to another aspect of the present invention, the natural language processing techniques suppress a first predetermined class of logical forms to obtain a first set of suppressed logical forms. Filtering is then performed based on the set of paraphrased logical forms and/or the set of suppressed logical forms.

[0024] In one embodiment, a query is received and query logical forms are computed based on the query. The query is executed and documents are retrieved based on it. For each retrieved document, logical forms are either computed or retrieved from a data store. High-frequency query logical forms are suppressed, and paraphrased logical forms are computed based on the query logical forms. The paraphrased query logical forms are then matched against the document logical forms.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Overview

The present invention utilizes natural language processing techniques to create sets of logical forms corresponding to a first text input and a second text input. The present invention determines the similarity between the first and second text inputs based on a comparison of the sets of logical forms. According to another aspect of the present invention, one or both of the sets of logical forms are modified, such as by obtaining paraphrases or by suppressing certain logical forms. While the present invention is contemplated for use in a wide variety of applications, it is described herein primarily in the context of information retrieval, for purposes of illustration only.

[0025] In the information retrieval embodiment, the present invention creates a set of logical forms corresponding to an input query, and sets of logical forms corresponding to the set of documents returned in response to the input query. The present invention also utilizes natural language processing techniques to modify the logical forms corresponding to either the query or the document set, or both. In one embodiment, the modified logical forms are expanded to include paraphrases. In another embodiment, the modified logical forms are processed to suppress predetermined classes of logical forms that have been found not to be useful in discriminating among various documents. By modifying the logical forms in this way, the present invention relaxes the strictness associated with the matching technique and thereby improves both precision and recall in the information retrieval process.

[0026] Note that the present discussion proceeds, in part, with reference to logical form triples, each comprising a word, the morphological, syntactic or semantic relation it exhibits, and another word. However, the present invention contemplates that other subgraphs of the logical form can be used as well, all of which are collectively referred to herein as logical forms.

[0027] After considering the following description, those skilled in the art will clearly recognize that the teachings of our invention can be readily utilized in many applications and in nearly any information retrieval system to improve the precision of the search engine used therein, regardless of whether that engine is a conventional statistical engine. Furthermore, our invention can be utilized to improve precision in retrieving textual information from a database held in nearly any form of mass data store, whether on magnetic, optical (e.g., CD-ROM) or other media, and regardless of the particular language in which the textual information exists, e.g., English, Spanish, German and so forth.

[0028] With this in mind, FIG. 1 shows a top-level block diagram of an information retrieval system 5 that utilizes our invention. System 5 is formed of a conventional retrieval engine 20, e.g., a keyword-based statistical retrieval engine, followed by a processor 30. As described below, processor 30 utilizes the natural language processing techniques of our invention to filter and re-rank the documents produced by engine 20, yielding an ordered set of retrieved documents that is more relevant to the user-supplied query than would otherwise be obtained.

【００２９】具体的には、動作において、ユーザが探索クエリをシステム５に供給する。ク
エリは、フル・テキスト（一般に「リテラル」（literal：文字通り）と呼んでいる）形態とし、自然言語処理によってその意味的内容を最大限利用し、それに
よってエンジン２０だけで得られるものに対して正確性の向上を図るようにすべ
きである。システム５は、このクエリをエンジン２０およびプロセッサ３０双方
に適用する。クエリに応答して、エンジン２０は、格納してある文書のデータセ
ット１０全体を探索し、そこから検索文書集合を生成する。次に、この文書集合
（ここでは、「出力文書集合」とも呼ぶ）を、ライン２５でシンボル化するよう
に、プロセッサ３０への入力として適用する。プロセッサ３０内部では、以下で
詳細に論ずるように、集合内の文書の各々に、例示的に、形態学的、意味的およ
び論理的形態の自然言語処理を施し、当該文書内の各文章毎に論理形態を生成す
る。文章に対するこのような論理形態の各々は、当該文章内の言語的句における
単語間の、例えば、意味的関係、即ち、趣旨（argument）および付加詞（adjunc
t）構造をエンコードする。プロセッサ３０は、同様にクエリを分析し、それに対応する論理形態集合を生成する。次に、プロセッサ３０はクエリに対する形態
集合を、当該集合内の文書の各々に関連する論理形態集合と比較し、クエリ集合
内の論理形態と各文書毎の論理形態との間のあらゆる一致を確かめる。一致が得
られない文書は、今後の検討から排除する。クエリ論理形態と一致する少なくと
も１つの論理形態を含む各残留文書を保持し、プロセッサ３０によって経験的に
スコアを決定する。以下で論ずるが、異なる各関係種別、即ち、論理形態三連体
内に発生し得る、深い主語（deep subject）、深い目的語（deep object）、機能語等に、既定の重みを割り当てる。このような文書各々の全重み（即ち、スコ
ア）は、例えば、それぞれ１つずつの一致した三連体全ての重みの総和である。
即ち、一致する同じ三連体を無視する。最終的に、プロセッサ３０は、保持して
ある文書を、そのスコアに基づいてランク順に整列しユーザに提示する。典型的
に、最も高いスコアを有する文書から始めて、例えば５または１０のような所定
数毎に集合化する。Specifically, in operation, a user supplies a search query to system 5. The query should be in full-text (commonly called "literal") form so that natural language processing can take full advantage of its semantic content and thereby improve on the precision obtainable from engine 20 alone. System 5 applies this query to both engine 20 and processor 30. In response to the query, engine 20 searches the entire dataset 10 of stored documents and produces a set of retrieved documents. This document set (also referred to herein as the "output document set") is then applied, as symbolized by line 25, as an input to processor 30. Within processor 30, as discussed in detail below, each document in the set is illustratively subjected to natural language processing (morphological, syntactic and logical-form analysis) to produce a logical form for each sentence in the document. Each such logical form encodes the semantic relationships, in particular the argument and adjunct structure, between words in a linguistic phrase within that sentence. Processor 30 analyzes the query in the same way and produces a corresponding set of logical forms. Processor 30 then compares the set of logical forms for the query against the set of logical forms associated with each document in the set, to ascertain any match between the logical forms in the query set and those for each document. Documents that produce no match are eliminated from further consideration. Each remaining document, which contains at least one logical form matching a query logical form, is retained and heuristically scored by processor 30. As discussed below, a predefined weight is assigned to each of the different relation types (deep subject, deep object, operator and so on) that can occur in a logical form triple. The total weight (i.e., the score) of each such document is, for example, the sum of the weights of all of its uniquely matching triples; that is, duplicate matching triples are ignored. Finally, processor 30 presents the retained documents to the user in rank order based on their scores, typically in groups of a predefined number, e.g., five or ten, starting with the documents that have the highest scores.
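
The filtering and scoring just described can be illustrated with a short Python sketch. It is not the patented implementation: the weight values, the function names and the tuple layout are assumptions made for illustration, since this passage says only that each relation type carries a predefined weight.

```python
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (word, relation, word)

# Hypothetical weights: the text states that each relation type has a
# predefined weight but gives no values in this passage.
RELATION_WEIGHTS: Dict[str, int] = {"Dsub": 100, "Dobj": 75, "Ops": 10}


def score_document(query: Iterable[Triple], document: Iterable[Triple],
                   weights: Dict[str, int] = RELATION_WEIGHTS) -> int:
    """Sum the weights of the distinct triples shared by query and document."""
    matched: Set[Triple] = set(query) & set(document)  # sets drop duplicates
    return sum(weights.get(relation, 0) for _, relation, _ in matched)


def filter_and_rank(query: Iterable[Triple],
                    documents: Dict[str, List[Triple]],
                    group_size: int = 5) -> List[List[Tuple[str, int]]]:
    """Drop documents with no matching triple, rank the rest, group them."""
    query_set = set(query)
    scored = []
    for doc_id, triples in documents.items():
        if query_set & set(triples):  # keep only documents with a match
            scored.append((doc_id, score_document(query_set, triples)))
    scored.sort(key=lambda item: item[1], reverse=True)  # highest score first
    return [scored[i:i + group_size] for i in range(0, len(scored), group_size)]
```

Using sets for the intersection is what makes duplicate matching triples count only once, which mirrors the "uniquely matching triples" rule in the text.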

【００３０】Because system 5 is very general purpose and can be adapted to a wide range of different applications, we simplify the following discussion by describing the use of our invention in one illustrative context. That context is an information retrieval system that employs a conventional keyword-based statistical Internet search engine to retrieve stored records of English-language documents indexed into a dataset from the World Wide Web. Each such record generally contains predefined information, as set forth below, about a corresponding document. With other search engines, the record may contain the entire document itself. The following discussion addresses our invention in the context of use with a conventional Internet search engine that retrieves a record containing certain information about a corresponding document, including a web address at which that document can be found. Generically speaking, however, the ultimate item retrieved by that engine is in fact the document, even though an intermediate process that uses the address is generally employed to actually access the document from the web. After considering the following description, those skilled in the art will readily appreciate how easily the present invention can be adapted for use in any other information retrieval application.

【００３１】FIG. 2 is a high-level block diagram of a particular embodiment of our invention used in the context of an Internet search engine. Our invention will be discussed in detail principally in the context of this particular embodiment. As shown, system 200 contains computer system 300, such as a client personal computer (PC), connected via network connection 205, through network 210 (here the Internet, although any other such network, e.g., an intranet, could be used instead) and network connection 215, to server 220. The server typically contains computer 222. Computer 222 hosts an Internet search engine, typified by, e.g., the ALTA VISTA search engine (ALTA VISTA is a registered trademark of Digital Equipment Corporation of Maynard, Massachusetts), and is connected to mass data store 227, which is typically a dataset of document records indexed by the search engine and accessible through the World Wide Web on the Internet. Each such record typically contains: (a) a web address (commonly referred to as a uniform resource locator, or URL) at which the corresponding document can be accessed by a web browser; (b) predefined content words that appear in that document, together with, in certain engines, the relative address of each such word with respect to the other content words in the document; (c) a short summary of the document, often just a few lines, or the first few lines of the document; and possibly (d) the description of the document as given in its Hypertext Markup Language (HTML) description field.
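
The four-part document record listed above, items (a) through (d), can be pictured as a small data structure. The Python sketch below is illustrative only: the class name, the field names and the choice of a word-to-positions mapping are assumptions, not anything specified by the text or by a real search engine.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DocumentRecord:
    """Hypothetical layout of one indexed document record."""
    url: str                                    # (a) web address of the document
    # (b) content word -> relative positions; the list stays empty for
    # engines that do not store positions
    content_words: Dict[str, List[int]] = field(default_factory=dict)
    summary: str = ""                           # (c) short summary or first lines
    html_description: Optional[str] = None      # (d) optional HTML description


# Example record with made-up values.
record = DocumentRecord(
    url="http://example.com/octopus.html",
    content_words={"octopus": [0, 14], "heart": [3]},
    summary="The octopus has three hearts.",
)
```

Keeping item (d) optional reflects the "possibly" in the text: not every record carries an HTML description.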

【００３２】A user stationed at computer system 300 establishes an Internet connection to server 220, and in particular to search engine 225 executing there, through, e.g., an associated web browser executing on this system (such as one based on the "Internet Explorer" version 4.0 browser available from Microsoft Corporation and appropriately modified to include our inventive teachings). The user then enters a query, here symbolized by line 201, into the browser. The browser, in turn, sends the query, via the computer system and through the Internet connection to server 220, to search engine 225. The search engine then processes the query against the document records stored in dataset 227 and produces a set of retrieved records for the documents that the engine determines to be relevant to the query. Because the manner in which engine 225 actually indexes documents to form the document records stored in data store 227, and the actual analysis that the engine undertakes to select any such stored document record, are both irrelevant to the present invention, we will not discuss either of these aspects further. Suffice it to say that, in response to the query, engine 225 returns a set of retrieved document records, via the Internet connection, to web browser 420. While engine 225 is retrieving documents, and/or subsequently, browser 420 analyzes the query and generates its corresponding set of logical form triples. Once the search engine has completed its retrieval, has retrieved a set of document records and has supplied that set to the browser, the corresponding documents themselves (which form the output document set) are accessed by the browser from the associated web servers. (The datasets associated with those servers collectively form a "repository" of stored documents; such a repository can also be a stand-alone dataset, as in, e.g., a self-contained CD-ROM based data retrieval application.) The browser then analyzes each accessed document (i.e., each document in the output document set) and forms a corresponding set of logical form triples for it. Thereafter, as discussed in detail below, browser 420 scores each document that has a match, based on the logical form triples that match between the query and the retrieved documents, and presents these documents to the user, as symbolized by line 203. The documents are ranked in descending order of score and are typically presented as a group of a predefined small number of the highest-ranking documents, followed, if the user so selects through the browser, by the next such group, and so on until the user has examined a sufficient number of the documents presented. Although FIG. 2 illustratively depicts our invention as using a network connection to obtain document records and documents from a remote server, our invention is not so limited. As discussed in detail below in conjunction with FIG. 9A, such a networked connection is unnecessary where the retrieval application and our invention both execute on a common computer, i.e., a local PC, and the accompanying dataset is stored on, e.g., a CD-ROM or other suitable medium and is accessible there.

【００３３】FIG. 3 and the related discussion are intended to provide a brief, general description of a suitable computing environment in which the invention may be implemented. Although not required, the invention is described, at least in part, in the general context of computer-executable instructions, such as program modules, being executed by a personal computer. Generally, program modules include routine programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the invention may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers and the like. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

【００３４】With reference to FIG. 3, an exemplary system for implementing the invention includes a general purpose computing device in the form of a conventional personal computer 320, including a processing unit 321 (which may include one or more processors), a system memory 322, and a system bus 323 that couples various system components, including the system memory, to the processing unit 321. The system bus 323 may be any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The system memory includes read only memory (ROM) 324 and random access memory (RAM) 325. A basic input/output system (BIOS) 326, containing the basic routines that help to transfer information between elements within the personal computer 320, such as during start-up, is stored in ROM 324. The personal computer 320 further includes a hard disk drive 327 for reading from and writing to a hard disk (not shown), a magnetic disk drive 328 for reading from or writing to a removable magnetic disk 329, and an optical disk drive 330 for reading from or writing to a removable optical disk 331 such as a CD-ROM or other optical media. The hard disk drive 327, magnetic disk drive 328 and optical disk drive 330 are connected to the system bus 323 by a hard disk drive interface 332, a magnetic disk drive interface 333 and an optical drive interface 334, respectively. The drives and their associated computer-readable media provide nonvolatile storage of computer-readable instructions, data structures, program modules and other data for the personal computer 320. Although the exemplary environment described herein employs a hard disk, a removable magnetic disk 329 and a removable optical disk 331, those skilled in the art will appreciate that other types of computer-readable media that can store data accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories (RAMs), read only memories (ROMs) and the like, may also be used in the exemplary operating environment.

【００３５】A number of program modules may be stored on the hard disk, magnetic disk 329, optical disk 331, ROM 324 or RAM 325, including an operating system 335, one or more application programs 336, other program modules 337 and program data 338. A user may enter commands and information into the personal computer 320 through input devices such as a keyboard 340 and a pointing device 342. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner or the like. These and other input devices are often connected to the processing unit 321 through a serial port interface 346 that is coupled to the system bus, but they may instead be connected through other interfaces, such as a parallel port, a game port or a universal serial bus (USB). A monitor 347 or other type of display device is also connected to the system bus 323 via an interface, such as a video adapter 348. In addition to the monitor 347, personal computers typically include other peripheral output devices (not shown), such as speakers and printers.

【００３６】The personal computer 320 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 349. The remote computer 349 may be another personal computer, a server, a router, a network PC, a peer device or other network node, and typically includes many or all of the elements described above relative to the personal computer 320, although only a memory storage device 350 is illustrated in FIG. 3. The logical connections depicted in FIG. 3 include a local area network (LAN) 351 and a wide area network (WAN) 352. Such networking environments are commonplace in offices, enterprise-wide computer network intranets and the Internet.

【００３７】When used in a LAN networking environment, the personal computer 320 is connected to the local area network 351 through a network interface or adapter 353. When used in a WAN networking environment, the personal computer 320 typically includes a modem 354 or other means for establishing communications over the wide area network 352, such as the Internet. The modem 354, which may be internal or external, is connected to the system bus 323 via the serial port interface 346. In a networked environment, program modules depicted relative to the personal computer 320, or portions thereof, may be stored in the remote memory storage device. It will be appreciated that the network connections shown are exemplary and that other means of establishing a communications link between the computers may be used.

【００３８】FIG. 4 is a top-level block diagram of application programs 400 that execute within computer 300 shown in FIG. 3. To the extent relevant to the present invention, these programs include, as shown in FIG. 4, web browser 420, which comprises retrieval process 600 for implementing our invention (process 600 is discussed in detail below in conjunction with FIGS. 6A and 6B). Assuming that an Internet connection has been established between the web browser and a user-selected statistical search engine, such as the ALTA VISTA search engine, the user supplies process 600, as symbolized by line 422 in FIG. 4, with a full-text ("literal") search query. The process forwards the query, as symbolized by line 426, through the web browser to the search engine. In addition, though not specifically shown, process 600 also analyzes the query internally to generate its corresponding logical form triples, which are then stored locally within computer 300. In response to the query, the search engine supplies, as symbolized by line 432, a set of statistically retrieved document records to process 600. Each of these records includes, as noted above, a web address, i.e., a URL, at which the document can be accessed, together with the appropriate command or commands that the remote web server on which the document resides requires in order to download, over the Internet, a computer file containing that document. Once process 600 has received all the records, it sends, via web browser 420 and as symbolized by line 436, the appropriate commands to access and download all the documents specified by the records (i.e., to form the output document set). These documents are then accessed in sequence from their corresponding web servers and downloaded, as symbolized by line 442, to web browser 420 and specifically to process 600. Once the documents have been downloaded, process 600 analyzes each of them to generate its logical form triples, which are stored locally. Thereafter, by comparing the logical form triples for the query against those for each document, process 600 scores each document that contains at least one matching logical form triple, ranks these documents based on their scores, and finally instructs web browser 420 to present them to the user, as symbolized by line 446, in ranked order by descending document score on a group-by-group basis. Browser 420 generates a suitable selection button on the screen of display 3800 (see FIG. 3), on which the user can "click" with his or her mouse to display each successive group of documents as desired.
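
The message flow of process 600 described above can be summarized in a short orchestration sketch. Everything in it is a hedged illustration: the function `retrieve` and its `search`, `fetch`, `analyze` and `score` parameters are hypothetical stand-ins for the search engine, the document download, the natural language analysis and the weighting scheme, none of which are specified at this level of detail in the text.

```python
from typing import Callable, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (word, relation, word)


def retrieve(query: str,
             search: Callable[[str], Iterable[str]],
             fetch: Callable[[str], str],
             analyze: Callable[[str], Set[Triple]],
             score: Callable[[Set[Triple]], int]) -> List[Tuple[str, int]]:
    """Return (url, score) pairs for matching documents, best first."""
    urls = list(search(query))        # forward query, receive records (lines 426, 432)
    query_triples = analyze(query)    # query triples, kept locally
    ranked = []
    for url in urls:                  # download each document (lines 436, 442)
        doc_triples = analyze(fetch(url))
        matched = query_triples & doc_triples
        if matched:                   # at least one matching triple is required
            ranked.append((url, score(matched)))
    ranked.sort(key=lambda pair: pair[1], reverse=True)  # descending score
    return ranked
```

Passing the four stages in as callables keeps the sketch independent of any particular engine or parser; a caller would then slice the returned list into groups of five or ten for presentation.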

【００３９】To fully appreciate the utility of logical forms in determining, preserving and encoding semantic information, we digress at this point from discussing the processing that implements our invention. We illustrate and describe, to the extent relevant, the logical forms and logical form triples used in the present invention, and give a brief overview of how they are produced.

【００４０】Broadly speaking, a logical form is a directed graph in which words representing text of any arbitrary size are linked by labeled relations. A logical form portrays the semantic relationships between the important words in a phrase, and may also include their hypernyms and/or synonyms. As discussed and shown in FIGS. 5A through 5D, a logical form can take any one of a number of different forms, e.g., a logical form graph or any sub-graph of it, such as a list of logical form triples. Each triple illustratively has the form "word-relation-word". Although our invention, as specifically embodied, generates and compares logical form triples, it can readily use any other form, such as those noted above, that can portray the semantic relationships between words. As used herein, all of these are encompassed by the term "logical form".

【００４１】Because logical form triples and their construction are best understood through a series of examples of increasingly complex sentences, first consider FIG. 5A. This figure shows logical form graph 515 and logical form triples 525 for illustrative input string 510, specifically the sentence "The octopus has three hearts."

【００４２】In general, in one illustrative embodiment, to generate logical form triples for an illustrative input string, e.g., input string 510, that string is first parsed into its constituent words. Then, using a predefined record in a stored lexicon for each such word (not to be confused with the document records used by a search engine), the records corresponding to these constituent words are themselves combined, through predefined grammatical rules, into larger structures or analyses. These are in turn combined, again through predefined grammatical rules, to form still larger structures, such as a syntactic parse tree. A logical form graph is then built from the parse tree. Whether a particular rule is applicable to a particular set of constituents is governed, in part, by the presence or absence of certain corresponding attributes, and by their values, in the word records. The logical form graph is then converted into a series of logical form triples. Illustratively, our invention uses a lexicon that has approximately 165,000 head word entries. This lexicon contains various classes of words, e.g., pronouns, conjunctions, verbs, nouns, function words and quantifiers, which define the syntactic and semantic properties inherent in the words of an input string so that a parse tree can be constructed for that string. Clearly, a logical form (or, for that matter, any other representation capable of portraying semantic relationships, such as the logical form triples or logical form graph within a logical form) can be precomputed while the corresponding document is being indexed, and stored, e.g., within the record for that document, for subsequent access and use, rather than being computed later once the document has been retrieved. Such precomputation and storage, as occurs in another embodiment of our invention discussed in detail below in conjunction with FIGS. 10 through 13B, has the advantage of drastically reducing the amount of natural language processing, and hence the associated execution time, needed to handle any document retrieved in accordance with our invention.
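
The precomputation idea at the end of this paragraph can be sketched as follows: triples are computed once at indexing time and merely looked up at query time. The class name `TripleIndex`, its methods and the injected `analyze` callable are hypothetical; the embodiment of FIGS. 10 through 13B is not reproduced here.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Tuple

Triple = Tuple[str, str, str]  # (word, relation, word)


class TripleIndex:
    """Stores logical form triples computed when a document is indexed."""

    def __init__(self, analyze: Callable[[str], Iterable[Triple]]) -> None:
        self._analyze = analyze  # the (expensive) natural language analysis
        self._store: Dict[str, FrozenSet[Triple]] = {}

    def index_document(self, doc_id: str, text: str) -> None:
        # Run the analysis once, at indexing time, and keep the result
        # alongside the document's record.
        self._store[doc_id] = frozenset(self._analyze(text))

    def triples_for(self, doc_id: str) -> FrozenSet[Triple]:
        # At query time this is a lookup; no analysis is repeated.
        return self._store[doc_id]
```

The trade-off is the usual one for precomputation: more storage per document record in exchange for less processing each time a document is retrieved.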

【００４３】Specifically, in one illustrative embodiment, an input string such as sentence 510 shown in FIG. 5A is first morphologically analyzed, using the predefined record in the lexicon for each of its constituent words, to generate a so-called "stem" (or "base") form for each word. Stem forms are used to normalize differing word forms, e.g., verb tenses and singular-plural noun variations, to a common morphological form for use by a parser. Once the stem forms have been produced, the input string is syntactically analyzed by the parser, using the grammatical rules and the attributes in the records of the constituent words, to produce its syntactic parse tree. This tree depicts the structure of the input string: each word or phrase in the input string, e.g., the noun phrase "The octopus"; the category of its corresponding grammatical function, e.g., NP for a noun phrase; and the link or links to each syntactically related word or phrase. For illustrative sentence 510, the associated syntactic parse tree is as follows.

【００４４】[0044]

【表１】 [Table 1]

【００４５】The start node, located in the upper left-hand corner of the tree, defines the type of the input string being parsed. Sentence types include "DECL" (as here) for a declarative sentence, "IMPR" for an imperative sentence and "QUES" for a question. Displayed vertically, to the right of and below the start node, is the first-level analysis. This analysis has a head node, indicated by an asterisk, which is typically the main verb (here the word "has"); a premodifier (here the noun phrase "The octopus"); and, following these, a postmodifier (the noun phrase "three hearts"). Each leaf of the tree contains a lexical term or a punctuation mark. As labels here, "NP" designates a noun phrase and "CHAR" denotes a punctuation mark.
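
The first-level analysis described above can be pictured as a small tree structure. The Python sketch below uses only the labels named in the text (DECL, NP, CHAR) plus a head marker; the `Node` class, its fields and the label "VERB" for the head node are assumptions, because the parse tree table itself is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    label: str                     # e.g. "DECL", "NP", "CHAR"
    text: str = ""                 # surface text covered by this node
    head: bool = False             # True for the asterisked head node
    children: List["Node"] = field(default_factory=list)


# First-level analysis of "The octopus has three hearts."
tree = Node("DECL", children=[
    Node("NP", "The octopus"),        # premodifier
    Node("VERB", "has", head=True),   # head node; the label "VERB" is assumed
    Node("NP", "three hearts"),       # postmodifier
    Node("CHAR", "."),                # punctuation mark
])


def head_of(node: Node) -> Node:
    """Return the child marked as head (raises StopIteration if none)."""
    return next(child for child in node.children if child.head)


def first_level(node: Node) -> List[Tuple[str, str]]:
    """List (label, text) pairs, marking the head with an asterisk."""
    return [(c.label + ("*" if c.head else ""), c.text) for c in node.children]
```

A full parser would nest further nodes under each noun phrase; the sketch stops at the first level because that is all the paragraph describes.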

【００４６】The syntactic parse tree is then further processed, using a different set of rules, to produce a logical form graph, such as graph 515 for input string 510. Producing a logical form graph involves extracting the underlying structure from the syntactic analysis of the input string. The logical form graph contains those words that are defined as having a semantic relationship between them, together with the functional nature of that relationship. The "deep" cases and functional roles used to categorize the different semantic relationships include the following.

【００４７】[0047]

【表２】
Dsub -- deep subject
Dind -- deep indirect object
Dobj -- deep object
Dnom -- deep predicate nominative
Dcmp -- deep object complement
Table 2

To identify all the semantic relationships in an input string, each node in the syntactic parse tree for that string is examined. In addition to the relations above, other semantic roles are also used, for example the following.

【００４８】[0048]

【表３】
PRED -- predicate
PTCL -- particle in a two-part verb
Ops -- function word (operator), e.g., a numeral
Nadj -- adjective modifying a noun
Dadj -- predicate adjective
PROPS -- otherwise unspecified modifier that is a clause
MODS -- otherwise unspecified modifier that is not a clause
Table 3

Additional semantic labels are defined as well. For example:

【００４９】[0049]

【表４】
ＴｍｅＡｔ−−時点
ＬｏｃＡｔ−−場所
表４
いずれにしても、入力ストリング５１０に対するこのような分析の結果は、論
理形態グラフ５１５となる。入力ストリング内において意味的関係（例えば“Oc
topus”および“Have”）を相互に示す単語は、互いに連結して示し、それらの間の関係をリンク属性（例えば、Ｄｓｕｂ）として指定する。このグラフは、入
力ストリング５１０に対するグラフ５１５で代表するが、各入力ストリング毎に
、論拠（argument）および付加語の構造を捕らえる。とりわけ、論理形態分析は
、前置詞や冠詞のような機能的単語を、当該グラフ内で描写した特色（feature ）および構造的関係にマップする。また、一実施形態では、論理形態分析は前方
照応（anaphora）の解明も行なう。即ち、例えば代名詞と相互に関係のある名詞
句との間の正しい先行詞の関係を定義し、省略に対する適正な機能的関係を検出
し描写する。曖昧さおよび／または他の言語学的特異性に対処する試みにおいて
、論理形態分析中に追加の処理が行われることも当然あり得る。その場合、論理
形態グラフから従来のように対応する論理形態三連体を単に読み出し、１つの集
合として格納する。各三連体は、２つのノード単語を含み、それらの間の意味的
関係でリンクしたグラフとして描写する。例示の入力ストリング５１０では、論
理形態三連体５２５は、処理グラフ５１５から得られた。ここでは、論理形態三
連体５２５は３つの個々の三連体を含み、入力ストリング５１０に固有な意味情
報を共同して伝達（convey）する。

TABLE 4
TmeAt -- point in time
LocAt -- location

In any case, the result of such an analysis of input string 510 is logical form graph 515. Words in the input string that bear a semantic relationship to each other (e.g. "Octopus" and "Have") are shown linked together, with the relationship between them specified as a link attribute (e.g. Dsub). This graph, typified by graph 515 for input string 510, captures the structure of arguments and adjuncts for each input string. Among other things, logical form analysis maps function words, such as prepositions and articles, into features and structural relationships depicted in the graph. In one embodiment, logical form analysis also resolves anaphora, i.e. it defines the correct antecedent relationship between, for example, a pronoun and a co-referential noun phrase, and it detects and depicts the proper functional relationships for ellipsis. Additional processing may well occur during logical form analysis in an attempt to cope with ambiguity and/or other linguistic idiosyncrasies. The corresponding logical form triples are then simply read off the logical form graph in a conventional manner and stored as a set. Each triple contains two node words linked by the semantic relationship between them, as depicted in the graph. For the example input string 510, logical form triples 525 result from processing graph 515. Here, logical form triples 525 comprise three individual triples that collectively convey the semantic information inherent in input string 510.

【００５０】同様に、図５Ｂないし図５Ｄに示すように、入力ストリング５３０，５５０お
よび５７０では、具体的な例としての文章”The octopus has three hearts and
two lungs.”（蛸には心臓が３つおよび肺が２つある。）、”The octopus has
three hearts and it can swim.”（蛸には心臓が３つあり、泳ぐことができる
）、および”I like shark fin soup bowls.”（私はフカヒレ・スープが好きだ
）に対して、論理形態グラフ５３５，５５５および５７５ならびに論理形態三連
体５４０，５６０および５８０がそれぞれ得られる。Similarly, as shown in FIGS. 5B through 5D, for input strings 530, 550 and 570, specifically the example sentences "The octopus has three hearts and two lungs.", "The octopus has three hearts and it can swim." and "I like shark fin soup bowls.", logical form graphs 535, 555 and 575 and logical form triples 540, 560 and 580 are obtained, respectively.

【００５１】従来の方法とは別に、論理形態三連体全てを正確に生成するために追加の自然
言語処理に必要な論理形態構造が３つあり、その中に、論理形態グラフから論理
形態三連体を作成する、従来の「グラフ・ウオーク」（graph walk）を含む。”
The octopus has three hearts and two lungs”という文章例、即ち、入力スト
リング５３０におけるように、調整（coordination）の場合、単語、その意味的
関係、および調整する対象の構成要素の値の各々について、論理形態三連体を作
成する。「特殊な」グラフ・ウオークによれば、図５４０において、２つの論理
形態三連体”have-Dobj-heart”および”have-Dobj-lung”があるのがわかる。従来のグラフ・ウオークのみを用いた場合、一方の論理形態三連体”have-Dobj-
and”しか得られない。同様に、”The octopus has three hearts and it can s
wim”という文章例、即ち、入力ストリング５５０におけるように、関係項（ref
erent）（Ｒｅｆｓ）を有する構成要素の場合、従来のグラフ・ウオークが発生する三連体に加えて、単語、その意味的関係、およびＲｅｆ属性の値の各々につ
いて、論理形態三連体を作成する。この特殊なグラフ・ウオークによれば、三連
体５６０において、従来の論理形態三連体”swim-Dsub-it”に加えて、論理形態
三連体”swim-Dsub-octopus”があるのがわかる。最後に、”I like shark fin
soup bowls”という文章例、即ち、入力ストリング５７０におけるように、名詞
修飾語を有する構成要素の場合、名詞複合体の可能な内部構造を表わすために、
追加の論理形態三連体を作成する。従来のグラフ・ウオークは、論理形態三連体
”bowl-Mods-shark”、”bowl-Mods-fin”および”bowl-Mods-soup”を作成し、
可能な内部構造［［ｓｈａｒｋ］［ｆｉｎ］［ｓｏｕｐ］ｂｏｗｌ］を反映した
。特殊グラフ・ウオークでは、以下の可能な内部構造［［ｓｈａｒｋｆｉｎ］
［ｓｏｕｐ］ｂｏｗｌ］、および［［ｓｈａｒｋ］［ｆｉｎｓｏｕｐ］ｂｏｗ
ｌ］、および［［ｓｈａｒｋ［ｆｉｎ］ｓｏｕｐ］ｂｏｗｌ］をそれぞれ反映
するために、次の追加の論理形態三連体”fin-Mods-shark”、”soup-Mods-fin ”、および”soup-Mods-shark”を作成する。Apart from the conventional approach, three logical form constructions require additional natural language processing in order to generate all of the logical form triples correctly, beyond the conventional "graph walk" by which logical form triples are created from the logical form graph. In the case of coordination, as in the example sentence "The octopus has three hearts and two lungs" (input string 530), a logical form triple is created for a word, its semantic relation, and each of the values of the coordinated constituent. With this "special" graph walk, triples 540 contain the two logical form triples "have-Dobj-heart" and "have-Dobj-lung". With the conventional graph walk alone, only the single logical form triple "have-Dobj-and" would be obtained. Similarly, in the case of a constituent that has referents (Refs), as in the example sentence "The octopus has three hearts and it can swim" (input string 550), a logical form triple is created for a word, its semantic relation, and each of the values of the Refs attribute, in addition to the triples produced by the conventional graph walk. With this special graph walk, triples 560 contain the logical form triple "swim-Dsub-octopus" in addition to the conventional logical form triple "swim-Dsub-it". Finally, in the case of a constituent with noun modifiers, as in the example sentence "I like shark fin soup bowls" (input string 570), additional logical form triples are created to represent the possible internal structures of the noun compound. The conventional graph walk creates the logical form triples "bowl-Mods-shark", "bowl-Mods-fin" and "bowl-Mods-soup", reflecting the possible internal structure [[shark] [fin] [soup] bowl]. The special graph walk creates the additional logical form triples "fin-Mods-shark", "soup-Mods-fin" and "soup-Mods-shark", reflecting the possible internal structures [[shark fin] [soup] bowl], [[shark] [fin soup] bowl] and [[shark [fin] soup] bowl], respectively.
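The three special cases of the graph walk can be sketched in a few lines. The function names and the flat argument lists below are assumptions made for illustration; a real walk would operate on the logical form graph itself. The noun-compound case assumes that every modifier may modify every later modifier, which reproduces the three additional triples listed in the text.

```python
from itertools import combinations
from typing import List, Sequence


def coordination_triples(head: str, relation: str, conjuncts: Sequence[str]) -> List[str]:
    """One triple per coordinated value instead of a single triple on 'and'."""
    return [f"{head}-{relation}-{conjunct}" for conjunct in conjuncts]


def referent_triples(head: str, relation: str, word: str, refs: Sequence[str]) -> List[str]:
    """The conventional triple for the word itself, plus one per Refs value."""
    return [f"{head}-{relation}-{target}" for target in [word, *refs]]


def noun_compound_triples(modifiers: Sequence[str], head: str, relation: str = "Mods") -> List[str]:
    """Conventional head-modifier triples followed by the additional
    modifier-modifier triples for the possible internal structures."""
    conventional = [f"{head}-{relation}-{modifier}" for modifier in modifiers]
    additional = [
        f"{later}-{relation}-{earlier}"
        for earlier, later in combinations(modifiers, 2)
    ]
    return conventional + additional


if __name__ == "__main__":
    print(coordination_triples("have", "Dobj", ["heart", "lung"]))
    print(referent_triples("swim", "Dsub", "it", ["octopus"]))
    print(noun_compound_triples(["shark", "fin", "soup"], "bowl"))
```

Emitting every structurally possible triple for a noun compound trades precision for recall: the matcher later sees all candidate readings without having to disambiguate the compound first.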

【００５２】形態学的、構文的、および論理的な形式処理の具体的な詳細は本発明には関連
がないので、それらに関する更なる詳細は全て省略する。しかしながら、この点
に関する更なる詳細については、１９９６年６月２８日に出願し、連番第０８／
６７４，６１０号を付与された”Method and System for Computing Semantic L
ogical Forms from Syntax Trees”（構文ツリーから意味的論理形態を計算する
方法およびシステム”、および特に１９９７年３月７日に出願し連番第０８／８
８６，８１４号を付与された”Information Retrieval Utilizaing Sematnic Re
presentation of Text”（テキストの意味的表現を利用した情報検索）と題する
同時係属中の米国特許出を読者に引用しておく。これらは双方とも本願の譲受人
に譲渡されており、この言及により本願にも含まれるものとする。Since the specific details of the morphological, syntactic and logical form processing are not relevant to the present invention, all further details of them are omitted. For further details in this regard, however, the reader is referred to the co-pending United States patent applications entitled "Method and System for Computing Semantic Logical Forms from Syntax Trees", filed June 28, 1996 and assigned serial number 08/674,610, and, in particular, "Information Retrieval Utilizing Semantic Representation of Text", filed March 7, 1997 and assigned serial number 08/886,814. Both are assigned to the present assignee and are incorporated by reference herein.

【００５３】この論理形態およびその構造の概要を念頭に入れておき、これより我々の本発
明を実施する処理の論述に戻ることにする。図２、図３および図４に示した我々
の発明の具体的な実施形態において用いるような、検索プロセス６００において
利用する我々の発明のフローチャートを、ひとまとめにして図６Ａおよび図６Ｂ
に示す。これらの図に対する図面の正しい位置合わせを図６に示す。破線のブロ
ック２２５に示す動作を例外として、これらの図に示す残りの動作は、コンピュ
ータ・システム、例えば、クライアントＰＣ３００（図２および図３参照）、お
よび具体的にはウェブ・ブラウザ４２０内部で実行する。理解を簡単にするため
に、読者は、以下の論述を通じて図２、図３、図６Ａおよび図６Ｂも同時に参照
するとよい。With this overview of logical forms and their construction in mind, we now return to the discussion of the processing that implements our present invention. A flowchart of our invention as used in retrieval process 600, as employed in the specific embodiment of our invention shown in FIGS. 2, 3 and 4, is collectively depicted in FIGS. 6A and 6B; the correct alignment of the drawing sheets for these figures is shown in FIG. 6. With the exception of the operations shown in dashed block 225, the remaining operations shown in these figures are performed by a computer system, e.g. client PC 300 (see FIGS. 2 and 3), and specifically within web browser 420. For ease of understanding, the reader should also refer to FIGS. 2, 3, 6A and 6B throughout the following discussion.

【００５４】プロセス６００を開始すると、実行は最初にブロック６０５に進む。このブロ
ックを実行すると、ユーザにウェブ・ブラウザ４２０を通じてフル・テキスト（
リテラル）クエリを入力するように促す。クエリは、単一の質問（例えば、「バ
リにはエアコン付きのホテルはあるか」）または単一の文章（例えば、７月中に
シアトルで開催される全花火大会についての問い合わせ情報を私に下さい）、ま
たは文章の一部（例えば、「エクアドルにおける衣類」）の形式とすることがで
きる。一旦このクエリを得たなら、実行は、経路６０７を通ってブロック６１０
に、そして経路６４３を通って経路６４５にと分割して進めていく。ブロック６
４５を実行すると、ＮＬＰルーチン７００を呼び出し、クエリを分析して、その
対応する論理形態三連体集合を構築し、ローカルに格納する。ブロック６１０を
実行すると、破線６１５でシンボル化するように、フル・テキスト・クエリをウ
ェブ・ブラウザ４２０から、インターネット接続を介して、サーバ２２０上に位
置するエンジン２２５のような、リモート・サーチ・エンジンに送信する。この
時点で、サーチ・エンジンによってブロック６２５を実行し、クエリに応答して
文書レコード集合を検索する。一旦この集合を形成したなら、破線６３０でシン
ボル化するように、リモート・サーバからコンピュータ・システム３００に、そ
して具体的にはそこで実行中のウェブ・ブラウザ４２０にこの集合を返送する。
その後、ブロック６３５を実行してレコード集合を受信し、各レコード毎に当該
レコードからＵＲＬを抽出し、そのＵＲＬにあるウェブ・サイトにアクセスし、
そのレコードに対応する文書を含む関連ファイルをそこからダウンロードする。
一旦文書全てをダウンロードしたなら、ブロック６４０を実行する。このような
各文書毎に、このブロックはまず当該文書から全てのテキストを抽出する。テキ
ストには、当該文書に関連付けられているＨＴＭＬタグ内に位置するあらゆるテ
キストが含まれる。その後、一度に１つの文書のみに動作する自然言語処理を簡
便化するために、従来の文書分解部（document breaker）によって各文書のテキ
ストをテキスト・ファイルに分解する。この場合、各文章（または質問）は、フ
ァイル上の別個のラインを占める。その後、ブロック６４０が当該文書内のテキ
ストの各ライン毎に繰り返しＮＬＰルーチン７００（図７に関連付けて以下で詳
細に論ずる）を呼び出し、これら文書の各々を分析し、当該文書内のテキストの
各ライン毎に対応する論理形態三連体集合を構築し、ローカルに格納する。ブロ
ック６４５における動作は、本質的にブロック６１０，６３５および６４０にお
ける動作と並行して行なうものとして論じたが、実際の実施態様の考慮に基づい
て、形成部ブロックにおける動作は、ブロック６１０，６３５および６４０内の
動作の前または後のいずれかに連続的に実行することも可能である。あるいは、
図１０ないし図１３Ｂに関連付けて以下で論ずる我々の発明の別の実施形態の場
合におけるように、各文書毎の論理形態三連体を予め計算し、後のアクセスおよ
び使用のために、文書検索中に格納しておくことも可能である。その場合、これ
らの三連体は、文書検索中に計算せずに、単純にアクセスすることが可能である
。この場合、三連体は、何らかの方法で、その格納してある文書の特性として、
または、例えば、当該文書のレコードまたは当該文書を含むデータセットのいず
れかに別個のエントリとして格納しておくことも可能である。Upon entry into process 600, execution first proceeds to block 605. This block, when executed, prompts the user to enter a full-text (literal) query through web browser 420. The query can take the form of a single question (e.g. "Are there any air-conditioned hotels in Bali?"), a single sentence (e.g. "Give me contact information for all fireworks displays held in Seattle during the month of July."), or a sentence fragment (e.g. "Clothes in Ecuador"). Once this query is obtained, execution splits and proceeds via path 607 to block 610 and via path 643 to block 645. Block 645, when executed, invokes NLP routine 700 to analyze the query, construct its corresponding set of logical form triples, and store that set locally. Block 610, when executed, transmits the full-text query, as symbolized by dashed line 615, from web browser 420 through an Internet connection to a remote search engine, such as engine 225 situated on server 220. At this point, block 625 is performed by the search engine to retrieve a set of document records in response to the query. Once this set is formed, it is transmitted, as symbolized by dashed line 630, from the remote server back to computer system 300 and specifically to web browser 420 executing there. Thereafter, block 635 is executed to receive the set of records and then, for each record, extract a URL from that record, access the web site at that URL, and download from it the associated file containing the document corresponding to that record. Once all the documents have been downloaded, block 640 is executed. For each such document, this block first extracts all the text from that document, including any text situated within the HTML tags associated with that document. Thereafter, to facilitate natural language processing, which operates on only one document at a time, the text of each document is broken into a text file by a conventional document breaker, such that each sentence (or question) occupies a separate line in the file. Block 640 then repeatedly invokes NLP routine 700 (discussed in detail below in conjunction with FIG. 7), once for each line of text in the document, to analyze each of these documents, construct a corresponding set of logical form triples for each line of text in the document, and store that set locally. Although the operations in block 645 have been discussed as being performed essentially in parallel with those in blocks 610, 635 and 640, the operations in the former block could, based on actual implementation considerations, be performed serially either before or after the operations in blocks 610, 635 and 640. Alternatively, as in the case of another embodiment of our invention discussed below in conjunction with FIGS. 10 through 13B, the logical form triples for each document can be precomputed and stored for subsequent access and use during document retrieval. In that case, the triples are simply accessed rather than computed during document retrieval. The triples may then be stored in some fashion either as properties of the stored document or, for example, as a separate entry in either the record for that document or the dataset containing that document.
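The text preparation performed in block 640 (extract the text of a downloaded document, then break it so that each sentence or question occupies its own line) might look roughly like the sketch below. It uses Python's standard `html.parser` module; the class and function names are assumptions, and the regular-expression splitter is only a naive stand-in for the "conventional" breaker the text mentions, since it will mis-split abbreviations such as "e.g.".

```python
import re
from html.parser import HTMLParser
from typing import List


class _TextCollector(HTMLParser):
    """Collects the character data found between HTML tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def extract_text(html: str) -> str:
    """Return the document text with tags removed and whitespace normalised."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(" ".join(collector.chunks).split())


def break_into_lines(text: str) -> List[str]:
    """Naive breaker (an assumption): split after '.', '?' or '!' plus whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [part for part in parts if part]


if __name__ == "__main__":
    page = "<p>The octopus has <b>three</b> hearts.</p><p>Can it swim?</p>"
    for line in break_into_lines(extract_text(page)):
        print(line)
```

Each resulting line is what would then be handed, one at a time, to NLP routine 700.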

【００５５】いずれにしろ、図６Ａおよび図６Ｂに示すプロセス６００に戻り、一旦論理形
態三連体集合を構築し、クエリに対し、および出力文書集合内の検索文書の各々
に対して完全に格納した後、ブロック６５０を実行する。このブロックは、クエ
リ内の論理形態三連体の各々を、検索文書の各々に対する論理形態三連体の各々
と比較し、クエリ内のいずれかの三連体と文書のいずれかにおけるいずれかの三
連体との間の一致を突き止める。一致の例示形態は、ノード・ワードに関して、
およびこれらの三連体における関係における、２つの三連体間の完全一致（iden
tical match）として定義する。即ち、例示としての論理形態三連体の対、wordl
a-relation1-word2aおよびword1b-relation2-word2bでは、ノード単語word1aおよびword 1bが互いに同一であり、ノード語word2aおよびword2bが互いに同一であり、relation1およびrelation2が同一である場合にのみ、一致が生ずる。１つ
の三連体の３エレメント全てが、別の三連体の対応するエレメントと完全に一致
しなければ、これら２つの三連体は一致しない。一旦ブロック６５０が完了した
なら、ブロック６５５を実行して、一致する三連体を示さない検索文書、即ち、
クエリ内の三連体のいずれにも一致する三連体がない検索文書全てを破棄する。
その後、ブロック６６０を実行する。ブロック６６０によって、残っている文書
全てに対し、これらの文書の各々について存在する、一致した三連体の関連型（
複数の型）およびその重みに基づいて、スコアを割り当てる。即ち、論理形態三
連体内に発生し得る異なる関係型毎に、図８Ａのテーブル８００に示すような、
対応する重みを割り当てる。例えば、図示のように、例示の関係Ｄｏｂｊ，Ｄｓ
ｕｂ，ＯｐｓおよびＮａｄｊには、所定の固定数値重み１００，７５，１０およ
び１０をそれぞれ割り当てるとよい。重みは、クエリと文書との間の正確な意味
的一致を示す際に、当該関係に帰せられる相対的重要度を反映する。これらの重
みの実際の数値は、通常経験に基づいて定義する。以下で図８Ｂに関連付けて詳
細に説明するが、残りの各文書に対するスコアは、それぞれ１つずつの一致三連
体（全ての重複する一致三連体は無視する）についての重みの既定の関数であり
、例示として、ここでは、数値合計（numeric sum）とする。一旦文書にこのように重み付けをしたなら、ブロック６６５を実行して、スコアの降順で文書をラ
ンク順に並び替える。最後に、ブロック６７０を実行して、最も高いスコアを示
す、典型的に５つまたは１０個の小さな既定の文書群に関して、典型的に、ラン
ク順で文書を表示する。その後、ユーザは、例えば、適切に彼（彼女）のマウス
を、ウェブ・ブラウザ４２０が表示する対応するボタン上で「クリック」するこ
とによって、コンピュータ・システム（クライアントＰＣ）３００に、ランク付
けした文書の次の群を表示させ、ランク付けした文書全てを連続してユーザが十
分に試験し終えるまで、このように続ける。試験し終えた時点で、プロセス６０
０は完了する。In any event, returning to process 600 shown in FIGS. 6A and 6B, once the sets of logical form triples have been constructed and fully stored for both the query and each of the retrieved documents in the output document set, block 650 is performed. This block compares each of the logical form triples in the query against each of the logical form triples for each of the retrieved documents to locate a match between any triple in the query and any triple in any of the documents. An illustrative form of matching is defined as an identical match between two triples, both in terms of their node words and in terms of the relation in those triples. Specifically, for an illustrative pair of logical form triples, word1a-relation1-word2a and word1b-relation2-word2b, a match occurs only if node words word1a and word1b are identical to each other, node words word2a and word2b are identical to each other, and relation1 and relation2 are the same. Unless all three elements of one triple identically match the corresponding elements of another triple, the two triples do not match. Once block 650 completes, block 655 is performed to discard all retrieved documents that exhibit no matching triple, i.e. documents having no triple that matches any triple in the query. Thereafter, block 660 is performed. Through block 660, all remaining documents are assigned a score based on the relation type(s) of the matching triples that exist for each of those documents and the weights of those types. Specifically, each different relation type that can occur in a logical form triple is assigned a corresponding weight, such as those shown in table 800 of FIG. 8A. For example, as shown, the illustrative relations Dobj, Dsub, Ops and Nadj may be assigned predetermined fixed numeric weights of 100, 75, 10 and 10, respectively. A weight reflects the relative importance ascribed to that relation in indicating a correct semantic match between the query and a document. The actual numeric values of these weights are generally defined on an empirical basis. As described in detail below in conjunction with FIG. 8B, the score for each remaining document is a predefined function of the weights of its uniquely matching triples (all duplicate matching triples are ignored), illustratively here a numeric sum. Once the documents have been scored in this way, block 665 is performed to rank order the documents in descending order of score. Finally, block 670 is performed to display the documents in rank order, typically a small predefined group at a time, usually the five or ten documents with the highest scores. Thereafter, the user can, for example by appropriately "clicking" his or her mouse on a corresponding button displayed by web browser 420, have computer system (client PC) 300 display the next group of ranked documents, and so forth, until the user has sufficiently examined all the ranked documents in succession, at which point process 600 is complete.
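Blocks 650 through 660 amount to a set intersection followed by a weighted sum. The sketch below assumes triples are plain `(word1, relation, word2)` tuples and uses the four illustrative weights quoted above; the function names are hypothetical, and giving a weight of 0 to relations that the table does not list is an assumption of this sketch.

```python
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (word1, relation, word2)

# Illustrative weights quoted in the text for table 800.
WEIGHTS: Dict[str, int] = {"Dobj": 100, "Dsub": 75, "Ops": 10, "Nadj": 10}


def matching_triples(query: Iterable[Triple], document: Iterable[Triple]) -> Set[Triple]:
    """Block 650: identical match on both node words and the relation.
    Returning a set means duplicate matches are counted once."""
    return set(query) & set(document)


def score(matches: Set[Triple], weights: Dict[str, int] = WEIGHTS) -> int:
    """Block 660: numeric sum of the weights of the matching triples.
    Relations missing from the table contribute 0 (assumption)."""
    return sum(weights.get(relation, 0) for _, relation, _ in matches)


def filter_and_score(query: List[Triple], documents: Dict[str, List[Triple]]) -> Dict[str, int]:
    """Blocks 655 and 660: drop documents with no match, score the rest."""
    scores: Dict[str, int] = {}
    for name, triples in documents.items():
        matches = matching_triples(query, triples)
        if matches:  # a document without any matching triple is discarded
            scores[name] = score(matches)
    return scores
```

Filtering on the presence of a match rather than on a positive score keeps a document whose only match carries an unlisted relation, which is what block 655 as described requires.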

【００５６】図７は、ＮＬＰルーチン７００のフローチャートを示す。このルーチンは、入
力テキストの一ラインが与えられると、それがクエリであれ、文書内の文章であ
れ、またはテキストの断片であれ、それに対して対応する論理形態三連体を構築
する。FIG. 7 depicts a flowchart of NLP routine 700. Given a single line of input text, whether it is a query, a sentence in a document, or a text fragment, this routine constructs the corresponding logical form triples for it.

【００５７】即ち、ルーチン７００に入ると、最初にブロック７１０を実行し、入力テキス
トのラインを処理し、図５Ａに示した例示のグラフ５１５のような、論理形態グ
ラフを生成する。この処理は、例示として、形態学的および意味的処理を含み、
構文解析ツリーを生成し、次いでこれから論理形態グラフを計算する。その後、
図７に示すように、ブロック７２０を実行し、グラフから対応する論理形態三連
体集合を抽出する（読み出す）。一旦これを行なったなら、ブロック７３０を実
行し、このような論理形態三連体の各々を、別個で異なるフォーマットのテキス
ト・ストリングとして発生する。最後に、ブロック７４０を実行し、入力テキス
トのライン、および一連のフォーマットしたテキスト・ストリングとして、当該
ラインに対する論理形態三連体集合を、データセット（またはデータベース）に
格納する。一旦この集合を完全に格納したなら、実行はブロック７００から出る
。あるいは、論理形態三連体の代わりに、異なる表現、例えば、論理形態に関連
する論理形態グラフを、我々の発明と共に用いる。そうする場合、その特定の形
態を、フォーマットしたストリングとして発生するようにブロック７２０および
７３０を変更するのは容易であり、ブロック７４０では、論理形態三連体の代わ
りに、その形態をデータセットに格納する。Specifically, upon entry into routine 700, block 710 is first executed to process a line of input text and produce a logical form graph, such as illustrative graph 515 shown in FIG. 5A. This processing illustratively includes morphological and semantic processing to produce a syntactic parse tree, from which the logical form graph is then computed. Thereafter, as shown in FIG. 7, block 720 is executed to extract (read) the corresponding set of logical form triples from the graph. Once this is done, block 730 is executed to generate each such logical form triple as a separate, distinctly formatted text string. Finally, block 740 is executed to store, in a dataset (or database), the line of input text together with the set of logical form triples for that line, as a series of formatted text strings. Once this set has been completely stored, execution exits from routine 700. Alternatively, a different representation, e.g. the logical form graph associated with a logical form, may be used with our invention in lieu of logical form triples. In that case, blocks 720 and 730 are readily modified to generate that particular form as a formatted string, and block 740 stores that form in the dataset instead of the logical form triples.
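The four blocks of routine 700 can be laid out as a small pipeline. In the sketch below the linguistic analysis of block 710 is deliberately left to a caller-supplied function, because the text omits its details; `toy_parser`, its canned output, the hyphen-joined string format and the dictionary used as the dataset are all hypothetical stand-ins.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Edge = Tuple[str, str, str]  # (head word, relation, dependent word)


def format_triple(edge: Edge) -> str:
    """Block 730: render one triple as a formatted text string."""
    return "-".join(edge)


def nlp_routine(line: str,
                build_graph: Callable[[str], Iterable[Edge]],
                dataset: Dict[str, List[str]]) -> List[str]:
    """Sketch of routine 700 for a single line of input text."""
    graph = build_graph(line)                              # block 710: analysis (supplied by caller)
    triples = list(dict.fromkeys(graph))                   # block 720: read off triples, de-duplicated in order
    strings = [format_triple(edge) for edge in triples]    # block 730: one string per triple
    dataset[line] = strings                                # block 740: store the line with its triples
    return strings


def toy_parser(line: str) -> List[Edge]:
    """Hypothetical stand-in for the real analysis; it knows one sentence only."""
    canned = {
        "The octopus has three hearts.": [
            ("have", "Dsub", "octopus"),
            ("have", "Dobj", "heart"),
            ("heart", "Ops", "three"),
        ],
    }
    return canned.get(line, [])


if __name__ == "__main__":
    store: Dict[str, List[str]] = {}
    print(nlp_routine("The octopus has three hearts.", toy_parser, store))
```

Note that a hyphen-joined string cannot be split back unambiguously if a node word itself contains a hyphen; a real format would need an escaping rule or a different separator.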

【００５８】我々の発明が例示として、一致する論理形態三連体を比較し重み付けし、更に
対応する文書をランク付けする方法を完全に理解するために、図８Ｂを検討する
。この図は、我々の発明の教示による、論理形態三連体の比較、文書のスコア決
定、ランク付け、および選択プロセスを図式的に示す。これらのプロセスは、例
示のクエリおよび例示の３つの検索文書集合に対して、全て図６Ａおよび図６Ｂ
に示す、ブロック６５０，６６０，６６５および６７０内で行われる。例示の目
的上、ユーザがフル・テキスト・クエリ８１０を我々の発明の検索システムに供
給し、そのクエリが”How many hearts does an octopus have?”（蛸には心臓がいくつあるか）であると仮定する。また、このクエリに応答して、統計的サー
チ・エンジンによって、３つの文書８２０を最終的に検索したと仮定する。これ
らの文書の内、第１の文書（文書１と名付ける）は、アーティチョークの芯（ar
tichoke heart）および蛸を含む調理法である。第２の文書（文書２と名付ける）は、蛸に関する論文である。第３の文書（文書３と名付ける）は、鹿に関する
論文である。これら３つの文書およびクエリをその構成論理形態三連体に変換す
る。そのためのプロセスを総括的に「ＮＬＰ」（自然言語処理）で表わす。クエ
リおよび文書１、文書２および文書３に対して得られた論理形態三連体は、ブロ
ック８３０，８４０，８５０および８６０においてそれぞれ与えられる。To fully appreciate how our invention illustratively compares and weights matching logical form triples and then ranks the corresponding documents, consider FIG. 8B. This figure graphically depicts the logical form triple comparison, document scoring, ranking and selection processes in accordance with the teachings of our invention, all of which occur within blocks 650, 660, 665 and 670 shown in FIGS. 6A and 6B, for an illustrative query and an illustrative set of three retrieved documents. For purposes of illustration, assume that a user supplies full-text query 810 to our inventive retrieval system, the query being "How many hearts does an octopus have?". Assume also that, in response to this query, three documents 820 were ultimately retrieved by a statistical search engine. Of these documents, the first (labeled Document 1) is a recipe containing artichoke hearts and octopus. The second (labeled Document 2) is an article about octopi. The third (labeled Document 3) is an article about deer. These three documents and the query are converted into their constituent logical form triples, a process generically denoted as "NLP" (natural language processing). The resulting logical form triples for the query and for Document 1, Document 2 and Document 3 are given in blocks 830, 840, 850 and 860, respectively.

【００５９】一旦これらの三連体をこのように定義したなら、次に、破線８４５，８５５，
および８６５でシンボル化するように、クエリに対する論理形態三連体を、順次
、文書１、文書２および文書３に対する論理形態三連体とそれぞれ比較し、いず
れかの文書が、クエリ内のいずれかの論理形態三連体と一致するいずれかの三連
体を含むか否かについて確かめる。文書１の場合のように、このような一致する
三連体を含まない文書を破棄し、したがってこれ以上考慮しない。一方、文書２
および文書３は、一致する三連体を含む。即ち、文書２は、このような三連体を
３つ、例示として１つの文章に関連する”HAVE-Dsub-OCTOPUS”、”HAVE-Dsub-H
EART”および例示として別の文章に関連する”HAVE-Dsub-OCTOPUS”を含む（これらの文章は、具体的に示していない）。これらの三連体の内、２つは同一であ
る。即ち、”HAVE-Dsub-OCTOPUS”は同一である。例示として、この文書に対するスコアは、当該文書内において全ての一致する三連体１つずつの重みの数値合
計である。いずれの文書についても、重複して一致する三連体は全て無視する。
三連体内に発生し得る異なる型の関係の相対的な重み付けを、その最大重みから
最小重みまで降順でランク付けした例は、最初に、動詞−目的語の組み合わせ（
Ｄｏｂｊ）、動詞−主語の組み合わせ（Ｄｓｕｂ）、前置詞および機能語（例え
ば、Ｏｐｓ）、そして最後に修飾語（例えば、Ｎａｄｊ）となる。このような重
み付け方式を、図８Ａに示す例示の三連体重み付け表８００に示す。この図を簡
略化するために、表８００は、論理形態三連体内に発生し得る異なる関係の全て
は含まず、図８Ｂに示す三連体に関連のあるものだけを含む。このメトリックで
は、各文書においてそのスコアに寄与する個々の三連体をチェック（「レ」）マ
ークで示す。勿論、先に選択したもの以外に、文書にスコアを付けるために別の
メトリックを予め決めておき、用いてもよい。例えば、重みを加算する代わりに
乗算して文書選択性（判別）を高めたり、同じ型の多数の一致を含む、および／
または先に注記したもの以外の三連体の重みを除外するというような、異なる方
法を予め規定しておき、重みを加算する。加えて、いずれの文書についても、ス
コアは、何らかの方法で、当該文書内の三連体自体におけるノード語、あるいは
当該文書におけるこれらのノード語の頻度または意味的内容、当該文書内の特定
のノード語の頻度または意味的内容、あるいは特定の論理形態（またはその言い
換え）および／または当該文書内の特定の論理形態三連体全体としての頻度、な
らびに当該文書の長さを考慮に入れることも可能である。Once these triples have been so defined, then, as symbolized by dashed lines 845, 855 and 865, the logical form triples for the query are compared in turn against the logical form triples for Document 1, Document 2 and Document 3, respectively, to ascertain whether any document contains a triple that matches any logical form triple in the query. A document that contains no such matching triple, as is the case for Document 1, is discarded and hence not considered further. Document 2 and Document 3, on the other hand, do contain matching triples. Specifically, Document 2 contains three such triples: "HAVE-Dsub-OCTOPUS" and "HAVE-Dsub-HEART", illustratively associated with one sentence, and "HAVE-Dsub-OCTOPUS", illustratively associated with another sentence (these sentences are not specifically shown). Two of these triples are identical, namely "HAVE-Dsub-OCTOPUS". Illustratively, the score for a document is the numeric sum of the weights of all uniquely matching triples in that document; all duplicate matching triples for any document are ignored. An illustrative relative weighting of the different types of relations that can occur in a triple, ranked in descending order from largest to smallest weight, is: first, verb-object combinations (Dobj); then verb-subject combinations (Dsub); then prepositions and function words (e.g. Ops); and finally modifiers (e.g. Nadj). Such a weighting scheme is given in illustrative triple weighting table 800 shown in FIG. 8A. To simplify that figure, table 800 does not include all the different relations that can occur in logical form triples, but only those pertinent to the triples shown in FIG. 8B. Under this metric, the particular triples in each document that contribute to its score are indicated by a check mark. Of course, metrics other than the one chosen above may be predefined and used to score documents. For example, weights could be multiplied rather than added in order to provide enhanced document selectivity (discrimination), or the weights could be summed in a different predefined fashion, such as including multiple matches of the same type and/or excluding the weights of triples other than those noted above. In addition, for any document, the score may in some fashion also take into account the node words in the triples themselves in that document, the frequency or semantic content of those node words in that document, the frequency or semantic content of specific node words in that document, the frequency of specific logical forms (or paraphrases thereof) and/or of specific logical form triples as a whole in that document, as well as the length of that document.
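The alternative metrics mentioned above (multiplying instead of adding, or counting repeated matches) differ only in how the list of matched weights is combined. The sketch below makes the combining function a parameter; the names `matched_weights`, `score` and `count_duplicates` are assumptions, as is the choice to skip relations that have no entry in the weight table.

```python
import math
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Triple = Tuple[str, str, str]  # (word1, relation, word2)

# Illustrative weights quoted in the text for table 800.
WEIGHTS: Dict[str, int] = {"Dobj": 100, "Dsub": 75, "Ops": 10, "Nadj": 10}


def matched_weights(matches: Iterable[Triple],
                    weights: Dict[str, int] = WEIGHTS,
                    count_duplicates: bool = False) -> List[int]:
    """Weights of the matching triples; duplicates are dropped unless requested.
    Relations without a table entry are skipped (assumption)."""
    items = list(matches) if count_duplicates else list(dict.fromkeys(matches))
    return [weights[relation] for _, relation, _ in items if relation in weights]


def score(matches: Iterable[Triple],
          combine: Callable[[Sequence[int]], int] = sum,
          **options) -> int:
    """Combine the matched weights with `sum` (default) or e.g. `math.prod`."""
    return combine(matched_weights(matches, **options))


if __name__ == "__main__":
    found = [("have", "Dobj", "heart"), ("have", "Dsub", "octopus"), ("have", "Dsub", "octopus")]
    print(score(found))                         # additive metric
    print(score(found, combine=math.prod))      # multiplicative metric
    print(score(found, count_duplicates=True))  # repeated matches counted
```

With a product, a document matching on both Dobj and Dsub pulls much further ahead of a single-match document than it does with a sum, which is the increased selectivity the text alludes to. `math.prod` of an empty list is 1, so a multiplicative metric should only be applied to documents that survived the no-match filter.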

【００６０】したがって、前述した例示のスコア決定メトリック、および図８Ａの表８００
に掲示した重みを仮定すると、文書２に対するスコアは１７５となる。これは、
文書内の最初の文章に関連し、ブロック８５０に示した最初の２つの三連体に対
する重み、即ち、１００および７５を組み合わせることによって形成したもので
ある。この文書の３番目の三連体は、その２番目の文章に関連があり、このブロ
ックに掲示してあり、既に、文書内に存在する他の三連体の１つと一致するので
、無視する。同様に、文書３に対するスコアは１００である。この特定の文書で
は、ブロック８６０に掲示するように、唯一の一致する三連体に対する重み、こ
こでは１００で形成する。これらのスコアに基づいて、文書２を文書３よりも高
くランク付けし、これらの文書をこの順序でユーザに提示する。ここでは発生し
ないが、いずれか２つの文書が同じスコアを有する場合、これらの文書は、従来
の統計的サーチ・エンジンが与える同じ順序でランク付けし、その順序でユーザ
に提示する。Thus, given the illustrative scoring metric described above and the weights listed in table 800 of FIG. 8A, the score for Document 2 is 175. This score is formed by combining the weights, i.e. 100 and 75, of the first two triples shown in block 850, which are associated with the first sentence in the document. The third triple in this document, listed in this block and associated with its second sentence, is ignored because it matches one of the other triples already present in the document. Similarly, the score for Document 3 is 100, formed from the weight, here 100, of the sole matching triple listed in block 860 for this particular document. Based on these scores, Document 2 is ranked ahead of Document 3, and the documents are presented to the user in that order. Although it does not occur here, should any two documents have the same score, those documents are ranked in the same order provided by the conventional statistical search engine and presented to the user in that order.
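The FIG. 8B example can be replayed end to end, including the tie rule. Several things in this sketch are assumptions and are marked as such in the code: the query triples, the non-matching triple given to Document 1, and the use of Dobj for the heart triple (the text prints "HAVE-Dsub-HEART", but the stated weight of 100 corresponds to Dobj in table 800). Python's `sorted` is stable, so documents with equal scores keep the order in which the search engine returned them.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (word1, relation, word2)

# Illustrative weights quoted in the text for table 800.
WEIGHTS: Dict[str, int] = {"Dobj": 100, "Dsub": 75, "Ops": 10, "Nadj": 10}

# Assumed query triples for "How many hearts does an octopus have?"
QUERY: List[Triple] = [("have", "Dsub", "octopus"), ("have", "Dobj", "heart")]

# Documents in the order returned by the search engine.
DOCUMENTS: List[Tuple[str, List[Triple]]] = [
    ("Document 1", [("heart", "Mods", "artichoke")]),  # assumed: nothing matches
    ("Document 2", [("have", "Dsub", "octopus"),
                    ("have", "Dobj", "heart"),          # Dobj assumed from the weight of 100
                    ("have", "Dsub", "octopus")]),      # duplicate, counted once
    ("Document 3", [("have", "Dobj", "heart")]),
]


def rank(query: List[Triple],
         documents: List[Tuple[str, List[Triple]]]) -> List[Tuple[str, int]]:
    """Discard non-matching documents, score the rest, sort by descending score."""
    scored = []
    for name, triples in documents:
        matches = set(query) & set(triples)
        if matches:
            scored.append((name, sum(WEIGHTS.get(rel, 0) for _, rel, _ in matches)))
    # sorted() is stable, so equal scores keep the search-engine order.
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, total in rank(QUERY, DOCUMENTS):
        print(name, total)
```

Under these assumptions the sketch reproduces the figures given in the text: 100 + 75 = 175 for Document 2 and 100 for Document 3, with Document 1 dropped.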

【００６１】明らかに、我々の発明を実施するために用いる処理の種々の部分は、単一のコ
ンピュータ内に位置することも、あるいは集合的に情報検索システムを形成する
異なるコンピュータ間で分散することも可能であることは、当業者は容易に認め
よう。これに関して、図９Ａないし図９Ｃは、それぞれ、我々の本発明の教示を
組み込んだ情報検索システムの異なる実施形態を３つそれぞれ示す。Clearly, as those skilled in the art will readily appreciate, the various portions of the processing used to implement our invention can reside in a single computer or be distributed among different computers that collectively form an information retrieval system. In this regard, FIGS. 9A through 9C respectively depict three different embodiments of information retrieval systems that incorporate the teachings of our present invention.

【００６２】このような代替実施形態の１つを図９Ａに示す。ここでは、全ての処理はＰＣ
のような単一のローカル・コンピュータ９１０内に位置する。この場合、コンピ
ュータ９１０は、サーチ・エンジンを運営し、そのエンジンを通じて、入力文書
をインデックス化し、ユーザが供給するフル・テキスト・クエリに応答して、デ
ータセット（ＣＤ−ＲＯＭまたはその他の記憶媒体のようにそこにローカルに位
置するもの、またはそのコンピュータにアクセス可能なもの）を探索し、最終的
に出力文書集合を形成する検索文書集合を生成する。また、このコンピュータは
、我々の発明の処理も担当し、クエリおよびこのような各文書双方を分析してそ
の対応する論理形態三連体集合を生成し、次いで三連体集合を比較し、先に論じ
たように文書を選択し、スコアを決め、ランク付けし、最終的に結果を、例えば
、そこにいるまたはそれにアクセス可能なローカル・ユーザに提示する。One such alternative embodiment is shown in FIG. 9A, in which all the processing resides in a single local computer 910, such as a PC. In this case, computer 910 hosts a search engine and, through that engine, indexes input documents and, in response to a user-supplied full-text query, searches a dataset (either situated locally on that computer, such as on a CD-ROM or other storage medium, or accessible to it) to ultimately yield a set of retrieved documents that forms the output document set. This computer also implements our inventive processing: it analyzes both the query and each such document to produce the corresponding sets of logical form triples, then compares the sets of triples, selects, scores and ranks the documents as discussed above, and ultimately presents the results to a local user, e.g. one situated at or having access to that computer.

【００６３】別の代替実施形態を図９Ｂに示す。これは、図２に示した具体的な内容を含み
、リモート・サーバにネットワークを通じて接続したクライアントＰＣで、検索
システムを形成する。ここでは、ネットワーク接続９２５を介してリモート・コ
ンピュータ（サーバ）９３０にクライアントＰＣ９２０を接続する。クライアン
トＰＣ９２０に位置するユーザがフル・テキスト・クエリを入力し、一方ＰＣは
ネットワーク接続を介してこのフル・テキスト・クエリをリモート・サーバに送
信する。また、クライアントＰＣは、クエリを分析し、その対応する論理形態三
連体集合を生成する。サーバは、例えば、従来の統計的サーチ・エンジンを運営
し、したがって、クエリに応答して統計的検索を引き受け、文書レコード集合を
生成する。次に、サーバはレコード集合を返送し、最終的に、クライアントの命
令によって、またはサーチ・エンジンまたは連動するソフトウエアの機能に基づ
いて自律的に、出力文書集合内の各文書をクライアントＰＣに返送する。次に、
クライアントＰＣは、出力文書集合内の対応する文書の各々を分析し、それに対
する論理形態三連体集合を生成するために受信する。次に、クライアントＰＣは
、適切に三連体集合を比較し、先に論じたように文書の選択、スコア決定、およ
びランク付けを行ない、最終的に結果をローカル・ユーザに提示することによっ
て、その処理を完了する。Another alternative embodiment is shown in FIG. 9B. It encompasses the specific context shown in FIG. 2, in which the retrieval system is formed of a client PC networked to a remote server. Here, client PC 920 is connected through network connection 925 to remote computer (server) 930. A user situated at client PC 920 enters a full-text query, which the PC, in turn, transmits over the network connection to the remote server. The client PC also analyzes the query to produce its corresponding set of logical form triples. The server hosts, for example, a conventional statistical search engine and consequently, in response to the query, undertakes a statistical retrieval to yield a set of document records. The server then returns the set of records and ultimately returns each document in the output document set to the client PC, either at the client's instruction or autonomously, depending on the capabilities of the search engine or associated software. The client PC then analyzes each of the corresponding documents it receives in the output document set to produce a set of logical form triples for that document. The client PC then completes its processing by appropriately comparing the sets of triples, selecting, scoring and ranking the documents as discussed above, and ultimately presenting the results to the local user.

【００６４】Yet another embodiment is shown in FIG. 9C. This embodiment uses the same physical hardware and network connection as in FIG. 9B: client PC 920 accepts a full-text query from a local user and sends that query via network connection 925 to remote computer (server) 930. Here, however, rather than merely running a conventional search engine, the server also performs the natural language processing of our invention. In this case the server, not the client PC, analyzes the query and generates a corresponding set of logical form triples for it. The server also downloads, if necessary, each retrieved document in the output document set, then analyzes each such document and generates a corresponding set of logical form triples for it. The server then compares the triple sets for the query and the documents and, as discussed above, selects, scores, and ranks the documents. Once ranking is complete, server 930 sends the remaining retrieved documents, in rank order, via network connection 925 to client PC 920 for display. The server may send these documents group by group as instructed by the user, as specified above, or send them all in sequence so that they can be selected and displayed group by group at the client PC.
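The group-by-group delivery of ranked documents can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation described here: the function name `in_groups`, the document identifiers, and the group size of 5 are all hypothetical.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def in_groups(ranked: Sequence[T], group_size: int = 5) -> Iterator[List[T]]:
    """Yield consecutive groups of already-ranked items, best group first.

    `group_size` is a hypothetical default; the text only says that documents
    are sent or displayed group by group.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    for start in range(0, len(ranked), group_size):
        yield list(ranked[start:start + group_size])


# Documents assumed to be sorted by descending score on the server already.
ranked_docs = [f"doc{i}" for i in range(1, 13)]
groups = in_groups(ranked_docs, group_size=5)
first_group = next(groups)   # sent or displayed first
second_group = next(groups)  # sent or displayed when the user asks for more
```

Because the groups are produced lazily, the same helper fits either variant in the text: the server can send one group per user request, or the client can receive everything and page through it locally.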

【００６５】Furthermore, remote computer (server) 930 need not be implemented by a single computer that performs all of the conventional retrieval, natural language, and associated processing described above; it may instead be a distributed processing system, as shown in FIG. 9D. In that case, the processing undertaken by the server is distributed among individual servers within it. Here, server 930 is formed of front-end processor 940, which distributes messages via connection 950 to a series of servers 960 (comprising server 1, server 2, ..., server n). Each of these servers implements a specific portion of our inventive process. For instance, server 1 can be used to index input documents into a dataset on a mass data storage device for later retrieval. Server 2 can implement a search engine, such as a conventional statistical engine, and, in response to a user-supplied query routed to it by front-end processor 940, retrieve a set of document records from the mass data storage device. These records are sent from server 2, via front-end processor 940, to, for example, server n for subsequent processing, such as downloading each corresponding document in the output document set from its web site or database. Front-end processor 940 also sends the query to server n. Server n then analyzes the query and each document, generates the corresponding sets of logical form triples, compares the triple sets, selects, scores, and ranks the documents as discussed above, and returns the ranked documents via front-end processor 940 to client PC 920, where they are displayed in rank order. Of course, the various operations used in our inventive process may be distributed among servers 960 in any of many other ways, statically or dynamically, according to run-time and/or other conditions occurring there. Furthermore, server 930 may illustratively be implemented in a well-known sysplex configuration, with a shared direct access storage device (DASD) accessible by all the processors in it (or in another similar distributed multiprocessing environment), holding, for example, the database for the conventional search engine and the lexicon used for natural language processing, both stored on that device.
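The routing role of the front-end processor can be sketched in a few lines. The sketch below runs everything in one process and assumes hypothetical names throughout (`FrontEndProcessor`, the role labels `"search"` and `"rank"`, and the two toy handlers); it shows only the message flow, not real networking or the actual division of work among servers 960.

```python
from typing import Any, Callable, Dict, List


class FrontEndProcessor:
    """Routes each request to whichever back-end server handles that role."""

    def __init__(self) -> None:
        self._servers: Dict[str, Callable[..., Any]] = {}

    def register(self, role: str, handler: Callable[..., Any]) -> None:
        self._servers[role] = handler

    def dispatch(self, role: str, *args: Any) -> Any:
        if role not in self._servers:
            raise KeyError(f"no server registered for role {role!r}")
        return self._servers[role](*args)

    def handle_query(self, query: str) -> List[str]:
        # "Server 2" role: a conventional search returns candidate records.
        records = self.dispatch("search", query)
        # "Server n" role: receives the query and the records, then
        # analyzes, compares, and ranks them.
        return self.dispatch("rank", query, records)


def toy_search(query: str) -> List[str]:
    # Placeholder for a statistical search engine.
    return ["recA", "recB", "recC"]


def toy_rank(query: str, records: List[str]) -> List[str]:
    # Placeholder ordering only; real ranking would compare triples.
    return sorted(records, reverse=True)


front_end = FrontEndProcessor()
front_end.register("search", toy_search)
front_end.register("rank", toy_rank)
result = front_end.handle_query("Who is the governor of Washington?")
```

Keeping the role-to-handler mapping in a registry means the assignment of operations to servers can be changed, statically or at run time, without touching the query-handling logic.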

【００６６】Thus far, we have described the invention as downloading a document in response to each retrieved document record and then analyzing that document locally, for example at the client PC, to produce its logical form triples. Alternatively, these triples can be generated while the search engine is indexing the document. In this regard, as the search engine locates each new document and indexes it, for example through use of a web crawler, the engine can download the complete file for that document and then preprocess it, either immediately or later through a batch process, by analyzing the document and generating its logical form triples. To complete the preprocessing, the search engine then stores these triples in its database as part of the indexed record for that document. Subsequently, whenever that document record is retrieved, such as in response to a search query, its triples are returned to the client PC as part of the document record for comparison and other purposes. Preprocessing documents at the search engine has the advantage of eliminating most of the processing time at the client PC, thereby increasing client throughput.
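To make the index-time alternative concrete, the following sketch stores triples inside each indexed record so that a later lookup needs no re-analysis. Everything named here is an assumption made for illustration: the `IndexedRecord` and `ToyIndex` classes, the `(word, relation, word)` tuple layout, the relation label `"SUBJ"`, and the stub analyzer standing in for a real natural language parser.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (word, relation, word); layout is assumed


@dataclass
class IndexedRecord:
    url: str
    triples: List[Triple] = field(default_factory=list)


class ToyIndex:
    def __init__(self, analyze: Callable[[str], List[Triple]]) -> None:
        self._analyze = analyze
        self._records: Dict[str, IndexedRecord] = {}
        self.analysis_calls = 0  # counts how often the analyzer runs

    def add_document(self, url: str, text: str) -> None:
        """Analyze once, at indexing time, and keep the triples in the record."""
        self.analysis_calls += 1
        self._records[url] = IndexedRecord(url, self._analyze(text))

    def lookup(self, url: str) -> IndexedRecord:
        """Return the stored record; no analysis happens at query time."""
        return self._records[url]


def stub_analyze(text: str) -> List[Triple]:
    # Placeholder only: a real system would run a natural language parser.
    return [("have", "SUBJ", "octopus")] if "octopus" in text else []


index = ToyIndex(stub_analyze)
index.add_document("http://example.com/a", "The octopus has three hearts.")
first = index.lookup("http://example.com/a")
second = index.lookup("http://example.com/a")
```

However many times the record is retrieved, the analyzer runs once per document, which is the source of the client-side time saving described above.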

【００６７】Furthermore, although we have discussed our invention in the specific context of use with an Internet-based search engine, it is equally applicable to use with (a) any network-accessible search engine, whether Internet-based or not, and whether accessible through a dedicated network facility or otherwise; (b) a localized search engine that operates on its own stored dataset, as in CD-ROM-based data retrieval applications typified by encyclopedias, almanacs, or other self-contained stand-alone datasets; and/or (c) any combination of these. The invention can likewise be used in any other suitable application.

【００６８】With the above in mind, FIGS. 10A and 10B collectively depict yet another embodiment of the invention. In this embodiment, logical form triples are generated by preprocessing the documents, and the resulting triples, the document records, and the documents themselves are stored together as a self-contained stand-alone dataset on common storage media, such as one or more CD-ROMs or other transportable mass media (typified by removable hard disks, tapes, or magneto-optical, large-capacity magnetic, or electronic storage devices), for ready distribution to end users. The correct alignment of the drawing sheets for these figures is shown in FIG. 10. Placing the retrieval application itself and the accompanying dataset to be searched together on common media yields a stand-alone data retrieval application and thus eliminates the need for a network connection to a remote server in order to retrieve documents.

【００６９】As shown, this embodiment consists essentially of three components: document indexing component 1005₁, duplication component 1005₂, and user component 1005₃. Component 1005₁ gathers documents and indexes them into a dataset, illustratively dataset 1030. Dataset 1030, in turn, forms the document repository for a self-contained document retrieval application, such as an encyclopedia, an almanac, a specialized library (for example, of court decision reports), or a collection of periodicals. Given the sharp decline in the cost of duplicating CD-ROMs and other media with large storage capacity, this embodiment is particularly attractive for cost-effectively distributing large document collections, together with an accurate search capability, to a wide user community.

【００７０】In any event, incoming documents to be indexed into the dataset are gathered from any number of a wide variety of sources and supplied in turn to computer 1010. Through appropriate software stored in memory 1015, this computer implements a document indexing engine. For each such document, the engine establishes a record within dataset 1030 and stores information in that record, and also establishes an entry, appropriately stored within the dataset, that contains a copy of the document itself. Engine 1015 executes triple generation process 1100. This process, described in detail below in conjunction with FIG. 11, is executed separately for each document being indexed. In essence, the process analyzes the textual phrases in the document, in essentially the same manner discussed above for block 640 shown in FIGS. 6A and 6B, and in doing so constructs a corresponding set of logical form triples for that document and stores it in dataset 1030. All the other processes that indexing engine 1015, shown in FIGS. 10A and 10B, executes to index a document, including generation of the appropriate records, are irrelevant to the present invention and are not addressed in detail. Suffice it to say that once process 1100 has generated the set of triples, engine 1015 stores that set in dataset 1030 together with a copy of the document itself and the document record created for it. Consequently, at the conclusion of all indexing operations, dataset 1030 stores not only a complete copy of every document indexed into it and the record for each, but also the set of logical form triples for each document.

【００７１】Once all the desired documents have been appropriately indexed, dataset 1030, viewed as the "master dataset", is itself duplicated through duplication component 1005₂. Within component 1005₂, conventional media duplication system 1040 repeatedly writes a copy of the contents of the master dataset, supplied over line 1035, together with a copy of the appropriate files of the retrieval software, including a retrieval process and a user installation program, supplied over line 1043, onto common storage media, such as one or more CD-ROMs, to collectively form a stand-alone document retrieval application. Through system 1040, a series 1050 of media replicas is produced, having individual replicas 1050₁, 1050₂, ..., 1050_n. As specifically shown for replica 1050₁, all the replicas are identical, and each contains a copy of the document retrieval application files supplied over line 1043 and a copy of dataset 1030 supplied over line 1035. Depending on the size and organization of the dataset, each replica may span one or more separate media, for example separate CD-ROMs. The replicas are subsequently distributed throughout a user community, typically by sale of usage licenses, as symbolized by dashed line 1055. As shown in user component 1005₃, once a user, for example User_j, obtains a replica such as CD-ROM_j (also denoted CD-ROM 1060), the user can execute the document retrieval application, which incorporates our invention, on computer system 1070 (such as a PC having substantially the same, if not identical, architecture as client PC 300 shown in FIG. 3) against the dataset stored on CD-ROM_j to retrieve desired documents from it. Specifically, after obtaining CD-ROM_j, the user inserts the CD-ROM into PC 1070 and runs the installation program stored on the CD-ROM, which copies the document retrieval application files and installs them into a predefined directory in memory 1075 of the PC, usually on a hard disk, thereby establishing document retrieval application 1085 on the PC. This application contains search engine 1090 and retrieval process 1200. Once installation is complete and application 1085 has been invoked, the user can search through the dataset on CD-ROM_j by supplying an appropriate full-text query to the application. In response to the query, the search engine retrieves from the dataset a set of documents, including the record for each document and the logical form triples stored for each such document. The query is also supplied to retrieval process 1200. This process, which is very similar to retrieval process 600 discussed above in conjunction with FIGS. 6A and 6B, analyzes the query and constructs the logical form triples for it. Process 1200, shown in FIGS. 10A and 10B, then compares the logical form triples for each retrieved document in the set, specifically those in its record, against the triples for the query. Based on the occurrence of matching triples between them and on their weights, process 1200 then scores, in the manner detailed above, each document that has at least one matching triple, ranks these documents in descending score order, and finally presents the user with a small group of document records, typically no more than 5 to 20, that have the highest rankings. The user can review these records and then instruct the document retrieval application to retrieve and display a complete copy of any associated document that appears to be of interest. Once the user has reviewed the first group of document records, for a first group of retrieved documents, the user can request the next group of document records having the next highest rankings, and so on until all the retrieved document records have been reviewed. Although application 1085 initially returns ranked document records in response to a query, it can alternatively return ranked copies of the documents themselves.

【００７２】FIG. 11 depicts triple generation process 1100, which is executed by document indexing engine 1015 shown in FIGS. 10A and 10B. As discussed above, process 1100 preprocesses a document to be indexed by analyzing the textual phrases in the document and, in doing so, constructing a corresponding set of logical form triples for that document and storing it in dataset 1030. Specifically, upon entry into process 1100, block 1110 is executed. This block first extracts all the text from the document, including any text located within HTML tags associated with the document. Then, to facilitate natural language processing, which operates on only one sentence at a time, the text of each document is broken into a text file by a conventional sentence breaker, such that each sentence (or question) occupies a separate line in the file. Block 1110 then invokes NLP routine 1300 (discussed in detail below in conjunction with FIG. 13A) separately for each line of text in the document, to analyze that text, construct the corresponding logical form triples for the line, and store them locally in dataset 1030. Once these operations are complete, execution exits from block 1110 and process 1100.
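The three steps of block 1110 (extract the text, put one sentence on each line, analyze each line separately) can be sketched with the Python standard library. This is a rough illustration under several assumptions: only character data between tags is collected (how text held in tag attributes should be treated is left open here), the regular-expression splitter is a naive stand-in for a "conventional sentence breaker", and `analyze_line` is a hypothetical callback standing in for NLP routine 1300.

```python
import re
from html.parser import HTMLParser
from typing import Any, Callable, List


class _TextCollector(HTMLParser):
    """Collects the character data found between HTML tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def extract_text(html: str) -> str:
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    # Collapse whitespace so markup layout does not affect sentence splitting.
    return re.sub(r"\s+", " ", " ".join(collector.chunks)).strip()


def split_sentences(text: str) -> List[str]:
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def process_document(html: str, analyze_line: Callable[[str], Any]) -> List[Any]:
    """Call the per-line analyzer once for each sentence of the document."""
    return [analyze_line(line) for line in split_sentences(extract_text(html))]


page = (
    "<html><body><h1>Octopus facts.</h1>"
    "<p>The octopus has three hearts. Does it have bones?</p></body></html>"
)
lines = split_sentences(extract_text(page))
first_words = process_document(page, lambda line: line.split()[0])
```

A production sentence breaker would also need to handle abbreviations, decimal numbers, and script or style content, none of which this sketch attempts.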

【００７３】A flowchart of our inventive retrieval process 1200, as used in the specific embodiment of our invention shown in FIGS. 10A and 10B, is collectively depicted in FIGS. 12A and 12B; the correct alignment of the drawing sheets for these figures is shown in FIG. 12. In contrast to retrieval process 600 (shown in FIGS. 6A and 6B and discussed in detail above), all the operations shown in FIGS. 12A and 12B are performed on a common computer system, here PC 1070 (see FIGS. 10A and 10B). To simplify understanding, the reader should also refer to FIGS. 10A and 10B throughout the following discussion.

【００７４】Upon entry into process 1200, execution first proceeds to block 1205. This block, when executed, prompts the user to enter a full-text query. Once the query has been obtained, execution splits, proceeding via path 1207 to block 1210 and via path 1243 to block 1245. Block 1245, when executed, invokes NLP routine 1350 to analyze the query, construct the corresponding set of logical form triples for it, and store the set locally in memory 1075. Block 1210, when executed, transmits the full-text query to search engine 1090, as symbolized by dashed line 1215. At this point, the search engine executes block 1220 to retrieve both a set of document records in response to the query and the logical form triples associated with each such record. Once retrieved, the set and the associated logical form triples are both returned to process 1200, specifically to block 1240 within it, as symbolized by dashed line 1230. Block 1240 merely receives this information from search engine 1090 and stores it in memory 1075 for subsequent use. Although the operations in block 1245 have been described as executing essentially in parallel with those in blocks 1210, 1090, and 1220, they could instead be executed serially, either before or after the operations in blocks 1210, 1090, or 1220, depending on actual implementation considerations.
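The split into two concurrent paths (query analysis on one, the search on the other) can be sketched with a thread pool. This is a hedged illustration, not the described implementation: `run_query`, `stub_analyze`, and `stub_search` are hypothetical names, and the stubs return fixed data in place of NLP routine 1350 and search engine 1090.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Tuple


def run_query(
    query: str,
    analyze_query: Callable[[str], Any],
    search: Callable[[str], Any],
) -> Tuple[Any, Any]:
    """Run query analysis and the search side by side, then join the results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        triples_future = pool.submit(analyze_query, query)  # role of block 1245
        records_future = pool.submit(search, query)         # role of blocks 1210/1220
        # result() blocks until each path has finished.
        return triples_future.result(), records_future.result()


def stub_analyze(query: str) -> set:
    # Placeholder for the query analysis step.
    return {("have", "SUBJ", "octopus")}


def stub_search(query: str) -> list:
    # Placeholder for the search engine; records carry stored triples.
    return [{"id": "d1", "triples": [("have", "SUBJ", "octopus")]}]


query_triples, records = run_query(
    "Does an octopus have a heart?", stub_analyze, stub_search
)
```

Because the two paths share no state until both results are joined, running them serially instead, in either order, only requires calling the two functions one after the other.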

【００７５】Once the sets of logical form triples for the query and for each of the retrieved document records have been stored in memory 1075, block 1250 is executed. As described in detail above, this block compares each logical form triple in the query against each logical form triple for each retrieved document record, to locate a match between any triple in the query and any triple in any of the corresponding documents. Once block 1250 has completed, block 1255 is executed to discard all retrieved records for documents that have no matching triples, that is, documents having no triple that matches any triple in the query. Block 1260 is then executed. Block 1260 assigns a score to every remaining document record, as defined above, based on the relation type(s) of the matching triples, and their weights, that exist for each corresponding document. Once the document records have been scored in this way, block 1265 is executed to rank the records in descending score order. Finally, block 1270 is executed to display the records in rank order, typically in terms of a small predefined group of documents, typically five or ten, that exhibit the highest scores. The user can then have the system display the next group of ranked document records, for example by appropriately "clicking" his or her mouse on a corresponding button displayed by computer system 1070, and continue in this way until the user has sufficiently examined all the ranked document records in succession (and has accessed and examined any documents of interest among them). At that point, process 1200 completes execution and exits.
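Blocks 1250 through 1265 amount to a match, filter, score, and sort pipeline, which the following sketch illustrates. Several points are assumptions rather than details taken from this passage: triples are modeled as `(word, relation, word)` tuples matched by exact equality, a document's score is taken to be the sum of the weights of its distinct matching triples, and the relation labels and weight values in `RELATION_WEIGHTS` are invented for the example.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (word, relation, word); layout is assumed

# Hypothetical weights per relation type, for illustration only.
RELATION_WEIGHTS: Dict[str, int] = {"SUBJ": 100, "OBJ": 75}
DEFAULT_WEIGHT = 10


def rank_documents(
    query_triples: Set[Triple], docs: Dict[str, List[Triple]]
) -> List[Tuple[str, int]]:
    """Return (document id, score) pairs in descending score order."""
    scored: List[Tuple[str, int]] = []
    for doc_id, triples in docs.items():
        matched = query_triples & set(triples)   # block 1250: find matches
        if not matched:                          # block 1255: discard non-matching
            continue
        total = sum(                             # block 1260: score by relation type
            RELATION_WEIGHTS.get(relation, DEFAULT_WEIGHT)
            for _, relation, _ in matched
        )
        scored.append((doc_id, total))
    # Block 1265: rank in descending score order (ties keep input order).
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


query = {("have", "SUBJ", "octopus"), ("have", "OBJ", "heart")}
documents = {
    "d1": [("have", "SUBJ", "octopus")],
    "d2": [("have", "SUBJ", "octopus"), ("have", "OBJ", "heart")],
    "d3": [("eat", "OBJ", "fish")],
}
ranking = rank_documents(query, documents)
```

In this toy data, `d3` shares no triple with the query and is dropped, while `d2` outranks `d1` because it matches both query triples.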

【００７６】FIG. 13A depicts a flowchart of NLP routine 1300, which is executed within triple generation process 1100 shown in FIG. 11. As noted above, NLP routine 1300 analyzes an incoming document to be indexed, specifically a single line of its text, constructs the corresponding set of logical form triples for that document, and stores the set locally in dataset 1030 shown in FIGS. 10A and 10B. Routine 1300 operates in essentially the same manner as NLP routine 700, shown in FIG. 7 and discussed in detail above.

【００７７】Specifically, upon entry into routine 1300, block 1310 is first executed to process the line of input text and produce a logical form graph, such as illustrative graph 515 shown in FIG. 5A. Then, as shown in FIG. 13A, block 1320 is executed to extract (read) the corresponding set of logical form triples from the graph. Once this is done, block 1330 is executed to generate each such logical form triple as a separate and distinct formatted text string. Finally, block 1340 is executed to store, in dataset 1030, the line of input text and the set of logical form triples for that line, the latter as a series of formatted text strings. Once this set has been completely stored, execution exits from routine 1300. Alternatively, a different representation, for example a logical form graph or a sub-graph associated with a logical form, may be used with our invention in lieu of logical form triples. In that case, blocks 1320 and 1330 can readily be modified to generate that particular form as a formatted string, and block 1340 modified to store that form in the dataset in lieu of logical form triples.
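The extraction and formatting steps (blocks 1320 and 1330) can be sketched as below. The representation is entirely assumed for illustration: the graph is modeled as a mapping from a head word to its labeled edges, the relation labels are invented, and the `|`-separated string layout is a hypothetical format, since the passage does not specify how the formatted text strings are laid out.

```python
from typing import Dict, List, Tuple

# head word -> list of (relation label, dependent word); an assumed layout
Graph = Dict[str, List[Tuple[str, str]]]
Triple = Tuple[str, str, str]

SEPARATOR = "|"  # hypothetical; must not occur inside words or relation labels


def extract_triples(graph: Graph) -> List[Triple]:
    """Read one (head, relation, dependent) triple per labeled edge."""
    return [
        (head, relation, dependent)
        for head, edges in graph.items()
        for relation, dependent in edges
    ]


def format_triple(triple: Triple) -> str:
    """Render a triple as a single text string suitable for storage."""
    return SEPARATOR.join(triple)


def parse_triple(text: str) -> Triple:
    """Inverse of format_triple, used when stored strings are read back."""
    head, relation, dependent = text.split(SEPARATOR)
    return (head, relation, dependent)


# Toy graph for "The octopus has three hearts."
graph: Graph = {
    "have": [("SUBJ", "octopus"), ("OBJ", "heart")],
    "heart": [("MOD", "three")],
}
triples = extract_triples(graph)
stored_strings = [format_triple(t) for t in triples]
```

Storing triples as plain strings keeps the dataset format simple and lets stored document triples be compared with query triples by ordinary string or tuple equality.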

【００７８】FIG. 13B depicts a flowchart of NLP routine 1350, which is executed within retrieval process 1200. As noted above, NLP routine 1350 analyzes the query that user User_j supplies to document retrieval application 1085 (shown in FIGS. 10A and 10B), constructs the corresponding set of logical form triples for it, and stores the set locally in memory 1075. The only operational difference between routine 1350 and routine 1300, discussed in detail above in conjunction with FIG. 13A, lies in where the corresponding triples are stored: NLP routine 1300 stores them in dataset 1030 through execution of block 1340, whereas NLP routine 1350 stores them in memory 1075 through execution of block 1390. Because the operations performed by the other blocks of routine 1350, namely blocks 1360, 1370, and 1380, are substantially identical to those of blocks 1310, 1320, and 1330, respectively, in routine 1300, we omit any detailed discussion of the former blocks.

【００７９】To experimentally test the performance of our inventive retrieval process, as generally described above in conjunction with FIG. 1, we used the ALTA VISTA search engine as the search engine in our retrieval system. This engine, which is publicly accessible on the Internet, is a conventional statistical search engine that claims to have indexed as many as 31,000,000 web pages and is widely used (currently on the order of 28,000,000 hits per day). We implemented retrieval process 600 of our invention on a standard 90 MHz Pentium PC, using various natural language processing components, including a directory file, that are contained within the grammar checker that forms part of the MICROSOFT OFFICE 97 program suite ("OFFICE" and "OFFICE 97" are trademarks of Microsoft Corporation of Redmond, Washington). We used an on-line pipelined processing model; that is, documents were gathered and processed on-line in a pipelined fashion while the user waited for the next results. On this particular PC, generating the logical form triples for each sentence took approximately one-third to one-half of a second.

【００８０】Volunteers were asked to generate full-text queries for submission to the search engine. A total of 121 widely varying queries were generated. The following are representative: "Why was the Celtic civilization so easily conquered by the Romans?", "Why do antibiotics work on colds but not on viruses?", "Who is the governor of Washington?", "Where does the Nile cross the equator?", and "When did they start vaccinating for small pox?". Each of these 121 queries was submitted to the ALTA VISTA search engine and, where available, the top 30 documents returned in response to each query were obtained. Where fewer than 30 documents were returned for a query, all the returned documents were used. Cumulatively, across all 121 queries, we obtained 3361 documents (that is, "raw" documents).
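As a quick consistency check on these figures, the short calculation below compares the 3361 documents obtained with the most that 121 queries could have returned at 30 documents each. The derived values (upper bound, shortfall, and mean per query) are simple arithmetic on the reported numbers and are not stated in the text itself; the variable names are chosen for illustration.

```python
NUM_QUERIES = 121          # queries generated by the volunteers
MAX_DOCS_PER_QUERY = 30    # top documents kept per query
DOCS_OBTAINED = 3361       # "raw" documents actually obtained

# Most documents possible if every query had returned a full 30.
upper_bound = NUM_QUERIES * MAX_DOCS_PER_QUERY

# How far the actual total falls short of that bound.
shortfall = upper_bound - DOCS_OBTAINED

# Average number of documents obtained per query.
mean_per_query = DOCS_OBTAINED / NUM_QUERIES

print(upper_bound, shortfall, round(mean_per_query, 2))
```

The total sits below the 3630-document ceiling, which is consistent with the statement that some queries returned fewer than 30 documents.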

【００８１】Each of the 3361 documents and the 121 queries was analyzed through our inventive process to produce a corresponding set of logical form triples. The sets were appropriately compared, and the resulting documents were selected, scored, and ranked as discussed above.

【００８２】３３６１の文書全てを検索するための対応するクエリとの関連性について、手
作業でかつ別個にこれらの文書をお評価した。関連性を評価するために、我々の
具体的な実験目標を知らない評価要員を利用し、これら３３６１文書の各々を、
その対応するクエリとの関連性について、「最適」、「関連あり」または「関連
なし」として、手作業でかつ主観的にランク付けした。最適な文書は、対応する
クエリに対して明示的な答えを含むものとした。関連のある文書は、クエリに対
する明示的な答えを含まないが、しかしながらそれに関連するものとした。関連
のない文書は、クエリに対して有用な応答ではないものとした。例えば、英語以
外の言語でクエリには関連のなかった文書、またはALTA VISTAエンジンが提供す
る対応のＵＲＬ（即ち、「コブウェブ」リンク）から検索できなかった文書があ
った。評価の精度を高めるために、第２の評価要員がこれら３３６１の文書の部
分集合を検査した。即ち、対応するクエリにおける論理形態三連体と一致する少
なくとも１つの論理形態三連体を有した文書（３３６１文書の内４３１）、およ
び以前に関連ありまたは最適としてランク付けしたが、一致する論理形態三連体
を全く有さなかった文書（３３６１文書の内１０２）である。文書に対するこれ
らのランキングにおいて不一致があった場合には、全て、「タイ・ブレーカ」と
しての役割を担う第３評価要員が再検討した。All 3361 documents were evaluated manually and separately for relevance to the corresponding query for which each had been retrieved. To assess relevance, we used an evaluator who was unaware of our specific experimental goals. This evaluator manually and subjectively ranked each of the 3361 documents, with respect to its corresponding query, as "optimal", "relevant", or "irrelevant". An optimal document was one that contained an explicit answer to the corresponding query. A relevant document did not contain an explicit answer to the query but was nonetheless related to it. An irrelevant document was one that was not a useful response to the query, for example, a document in a language other than English that was unrelated to the query, or a document that could not be retrieved from the corresponding URL provided by the ALTA VISTA engine (i.e., a "cobweb" link). To increase the accuracy of the evaluation, a second evaluator examined a subset of the 3361 documents: those that had at least one logical form triple matching a logical form triple in the corresponding query (431 of the 3361 documents), and those previously ranked as relevant or optimal that had no matching logical form triples at all (102 of the 3361 documents). Any disagreements in these rankings were reviewed by a third evaluator, who acted as a "tie breaker".

【００８３】この実験の結果、関与した全ての文書にわたって、我々の発明の検索システム
では、ALTA VISTAサーチ・エンジンが戻した生の文書に対して、全体的な（即ち
、選択した文書全ての）正確性において、約１６％ないし約４７％から約２００
％程の改善が得られ、上位５件の文書では、約２６％ないし約５１％から、約１
００％の改善が得られた。加えて、我々の発明システムの使用により、最適とし
て戻ってきた最初の文書は、生の文書に対するそれに対して、約１７％ないし約
３５％から約１１３％の正確性向上を得た。As a result of this experiment, across all of the documents involved, our inventive retrieval system yielded, relative to the raw documents returned by the ALTA VISTA search engine, an improvement of approximately 200% in overall precision (i.e., over all of the documents selected), from approximately 16% to approximately 47%, and an improvement of approximately 100% for the top five documents, from approximately 26% to approximately 51%. In addition, use of our inventive system increased the precision of the first document returned as optimal by approximately 113%, from approximately 17% to approximately 35%, relative to the raw documents.

【００８４】以上統計的サーチ・エンジンとの使用という状況において我々の発明を具体的
に説明したが、我々の発明はこれに限定される訳ではない。その点について、情
報検索用途では、我々の発明を用いると、実質的にあらゆる形式のサーチ・エン
ジンによって得られた検索文書でも処理し、当該エンジンの正確性を改善するこ
とができる。Although our invention has been specifically described in the context of use with a statistical search engine, it is not so limited. In information retrieval applications, our invention can be used to process the documents retrieved by virtually any type of search engine, thereby improving the precision of that engine.

【００８５】論理形態三連体における異なる属性毎に固定の重みを用いるのではなく、これ
らの重みを動的に変化させることも可能であり、実際には適応型とすることがで
きる。これを達成するために、例えば、ベイジアン（Bayesian）またはニューラ
ル・ネットワークのような学習機構を、我々の発明プロセスに組み込み、異なる
各論理形態三連体に対する数値重みを、学習経験に基づく最適値に変化させるこ
とも可能である。Rather than using a fixed weight for each different attribute in a logical form triple, these weights can be varied dynamically and, in fact, made adaptive. To accomplish this, a learning mechanism such as, for example, a Bayesian or neural network can be incorporated into our inventive process to vary the numeric weight for each different logical form triple to an optimal value based on learned experience.

【００８６】我々の発明プロセスは、１つの例示としての実施形態において先に論じたよう
に、正確に照合するために論理形態三連体を必要としたが、十分に類似する意味
的内容を三連体間で識別する目的のために、一致を判定する基準を緩め、言い換
えを一致として含ませることも可能である。言い換えは、語彙上または構造上の
いずれでもよく、あるいは以下に述べるように、抽象的論理形態の発生を含むこ
とも可能である。語彙上の言い換えの一例は、上位語または同義語のいずれかで
あろう。構造上の言い換えは、名詞相当語（noun appositive）または関係節いずれかの使用によって例示する。例えば、”the president, Bill Clinton”（大統領ビル・クリントン）というような名詞相当語の構造は、”Bill Clinton,
who is president”（大統領であるビル・クリントン）のような、一致する関係
節構造として見なして当然であろう。意味上のレベルでは、２つの単語が互いに
どのように意味的に類似しているかについて、微粒な判定（fine＿grained judg
ment）を行なうことによって、クエリ”Where is coffee grown?”（どこでコー
ヒーは栽培されているか）と、”Coffee is frequently farmed in tropical mo
untainous regions.”（コーヒーは熱帯山脈地帯で栽培されることが多い）とい
うようなコーパス（corpus）における文章との間の一致を確認することができる
。加えて、一致が存在するか否かについて判定を行なう手順は、質問されるクエ
リの形式に応じて変更することも可能である。例えば、あるクエリが、何かがど
こにあるか尋ねる場合、この手順は、クエリに対して一致すると見なされるため
には、検査対象の文章に関連するいずれの三連体にも「場所」属性が、存在する
ことを主張すべきである。したがって、論理形態三連体の「一致」は、総括的に
、完全な一致だけでなく、緩和した一致条件、判断による一致条件、および変更
した一致条件から得られるものも含むように定義する。Although our inventive process, as discussed above in one exemplary embodiment, required logical form triples to match exactly, the criteria for determining a match can be relaxed to include paraphrases as matches, for the purpose of identifying sufficiently similar semantic content among triples. A paraphrase may be lexical or structural, or may involve the generation of abstract logical forms, as described below. An example of a lexical paraphrase would be either a hypernym or a synonym. A structural paraphrase is exemplified by the use of either a noun appositive or a relative clause. For example, a noun appositive construction such as "the president, Bill Clinton" should be regarded as matching a relative clause construction such as "Bill Clinton, who is president". At the semantic level, fine-grained judgments about how semantically similar two words are to each other make it possible to recognize a match between the query "Where is coffee grown?" and a sentence in a corpus such as "Coffee is frequently farmed in tropical mountainous regions." In addition, the procedure for determining whether a match exists can be modified according to the type of query being asked. For example, if a query asks where something is, the procedure should insist that a "location" attribute be present in any triple associated with the sentence being tested before that sentence is considered a match for the query. Accordingly, a "match" of logical form triples is defined generically to encompass not only exact matches but also matches that result from relaxed, judgmental, and modified matching conditions.
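The query-type-dependent matching condition described above can be sketched as follows. This is only an illustrative sketch: the relation label `Location`, the table `REQUIRED_RELATION`, and the rule of keying on the first word of the query are assumptions made here for illustration, not details given in the text.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    head: str
    relation: str
    dependent: str


# Hypothetical table: question word -> relation that a matching sentence must contain.
REQUIRED_RELATION = {"where": "Location"}


def required_relation(query: str) -> Optional[str]:
    """Return the relation a sentence must contain for this query type, if any."""
    words = query.strip().split()
    first = words[0].lower() if words else ""
    return REQUIRED_RELATION.get(first)


def sentence_may_match(query: str, sentence_triples: List[Triple]) -> bool:
    """Apply the extra, query-type-specific condition before ordinary triple matching."""
    needed = required_relation(query)
    if needed is None:
        return True  # no extra condition for this query type
    return any(t.relation == needed for t in sentence_triples)
```

A sentence that passes this check would still have to satisfy the ordinary (exact or relaxed) triple match; the check only rules out sentences that lack the required attribute.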

【００８７】更に、我々の発明は、例えば、グラフィックス、表、ビデオまたはその他とい
った非テキスト情報の検索を中心とするその他の処理技法と容易に組み合わせて
も、全体的な正確性を向上させることができる。概して言えば、文書中の非テキ
スト・コンテンツは、当該文書内において、例えば、図の凡例または短い説明と
いうような言語的（テキスト）記述を頻繁に伴う。したがって、我々の発明プロ
セスの使用、即ち、その自然言語コンポーネントを用いて、非テキスト・コンテ
ンツにしばしば付随する言語的記述を分析し処理することができる。最初に我々
の発明の自然言語処理技法を用いて文書を検索し、クエリに意味的に関連する言
語的コンテンツを有する文書集合を突き止め、次いでこの文書集合をその非テキ
スト・コンテンツに関して処理し、関連するテキストおよび非テキスト・コンテ
ンツを有する文書（複数の文書）を突き止めることができる。あるいは、最初に
非テキスト・コンテンツに関して文書検索を行ない、文書集合を検索し、次いで
我々の発明技法によってその文書集合をその言語的コンテンツに関して処理し、
関連する文書（複数の文書）を突き止めることも可能である。Furthermore, our invention can readily be combined with other processing techniques that focus on retrieving non-textual information, such as graphics, tables, video, or the like, to improve overall precision. Generally speaking, non-textual content in a document is frequently accompanied, within that document, by a linguistic (textual) description, such as a figure legend or a short explanation. Our inventive process, specifically its natural language components, can therefore be used to analyze and process the linguistic description that often accompanies the non-textual content. Documents can first be retrieved using our inventive natural language processing technique to locate a set of documents whose linguistic content is semantically relevant to the query; that set can then be processed with respect to its non-textual content to locate the document(s) having relevant textual and non-textual content. Alternatively, document retrieval can first be performed with respect to non-textual content to retrieve a set of documents; that set can then be processed with respect to its linguistic content by our inventive technique to locate the relevant document(s).

【００８８】図１４は、本発明の一態様による情報検索システム１４８０の簡略化した機能
図である。システム１４８０は、検索エンジン１４８２、サーチ・エンジン１４
８４、および統計的データ記憶装置１４８６を含む。システム１４８０全体、ま
たはシステム１４８０の一部は、図３に示した環境に実装可能であることを注記
しておく。例えば、検索エンジン１４８２およびサーチ・エンジン１４８４は、
単純に、メモリ３２２に格納するコンピュータ読み取り可能命令として実装し、
ＣＰＵ３２１によって実行し、所望の機能を実行することができる。あるいは、
検索エンジン１４８２およびサーチ・エンジン１４８４は、図３に関して説明し
たような、あらゆる種類のコンピュータ読み取り可能媒体上に設けることも可能
である。加えて、検索エンジン１４８２およびサーチ・エンジン１４８４は、分
散型処理環境に設け、別個のプロセッサにおいて実行することも可能である。更
に、統計的データ記憶装置１４８６は、図３に関して論じたメモリ・コンポーネ
ントに格納することも可能であり、ワイド・エリア・ネットワーク３５２内に位
置するメモリ上に格納することも可能であり、また、例えば、ローカル・エリア
・ネットワーク３５１を通じてアクセス可能なメモリ３５０に格納することも可
能である。別の例示としての実施形態では、記憶装置１４８６をメモリ３２２の
一部に配置し、コンピュータ３２０内のオペレーティング・システムによってア
クセスすることができる。FIG. 14 is a simplified functional diagram of an information retrieval system 1480 according to one aspect of the present invention. System 1480 includes retrieval engine 1482, search engine 1484, and statistical data store 1486. Note that all or part of system 1480 can be implemented in the environment shown in FIG. 3. For example, retrieval engine 1482 and search engine 1484 can be implemented simply as computer-readable instructions that are stored in memory 322 and executed by CPU 321 to perform the desired functions. Alternatively, retrieval engine 1482 and search engine 1484 can be provided on any type of computer-readable medium, such as those described with respect to FIG. 3. In addition, retrieval engine 1482 and search engine 1484 can be provided in a distributed processing environment and executed on separate processors. Further, statistical data store 1486 can be stored in the memory components discussed with respect to FIG. 3, on memory located within wide area network 352, or, for example, in memory 350 accessible through local area network 351. In another exemplary embodiment, data store 1486 is located in a portion of memory 322 and is accessed by the operating system in computer 320.

【００８９】いずれの場合でも、キーボード３４０、マウス３４２等のようないずれかの適
切な入力機構を通じて、テキスト入力（即ち、クエリ）を検索エンジン１４８２
に供給する。検索エンジン１４８２は、クエリに基づいて多数の機能を実行する
。好適な一実施形態では、検索エンジン１４８２は、テキスト入力に基づいて、
ブール・クエリ（Boolean query）を定式化し、このブール・クエリをサーチ・エンジン１４８４に供給する。In any case, text input (i.e., a query) is provided to retrieval engine 1482 through any suitable input mechanism, such as keyboard 340, mouse 342, or the like. Retrieval engine 1482 performs a number of functions based on the query. In one preferred embodiment, retrieval engine 1482 formulates a Boolean query based on the text input and provides the Boolean query to search engine 1484.
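The text does not say how the Boolean query is formulated. One simple possibility, shown here purely as an assumption, is to drop common function words and AND together the remaining content words. The stop-word list `STOP_WORDS` and the function name `boolean_query` are hypothetical.

```python
import re

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {
    "a", "an", "the", "is", "are", "was", "do", "does", "did", "of", "on",
    "in", "by", "to", "for", "who", "what", "when", "where", "why", "how",
    "they", "their",
}


def boolean_query(text: str) -> str:
    """Build an AND query from the distinct content words of a free-text query."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    terms = []
    for word in words:
        if word not in STOP_WORDS and word not in terms:
            terms.append(word)
    return " AND ".join(terms)
```

With this sketch, `boolean_query("Where does the Nile cross the equator?")` returns `"nile AND cross AND equator"`.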

【００９０】サーチ・エンジン１４８４は、例示としての一実施形態では、MA、MaynardのDi
gital Equipment Corporation（ディジタル・エクイップメント社）が商用名称（commercial designation）Alta Vistaとして提供するサーチ・エンジンである
。Alta Vistaサーチ・エンジンは、従来からのインターネット検索エンジンであ
る。このような実施形態では、検索エンジン１４８２は、適切なインターネット
接続によって、サーチ・エンジン１４８４に接続する。勿論、他のサーチ・エン
ジンも同様に使用可能である。Search engine 1484 is, in one exemplary embodiment, the search engine offered under the commercial designation Alta Vista by Digital Equipment Corporation of Maynard, MA. The Alta Vista search engine is a conventional Internet search engine. In such an embodiment, retrieval engine 1482 connects to search engine 1484 through a suitable Internet connection. Of course, other search engines can be used as well.

【００９１】例示としての実施形態では、サーチ・エンジン１４８４は、統計的データ記憶
装置１４８６にアクセス可能な統計的サーチ・エンジンである。このようなサーチ
・エンジンは、典型的に、データ記憶装置１４８６を探索するために用いる探索
方法論に統計処理を組み込んでいる。In the exemplary embodiment, search engine 1484 is a statistical search engine that has access to statistical data store 1486. Such search engines typically incorporate statistical processing into the search methodology used to search data store 1486.

【００９２】データ記憶装置１４８６は、典型的に、サーチ・エンジン１４８４によってイ
ンデックス化した文書レコードのデータ集合を含むことが多い。このような各レ
コードは、例えば、対応する文書にウェブ・ブラウザによってアクセス可能なウ
ェブ・アドレス、恐らく文書の短い概要であり当該文書に現れる既定の含有単語
、およびハイパーテキスト・マークアップ言語（ＨＴＭＬ）記述フィールド内に
与える場合の当該文書の記述を含む。加えて、統計的データ記憶装置１４８６は
、内部にインデックス化してある文書に対して計算した論理形態を示すデータも
含むことができる。例示としての一実施形態では、インデックスの見出しに関連
付けた論理形態は、インデックス化した文書に元来用いられている言語に対応す
る。別の例示としての実施形態では、以下で更に詳しく説明するが、論理形態は
、言い換えの論理形態を含み、高頻度の論理形態を抑制するように変更する。Data store 1486 typically contains a data set of document records indexed by search engine 1484. Each such record contains, for example, a web address at which the corresponding document can be accessed by a web browser, predefined content words that appear in the document, possibly a short summary of the document, and a description of the document as provided in a hypertext markup language (HTML) description field, if one is given. In addition, statistical data store 1486 can also include data indicating the logical forms computed for the documents indexed therein. In one exemplary embodiment, the logical forms associated with the index entries correspond to the language originally used in the indexed documents. In another exemplary embodiment, described in more detail below, the logical forms are modified to include paraphrased logical forms and to suppress high-frequency logical forms.
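As a rough illustration of the record layout just described, a document record might be modeled as below. The class name `DocumentRecord`, its field names, and the example URL are assumptions chosen for this sketch; the text names only the kinds of information stored.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# A logical form triple written as (word, relation, word), e.g. ("eat", "Dsub", "spider").
Triple = Tuple[str, str, str]


@dataclass
class DocumentRecord:
    url: str                                # web address of the document
    content_words: List[str]                # predefined content words found in the document
    summary: Optional[str] = None           # short summary, if available
    html_description: Optional[str] = None  # HTML description field, if given
    logical_forms: List[Triple] = field(default_factory=list)  # optional precomputed forms


# Hypothetical example record.
record = DocumentRecord(
    url="http://example.com/spiders.html",
    content_words=["spider", "prey"],
    logical_forms=[("eat", "Dsub", "spider")],
)
```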

【００９３】統計的サーチ・エンジン１４８４は、典型的に、統計的データ記憶装置１４８
６から検索した各文書レコード毎に、数値尺度を算出する。この数値尺度は、サ
ーチ・エンジン１４８４に与えたクエリに基づく。このような数値尺度は、例え
ば、用語頻度＊逆文書頻度（inverse document frequency）（tf*idf）を含む場
合がある。Statistical search engine 1484 typically computes a numeric measure for each document record retrieved from statistical data store 1486. This numeric measure is based on the query provided to search engine 1484. Such a measure may include, for example, term frequency * inverse document frequency (tf*idf).
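The tf*idf measure mentioned above has many variants, and the text does not say which one the search engine uses. The sketch below assumes one common textbook form, raw term count multiplied by log(N / document frequency), summed over the query terms; the function name and data layout are illustrative only.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_scores(query_terms: List[str], docs: Dict[str, List[str]]) -> Dict[str, float]:
    """Score each tokenized document against the query using a basic tf*idf sum."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if doc_freq[term]:  # skip terms that occur in no document
                score += term_freq[term] * math.log(n_docs / doc_freq[term])
        scores[doc_id] = score
    return scores
```

Documents are then ranked by descending score. A term that occurs in every document contributes nothing, since log(1) is zero.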

【００９４】いずれの場合でも、サーチ・エンジン１４８４は、検索エンジン１４８２に、
特定した文書レコードまたは文書自体のいずれかを、各文書レコードについて算
出した統計的尺度の順にランク付けして戻す。例示としての一実施形態では、検
索エンジン１４８２は、返ってきた文書またはレコードに追加の自然言語処理を
施し、文書またはレコードのランキングに絞りをかける。次に、文書またはレコ
ードを、絞りをかけたランキングにしたがって、出力文書集合としてユーザに提
示する。In any case, search engine 1484 returns to retrieval engine 1482 either the identified document records or the documents themselves, ranked in order of the statistical measure computed for each document record. In one exemplary embodiment, retrieval engine 1482 performs additional natural language processing on the returned documents or records to refine their ranking. The documents or records are then presented to the user as an output document set, in accordance with the refined ranking.

【００９５】図１５は、サーチ・エンジン１４８４の更に詳細な機能ブロック図であり、統
計的データ記憶装置１４８６をどのようにして本発明の例示としての一実施形態
にしたがって作成するのかについて示す。図１５は、いずれかの適した記憶装置
上に格納してある文書１５８８を示す。このような記憶装置は、分散型計算機環
境におけるコンピュータ、コンピュータ３２０内のオペレーティング・システム
がアクセスするストレージ、ワイド・エリア・ネットワーク（インターネットの
ような）を通じてアクセス可能なコンピュータ、ライブラリ・データベース、ま
たは文書を格納してあるその他のいずれかの適した場所とすることができる。文
書１５８８は、典型的に、ここでは文書インデックサ１５９０と呼ぶウェブ・ク
ローラ・コンポーネントを通じて、サーチ・エンジン１４８４によってアクセス
可能である。文書インデックサ１５９０は、文書１５８８にアクセスし、公知の
方法でこれらをインデックス化し、アクセスした文書の各々に関連するレコード
を発生する。FIG. 15 is a more detailed functional block diagram of search engine 1484, illustrating how statistical data store 1486 is created in accordance with one exemplary embodiment of the present invention. FIG. 15 shows documents 1588 stored on any suitable storage device. Such a storage device can be a computer in a distributed computing environment, storage accessed by the operating system in computer 320, a computer accessible through a wide area network (such as the Internet), a library database, or any other suitable location where documents are stored. Documents 1588 are typically accessible by search engine 1484 through a web crawler component, referred to here as document indexer 1590. Document indexer 1590 accesses documents 1588, indexes them in a known manner, and generates a record associated with each of the accessed documents.

【００９６】また、サーチ・エンジン１４８４は、論理形態発生部１５９２、および論理形
態変更部１５９４も含む。論理形態発生部１５９２も文書にアクセスし、アクセ
スした文書の各々に対応する論理形態を作成する。Search engine 1484 also includes logical form generator 1592 and logical form modifier 1594. Logical form generator 1592 also accesses the documents and creates logical forms corresponding to each of the accessed documents.

【００９７】論理形態発生部１５９２は、入力テキストに基づいて、論理形態を発生する。
端的に言うと、意味分析によって、テキスト入力の意味を記述する論理形態グラ
フを発生する。論理形態グラフは、ノードおよびリンクを含み、リンクには、ノ
ード対間の関係を示すラベルを付ける。論理形態グラフは、例えば、構文解析ツ
リーよりも一層抽象的なレベルの分析を表わす。何故なら、この分析は多くの構
文的または形態学的ばらつきを正規化するからである。Logical form generator 1592 generates logical forms based on input text. Briefly, semantic analysis generates a logical form graph that describes the meaning of the text input. A logical form graph includes nodes and links, where each link is labeled to indicate the relationship between a pair of nodes. A logical form graph represents a more abstract level of analysis than, for example, a syntactic parse tree, because the analysis normalizes many syntactic and morphological variations.
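A logical form graph of the kind described above, with nodes joined by labeled links, can be represented very simply. The sketch below is an assumed data structure only: it does not perform any semantic analysis, and the class and method names are hypothetical. The graph is filled in by hand with two relations of the `word;relation;word` form used in this description.

```python
from typing import List, NamedTuple, Set


class Link(NamedTuple):
    source: str   # node the link starts from
    label: str    # relation between the two nodes, e.g. "Dsub"
    target: str   # node the link points to


class LogicalFormGraph:
    """Nodes plus labeled links; each link can be read off as one triple."""

    def __init__(self) -> None:
        self.nodes: Set[str] = set()
        self.links: List[Link] = []

    def add_link(self, source: str, label: str, target: str) -> None:
        self.nodes.update((source, target))
        self.links.append(Link(source, label, target))

    def triples(self) -> List[str]:
        """Render each link in the 'word;relation;word' notation."""
        return [f"{link.source};{link.label};{link.target}" for link in self.links]


# Hand-built graph for a sentence about spiders eating victims.
graph = LogicalFormGraph()
graph.add_link("eat", "Dsub", "spider")
graph.add_link("eat", "Dobj", "victim")
```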

【００９８】論理形態変更部１５９４は、論理形態発生部１５９２が発生した論理形態を受
け取り、この論理形態を変更する。変更部１５９４は、例示として、元の論理形
態に基づいて、言い換えた論理形態集合（paraphrased logical form）を作成し
、種々の文書間の区別に役立たない、所定のクラスの論理形態（高頻度論理形態
のような）を抑制する。Logical form modifier 1594 receives the logical forms generated by logical form generator 1592 and modifies them. Modifier 1594 illustratively creates a set of paraphrased logical forms based on the original logical forms, and suppresses predetermined classes of logical forms (such as high-frequency logical forms) that do not help to distinguish among documents.
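How high-frequency logical forms are identified is not specified at this point. One plausible reading, assumed here for illustration, is to drop any logical form that occurs in more than a chosen fraction of the documents, since such a form cannot discriminate between them. The threshold `max_doc_fraction` and the function name are hypothetical.

```python
from collections import Counter
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]


def suppress_high_frequency(
    doc_forms: Dict[str, Set[Triple]], max_doc_fraction: float = 0.5
) -> Dict[str, Set[Triple]]:
    """Remove logical forms that appear in more than max_doc_fraction of the documents."""
    n_docs = len(doc_forms)
    doc_freq = Counter()
    for forms in doc_forms.values():
        doc_freq.update(forms)  # each set contributes at most one count per form
    frequent = {form for form, count in doc_freq.items() if count / n_docs > max_doc_fraction}
    return {doc_id: forms - frequent for doc_id, forms in doc_forms.items()}
```

The same filter could be applied on the query side by removing any query form that belongs to the `frequent` set.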

【００９９】文書インデックサ１５９０が作成したレコードは、変更論理形態集合と共に、
例示として、統計的データ記憶装置１４８６に供給し、検索エンジン１４８２を
通じて供給されるクエリに応答してのサーチ・エンジン１４８４による後のアク
セスのために格納しておく。論理形態変更部１５９４については、以下で更に詳
しく説明する。The records created by document indexer 1590 are illustratively provided, together with the set of modified logical forms, to statistical data store 1486, where they are stored for later access by search engine 1484 in response to queries provided through retrieval engine 1482. Logical form modifier 1594 is described in more detail below.

【０１００】図１６は、検索エンジン１４８２の更に詳細なブロック図である。例示として
の実施形態では、検索エンジン１４８２は、入力論理形態発生部１６９６、論理
形態変更部１６９８、ブール・クエリ発生部１６００、およびフィルタ１６０２
を含む。一方、フィルタ１６０２は、論理形態比較部１６０４および文書ランク
発生部１６０６を含む。FIG. 16 is a more detailed block diagram of retrieval engine 1482. In the exemplary embodiment, retrieval engine 1482 includes input logical form generator 1696, logical form modifier 1698, Boolean query generator 1600, and filter 1602. Filter 1602, in turn, includes logical form comparator 1604 and document rank generator 1606.

【０１０１】ユーザが入力したクエリは、ブール・クエリ発生部１６００に供給する。ブー
ル・クエリ発生部１６００は、従来の情報検索システムにおけると同様に、ユー
ザ入力クエリに基づいてブール・クエリを発生する。ブール・クエリをサーチ・
エンジン１４８４に供給し、サーチ・エンジン１４８４は統計的データ記憶装置
１４８６に対してクエリを実行する。これに応答して、統計的データ記憶装置１
４８６は、文書レコード（変更した論理形態集合を含む）をサーチ・エンジン１
４８４に戻し、次いでサーチ・エンジン１４８４はこれらを検索エンジン１４８
２内のフィルタ１６０２に供給する。The query entered by the user is provided to Boolean query generator 1600. Boolean query generator 1600 generates a Boolean query based on the user's input query, as in a conventional information retrieval system. The Boolean query is provided to search engine 1484, which executes the query against statistical data store 1486. In response, statistical data store 1486 returns document records (including the sets of modified logical forms) to search engine 1484, which then provides them to filter 1602 in retrieval engine 1482.

【０１０２】また、クエリは入力論理形態発生部１６９６にも供給する。発生部１６９６は
クエリ内にある元の単語に基づいて、１つ以上の論理形態、およびそれらの互い
に対する関係を発生する。論理形態の発生は、図１５の論理形態発生部１５９２
に関して説明したのと同様に行なう。The query is also provided to input logical form generator 1696. Generator 1696 generates one or more logical forms, and their relationships to one another, based on the original words in the query. The logical forms are generated in the same manner as described with respect to logical form generator 1592 in FIG. 15.

【０１０３】元の論理形態は、論理形態変更部１６９８に供給し、これらの論理形態を変更
して、例示として、言い換え論理形態集合を含ませ、高頻度論理形態を抑制する
。更に、この変更論理形態集合をフィルタ１６０２内の論理形態比較部１６０４
に供給する。The original logical forms are provided to logical form modifier 1698, which modifies them, illustratively, to include a set of paraphrased logical forms and to suppress high-frequency logical forms. This set of modified logical forms is then provided to logical form comparator 1604 in filter 1602.

【０１０４】論理形態比較部１６０４は、クエリに基づいた変更論理形態集合を、データ記
憶装置１４８６から検索した文書に基づいた変更論理形態集合と比較する。クエ
リに基づいた変更論理形態集合のいずれかが、文書に基づいたものと一致した場
合、論理形態比較部１６０４は、一致した論理形態を含む特定の文書に重みを割
り当てる。この重みは、各文書に関連する一致の数および種類に基づいている。
全く一致を含まない文書は、いずれも破棄してユーザには提示しないか、または
当該文書はクエリに関連する可能性は低いと思われるという指示と共にユーザに
提示することができる。Logical form comparator 1604 compares the set of modified logical forms based on the query with the sets of modified logical forms based on the documents retrieved from data store 1486. If any of the modified logical forms based on the query matches one based on a document, logical form comparator 1604 assigns a weight to the particular document that contains the matching logical form. This weight is based on the number and type of matches associated with each document. Any document that contains no matches at all can either be discarded and not presented to the user, or presented to the user with an indication that it appears unlikely to be relevant to the query.

【０１０５】一致を含む文書のレコードは、論理形態比較部１６０４が割り当てた重みと共
に、文書ランク発生部１６０６に供給する。文書ランク発生部１６０６は、論理
形態比較部１６０４が割り当てた重みに基づいて、文書にランク付けを行ない、
ランク付け出力を、出力文書集合としてユーザに提示する。The records of the documents that contain matches are provided, together with the weights assigned by logical form comparator 1604, to document rank generator 1606. Document rank generator 1606 ranks the documents based on the weights assigned by logical form comparator 1604 and presents the ranked output to the user as an output document set.

【０１０６】図１７は、図１６に示したシステムの動作を、更に詳しく示すフロー図である
。最初に、統計的データ記憶装置１４８６に対して入力クエリを実行し、文書レ
コードおよびこれらの文書レコードに関連する変更論理形態をフィルタ１６０２
に供給する。これをブロック１７０８および１７１０で示す。発生部１６９６は
、次に、クエリの元のコンテンツに基づいて論理形態を発生する。これをブロッ
ク１７１２で示す。次に、クエリに基づいた論理形態を、論理形態変更部１６９
８によって変更する。これをブロック１７１４で示す。FIG. 17 is a flow diagram illustrating the operation of the system shown in FIG. 16 in more detail. First, the input query is executed against statistical data store 1486, and the document records, together with the modified logical forms associated with those records, are provided to filter 1602. This is indicated by blocks 1708 and 1710. Generator 1696 then generates logical forms based on the original content of the query. This is indicated by block 1712. The logical forms based on the query are then modified by logical form modifier 1698. This is indicated by block 1714.

【０１０７】フィルタ１６０２は、次に、クエリに応答して、サーチ・エンジン１４８４が
供給した文書レコードの内第１のものを選択する。これをブロック１７１６で示
す。論理形態比較部１６０４は、変更クエリ論理形態のいずれかが、変更文書論
理形態に対応するか否かについて判定を行なう。対応しない場合、この文書には
ゼロ・スコアを割り当て、フィルタ１６０２は、比較する必要のある追加の文書
が他にあるか否かについて判定を行なう。これをブロック１７１８，１７２０お
よび１７２２で示す。Filter 1602 then selects the first of the document records provided by search engine 1484 in response to the query. This is indicated by block 1716. Logical form comparator 1604 determines whether any of the modified query logical forms corresponds to any of the modified document logical forms. If not, the document is assigned a score of zero, and filter 1602 determines whether any additional documents remain to be compared. This is indicated by blocks 1718, 1720, and 1722.

【０１０８】しかしながら、変更クエリ論理形態のいずれかが、変更文書論理形態のいずれ
かと一致した場合、論理形態比較部１６０４が分析対象の文書に重みを割り当て
る。これをブロック１７２４で示す。再び、フィルタ１６０２は、ブロック１７
２２で示すように、比較する必要のある追加の文書が他にあるか否かについて判
定を行なう。If, however, any of the modified query logical forms matches any of the modified document logical forms, logical form comparator 1604 assigns a weight to the document under analysis. This is indicated by block 1724. Filter 1602 again determines whether any additional documents remain to be compared, as indicated by block 1722.

【０１０９】比較する必要のある文書がそれ以上ない場合、文書ランク発生部１６０６は、
論理形態比較部１６０４が割り当てた重みにしたがって、文書をランク付けする
。次いで、ランクした出力をユーザに提示する。これをブロック１７２６および
１７２８で示す。If no more documents remain to be compared, document rank generator 1606 ranks the documents according to the weights assigned by logical form comparator 1604. The ranked output is then presented to the user. This is indicated by blocks 1726 and 1728.
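The filtering and ranking flow of FIG. 17 can be sketched in a few lines. This is a minimal sketch, not the described implementation: the per-relation weights in `RELATION_WEIGHTS`, the default weight, and the function names are hypothetical values chosen so that the score depends on both the number and the type of matching logical forms.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical weights per relation type; unlisted relations get DEFAULT_WEIGHT.
RELATION_WEIGHTS = {"Dsub": 100, "Dobj": 100}
DEFAULT_WEIGHT = 10


def score_document(query_forms: Set[Triple], doc_forms: Set[Triple]) -> int:
    """Sum the weights of all logical forms shared by the query and the document."""
    matches = query_forms & doc_forms
    return sum(RELATION_WEIGHTS.get(rel, DEFAULT_WEIGHT) for _, rel, _ in matches)


def filter_and_rank(
    query_forms: Set[Triple],
    docs: Dict[str, Set[Triple]],
    keep_unmatched: bool = False,
) -> List[Tuple[str, int]]:
    """Score every document, optionally drop zero-score documents, and rank the rest."""
    scored = [(doc_id, score_document(query_forms, forms)) for doc_id, forms in docs.items()]
    if not keep_unmatched:
        scored = [(doc_id, score) for doc_id, score in scored if score > 0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Setting `keep_unmatched=True` corresponds to the alternative in which unmatched documents are still shown to the user, ranked last, rather than discarded.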

【０１１０】図１８は、図１５に示した論理形態変更部１５９４および図１６に示した論理
形態変更部１６９８の動作を示すフロー図である。本発明は、クエリ側またはデ
ータ側のいずれか、または双方において、以下で更に詳しく論ずるような、変更
論理形態の使用も想定していることは理解されよう。この論述の目的上、クエリ
側およびデータ側双方に論理形態変更部を示す。FIG. 18 is a flow diagram illustrating the operation of logical form modifier 1594 shown in FIG. 15 and logical form modifier 1698 shown in FIG. 16. It will be appreciated that the present invention contemplates the use of modified logical forms, as discussed in more detail below, on either the query side or the data side, or both. For the purposes of this discussion, logical form modifiers are shown on both the query side and the data side.

【０１１１】いずれの場合でも、論理形態変更部は、最初に、クエリまたは分析対象文書の
いずれかに基づいて発生した、元の論理形態を受け取る。これをブロック１８３
０で示す。次に、論理形態変更部は、元の論理形態の言い換えを発生する。この
言い換えは、多数の方法のいずれでも形成することができる。そのいくつかにつ
いて以下で説明する。言い換え論理形態の発生をブロック１８３２で示す。In either case, the logical form modifier first receives the original logical forms generated from either the query or the document under analysis. This is indicated by block 1830. The logical form modifier then generates paraphrases of the original logical forms. These paraphrases can be formed in any of a number of ways, some of which are described below. The generation of paraphrased logical forms is indicated by block 1832.

【０１１２】次に、論理形態変更部は、所定のクラスの論理形態（多種多様の論理形態とす
ることも可能である）を抑制する。その数については以下で論ずる。この抑制を
ブロック１８３４で示す。言い換え論理形態は、抑制を受けた後、フィルタ１６０
２に供給され、抑制後に残っている論理形態に基づいて文書を濾過する。これを
ブロック１８３６で示す。（変更論理形態の発生）図１９は、言い換え論理形態の発生、および論理形態の抑制をより良く示すフ
ロー図である。（意味上または語彙上の言い換え）論理形態変更部の１つが、元の論理形態を受け取る。次に、論理形態変更部は
、最初に、元の論理形態内にある単語の意味的拡大を実行することによって、語
彙上の言い換え論理形態を形成する。これをブロック１９３８で示す。次に、意
味的に拡大した単語に基づいて、そして元の論理形態における元の構造的接続を
用いて、語彙上の言い換え論理形態を発生する。これをブロック１９４０で示す
。The logical form modifier next suppresses predetermined classes of logical forms (which can be a wide variety of logical forms), a number of which are discussed below. This suppression is indicated by block 1834. The paraphrased logical forms, having undergone suppression, are provided to filter 1602, which filters documents based on the logical forms that remain after suppression. This is indicated by block 1836.

(Generation of Modified Logical Forms)
FIG. 19 is a flow diagram that better illustrates the generation of paraphrased logical forms and the suppression of logical forms.

(Semantic or Lexical Paraphrase)
One of the logical form modifiers receives the original logical forms. The logical form modifier first forms lexically paraphrased logical forms by performing semantic expansion of the words in the original logical forms. This is indicated by block 1938. Lexically paraphrased logical forms are then generated based on the semantically expanded words, using the original structural connections in the original logical forms. This is indicated by block 1940.

【０１１３】例示としての一実施形態では、意味的拡大を実行するには、元の論理形態にお
ける各含有単語を試験し、同義語、上位語、下位語、または元の含有単語に意味
的な関係を有するその他の単語を含むように、その単語を拡大する。例えば、論
理形態変更部１５９４および１６９８には、一実施形態では、シソーラスのような参照コ
ルプス、辞書、あるいはＷｏｒｄＮｅｔまたはＭｉｎｄＮｅｔ語彙のような計算
による語彙（computational lexicon）へのアクセスを与え、単語間の同義語、上位語、下位語、またはその他の意味的関係を識別し、クエリと文書との間に可
能な語彙上の言い換え関係を特定することができる。In one exemplary embodiment, semantic expansion is performed by examining each content word in the original logical forms and expanding it to include synonyms, hypernyms, hyponyms, or other words that have a semantic relationship to the original content word. For example, in one embodiment, logical form modifiers 1594 and 1698 are given access to a reference corpus such as a thesaurus, a dictionary, or a computational lexicon such as the WordNet or MindNet lexicon, so that synonymy, hypernymy, hyponymy, or other semantic relationships between words can be identified and possible lexical paraphrase relationships between the query and the documents can be determined.
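The expansion step described above can be sketched as follows. A tiny hand-written dictionary, `LEXICON`, stands in for the thesaurus or computational lexicon; its entries and the function names are assumptions made for this sketch. Each content word is expanded, and every combination of expanded words is recombined under the original relation.

```python
from itertools import product
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical stand-in for a thesaurus or a WordNet/MindNet-style lexicon.
LEXICON: Dict[str, List[str]] = {
    "eat": ["consume"],
    "spider": ["arachnid", "wolf_spider"],
    "victim": ["prey"],
}


def expand(word: str) -> List[str]:
    """Return the word itself followed by its semantically related words."""
    return [word] + LEXICON.get(word, [])


def lexical_paraphrases(triple: Triple) -> List[Triple]:
    """Generate paraphrased triples that keep the original structural relation."""
    head, relation, dependent = triple
    return [
        (h, relation, d)
        for h, d in product(expand(head), expand(dependent))
        if (h, relation, d) != triple  # exclude the original form itself
    ]
```

Taking every combination of expanded words is a design choice of this sketch; an implementation could equally generate only a subset of the combinations.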

【０１１４】したがって、例えば、入力クエリが、 How do spiders eat their victims? （蜘蛛はその獲物をどのようにして食べるのか）である場合、このクエリに基づいて発生される元の論理形態は、次の通りである
。 eat;Dsub;spider eat;Dobj;victim 単語”eat”（食べる）の語彙上または意味上の拡大によって、”consume”（
消費する）が得られる。また、単語”spider”（蜘蛛）の語彙上または意味上の
拡大によって、”arachnid”（蛛形）および”wolf spider”（ウルフ・スパイダ）が得られる。一方、これらの拡大は、以下のように、eat;Dsub;spiderに対する追加の言い換え論理形態に至る。 consume;Dsub;spider eat;Dsub;arachnid consume;Dsub;arachnid eat;Dsub;wolf＿spider consume;Dsub;wolf＿spider 同様に、”victim”の語彙上または意味上の拡大によって、”prey”（祈り）
が得られる。したがって、論理形態eat;Dobj;victimに基づく言い換え論理形態は、次のようになる。 consume;Dobj;victim eat;Dobj;prey この技法は、クエリに基づいて戻される、関連文書を保持する傾向がある。し
たがって、この技法は、正確性を低下させることなく、この文書集合内における
回収率を高める。（構造上の言い換え）元の論理形態を語彙に関して拡大した後、これらを構造的に拡大し、追加の言
い換え論理形態を得る。サーチ・エンジンが戻した関連文書は、先に言及して本
願にも含まれるものとした引例に記載されている一層厳格な技法を用いると、ク
エリ内にある含有単語が文書中の単一の文章の中にあっても、破棄される場合が
ある。これが発生するのは、典型的に、クエリと文書の文章との間に構文上また
は意味上の言い換え関係が存在するが、クエリに基づいた論理形態および文書に
基づいた論理形態は正確には一致しない場合である。Thus, for example, if the input query is

  How do spiders eat their victims?

the original logical forms generated from this query are:

  eat;Dsub;spider
  eat;Dobj;victim

Lexical or semantic expansion of the word "eat" yields "consume". Lexical or semantic expansion of the word "spider" yields "arachnid" and "wolf spider". These expansions in turn lead to the following additional paraphrased logical forms for eat;Dsub;spider:

  consume;Dsub;spider
  eat;Dsub;arachnid
  consume;Dsub;arachnid
  eat;Dsub;wolf_spider
  consume;Dsub;wolf_spider

Similarly, lexical or semantic expansion of "victim" yields "prey". The paraphrased logical forms based on the logical form eat;Dobj;victim are therefore:

  consume;Dobj;victim
  eat;Dobj;prey

This technique tends to retain the relevant documents returned for the query. It therefore increases recall within this document set without reducing precision.

(Structural Paraphrase)
After the original logical forms have been expanded lexically, they are expanded structurally to obtain additional paraphrased logical forms. Under the more stringent techniques described in the references mentioned above and incorporated herein, relevant documents returned by the search engine may be discarded even though the content words of the query occur within a single sentence of the document. This typically happens when a syntactic or semantic paraphrase relationship exists between the query and the document sentence, but the logical forms based on the query and those based on the document do not match exactly.

【０１１５】これらの基準を満たす文書を正しく保持するために、論理形態変更部に構造的
言い換え規則を実装し、元の論理形態に基づく追加の論理形態を発生する。この
追加の論理形態は、通常の構文上／意味上の言い換え関係を取り込み、ユーザが
クエリをどのように表現したかと、関連文書は同様の概念をどのように表現する
かとの間の相違を正規化することを意図するものである。これを行なうために、
論理形態変更部は、元の入力テキストに基づいて発生した基本論理形態を増強す
る。To correctly retain documents that meet these criteria, structural paraphrase rules are implemented in the logical form modifiers to generate additional logical forms based on the original logical forms. These additional logical forms are intended to capture regular syntactic and semantic paraphrase relationships, and to normalize differences between how a user phrases a query and how a relevant document expresses the same concept. To do this, the logical form modifier augments the basic logical forms generated from the original input text.

【０１１６】例えば、元のクエリが、 How many moons does Jupiter have？（木星には月がいくつあるか）とすると、このクエリに基づく元の論理形態三連体は、 have;Dsub;Jupiter have;Dobj;moon moon;Ops;many となる。For example, if the original query is

  How many moons does Jupiter have?

the original logical form triples based on this query are:

  have;Dsub;Jupiter
  have;Dobj;moon
  moon;Ops;many

【０１１７】ここで、Ｏｐｓは機能語関係である。本発明の一態様による構造的言い換え規則を実装することによって、論理形態
変更部は、次のような追加の論理形態を発生する。 moon;PossBy;Jupiter 含有単語は、元の論理形態と同一であるが、構造的接続が異なるものの関係の
ある構造的接続であることがわかる。これによって、同じ論理形態を含むインデ
ックス化文書に対する照合が可能となる。Here, Ops is a function word relation. By implementing a structural paraphrase rule according to one aspect of the present invention, the logical form modifier generates the following additional logical form:

  moon;PossBy;Jupiter

Note that the content words are the same as in the original logical forms, while the structural connection is different but related. This allows matching against indexed documents that contain the same logical form.
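A structural paraphrase rule of this kind can be sketched as a small rewrite over a set of triples. The trigger condition used here, a verb "have" with both a Dsub and a Dobj, is inferred from the single Jupiter example and is an assumption, as is the function name.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def possession_paraphrases(triples: List[Triple]) -> List[Triple]:
    """For 'X has Y' (have;Dsub;X and have;Dobj;Y), add the form Y;PossBy;X."""
    subjects = [dep for head, rel, dep in triples if head == "have" and rel == "Dsub"]
    objects = [dep for head, rel, dep in triples if head == "have" and rel == "Dobj"]
    return [(obj, "PossBy", subj) for subj in subjects for obj in objects]


original = [("have", "Dsub", "Jupiter"), ("have", "Dobj", "moon"), ("moon", "Ops", "many")]
augmented = original + possession_paraphrases(original)
```

The generated form is added alongside the original triples rather than replacing them, so exact matches remain possible.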

【０１１８】Other structural paraphrase rules can be more complex. For example, suppose the input query is "Find me information on the crystallization of viruses." This yields the following computed logical form triple:
crystallization;of;virus
To match the query against relevant documents containing sentences that describe how "viruses crystallize", several pieces of information may need to be considered. These include the following.

【０１１９】1. A regular paraphrase relationship exists between Dsub/verb and certain English nominalizations. 2. The noun "crystallization" is identified in the default dictionary as having the verb base "crystallize".

【０１２０】3. "virus" is classified as animate in the dictionary. Together, these pieces of information allow an additional structural paraphrase logical form to be hypothesized for the query and generated for matching:
crystallize;Dsub;virus
The animacy of "virus" is used to predict whether the paraphrase should be expressed as a subject relation or an object relation. Cross-linguistically, animate entities are more likely than inanimate ones to be the subject (agent) of a verb. Thus, if the query asked about "crystallization of sugar", the additional paraphrase logical form crystallize;Dobj;sugar would be generated.
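The three pieces of information listed above can be combined as in the following Python sketch. It is an illustration under assumptions, not the patented implementation: `VERB_BASE` and `ANIMATE` are hypothetical stand-ins for the dictionary lookups, and `nominalization_paraphrase` is a hypothetical function name.

```python
# Hypothetical lexicon entries standing in for dictionary lookups.
VERB_BASE = {"crystallization": "crystallize"}  # noun -> verb base
ANIMATE = {"virus"}                             # words marked animate


def nominalization_paraphrase(triple):
    """Turn noun;of;X into verb;Dsub;X (animate X) or verb;Dobj;X (inanimate X).

    Returns None when the rule does not apply.
    """
    head, relation, tail = triple
    verb = VERB_BASE.get(head)
    if relation != "of" or verb is None:
        return None
    # Animate entities are more likely to be the subject (agent) of the verb.
    role = "Dsub" if tail in ANIMATE else "Dobj"
    return (verb, role, tail)


virus_form = nominalization_paraphrase(("crystallization", "of", "virus"))
sugar_form = nominalization_paraphrase(("crystallization", "of", "sugar"))
```

With these assumed lexicon entries, the virus query yields a Dsub form and the sugar query yields a Dobj form, mirroring the two examples in the text.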

【０１２１】Various logical form paraphrase rules have been implemented to normalize a number of syntactic paraphrase relationships. These include the following. 1. Possessive constructions. 2. Nominalization/verb object and subject, and noun compound/verb object (e.g., "program computers" and "computer program").

【０１２２】3. Noun modifiers (e.g., "King of Spain" and "Spanish King"). 4. Reciprocal constructions (e.g., "John kissed Mary" and "Mary kissed John"). 5. Attributive/predicative adjectives (e.g., "That woman is tall" and "That tall woman").

【０１２３】6. Light verb constructions/verbs (e.g., "The president made a decision" and "The President decided"). Appendix A contains code illustrating example implementations of the rules described above. In each case, these rules allow more relevant documents to be retained while still keeping the matching process tightly constrained. Structural expansion, or structural paraphrasing, of the original structural relationships is indicated by block 1942 in FIG. 19. The paraphrase rules discussed above, and other such rules, can be obtained empirically or by any other suitable means.

【０１２４】Structural paraphrasing can be implemented on both the indexing side and the query side of the information retrieval system, but implementing it on the indexing side may undesirably increase the size of the index. Thus, in one illustrative embodiment, structural paraphrasing is implemented only on the query side of the information retrieval system.

【０１２５】It should also be noted that structural paraphrasing can be performed either before or after the semantic paraphrasing indicated by blocks 138 and 140. In addition, structural paraphrasing can be applied to the additional logical forms generated during semantic expansion. This is indicated by blocks 1944 and 1946. (Meta-structure paraphrase) A further set of paraphrase logical forms that the logical form modifier can generate involves abstract logical forms. For example, even when users are encouraged to enter natural language queries into a search engine, many users still do not provide well-formed queries in which multiple content words stand in a syntactic/semantic relationship of interest. Rather, many queries fall into a category referred to here as "keyword queries". Such keyword queries include true keyword queries such as "dog", "gardening", "The Renaissance", and "Buffalo Bill". A keyword query may also take the form of a keyword embedded in a stereotypical "frame" sentence that provides no useful linguistic context, such as "Tell me about dogs", "I want information on gardening", and "What do you have on dinosaurs?". Because such queries are common, the present invention includes matching techniques that address them.

【０１２６】First, as indicated by block 1948 in FIG. 19, the query is identified as a keyword query based on its structure. A query is identified as a keyword query if it consists of only one content word (or a sequence of content words treated as a compound content word, also known as a multi-word expression), or if it contains one or more content words occurring within an explicitly identified common query structure. An example of a multi-word expression is "Buffalo Bill", which is treated as a single word with internal structure.
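A rough sketch of this identification step follows. The text identifies keyword queries from the structure of the parsed query; the sketch below substitutes simple regular expressions over the raw query string for that structural analysis, purely for illustration. The frame patterns, the `MULTIWORD` set, and the function name `keyword_of` are assumptions and are not part of the described system.

```python
import re

# Hypothetical "frame" patterns built from the example queries in the text.
FRAMES = [
    re.compile(r"^tell me about (?P<kw>.+?)[.?!]?$", re.IGNORECASE),
    re.compile(r"^i want information on (?P<kw>.+?)[.?!]?$", re.IGNORECASE),
    re.compile(r"^what do you have on (?P<kw>.+?)[.?!]?$", re.IGNORECASE),
]

# Hypothetical list of multi-word expressions treated as single content words.
MULTIWORD = {"buffalo bill", "the renaissance"}


def keyword_of(query):
    """Return the keyword if the query looks like a keyword query, else None."""
    text = query.strip()
    # Case 1: a keyword embedded in an explicitly listed frame sentence.
    for frame in FRAMES:
        match = frame.match(text)
        if match:
            return match.group("kw")
    # Case 2: a bare single word or a known multi-word expression.
    bare = text.rstrip(".?!")
    if bare.lower() in MULTIWORD or len(bare.split()) == 1:
        return bare
    return None
```

For instance, `keyword_of("Tell me about dogs")` returns `"dogs"`, while a full question such as "How many moons does Jupiter have?" returns `None` and would be handled as an ordinary natural language query.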

【０１２７】The following rule gives an example describing the structure used to identify a keyword query of the form "Who was Buffalo Bill": if the verb is "be", the Dnom (deep nominative) is "who", or the Dsub is not syntactically modified (apart from a preceding determiner or prepositional phrase), the Dsub is treated as a keyword for matching purposes. Once a query has been identified as a keyword query, various abstract logical forms are generated for matching purposes. In the example above, where the query is "Who was Buffalo Bill?", the following abstract logical forms are generated:
heading_OR_title;Dsub;Buffalo_Bill
Dsub_of_be;Dsub;Buffalo_Bill
Dsub_of_verb;Dsub;Buffalo_Bill
These abstract logical forms do not correspond directly to any of the original logical forms generated from the query. However, they can potentially match corresponding logical forms that were created in the document record when the document was indexed and stored in the statistical data store 1486. For example, when processing a document titled "Buffalo Bill", the logical form modifier 1594 shown in FIG. 15 generates the following abstract logical form and adds it to the index in the structural data store 1596:
heading_or_title;Dsub;Buffalo_Bill
Also, during document indexing, any logical form containing the verb "be" and a Dsub produces a special logical form of the following kind:
Dsub_of_be;Dsub;WORD (e.g., Dsub_of_be;Dsub;Buffalo_Bill)
In addition, if a logical form contains a Dsub and a verb other than "be", an additional abstract logical form is created as follows:
Dsub_of_verb;Dsub;WORD (e.g., Dsub_of_verb;Dsub;Buffalo_Bill)
In this way, the abstract logical forms created at indexing time, and at query time for keyword queries, allow the information retrieval system to exploit linguistic structure on the data side to identify documents that are likely to be primarily about the keyword contained in the keyword query (for example, the data-side abstract logical forms represent the meta-structure of documents that can be matched against a keyword query).
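The generation and matching of abstract logical forms described above can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the label strings are normalized to a single spelling, triples are plain tuples, and the triples used for the two example documents are invented for the demonstration.

```python
ABSTRACT_LABELS = ("heading_OR_title", "Dsub_of_be", "Dsub_of_verb")


def query_abstract_forms(keyword):
    """Abstract logical forms hypothesized for a keyword query."""
    return {(label, "Dsub", keyword) for label in ABSTRACT_LABELS}


def index_abstract_forms(title, triples):
    """Abstract logical forms added to a document record at indexing time."""
    forms = set()
    if title:
        forms.add(("heading_OR_title", "Dsub", title))
    for head, relation, tail in triples:
        if relation == "Dsub":
            # Subjects of "be" get their own label; other verbs share one.
            label = "Dsub_of_be" if head == "be" else "Dsub_of_verb"
            forms.add((label, "Dsub", tail))
    return forms


query_forms = query_abstract_forms("Buffalo_Bill")

# Hypothetical document 1: the keyword is the subject of "be".
doc1 = index_abstract_forms(None, [("be", "Dsub", "Buffalo_Bill")])
# Hypothetical document 2: the keyword occurs, but not as a subject.
doc2 = index_abstract_forms(
    None, [("watch", "Dsub", "audience"), ("watch", "Dobj", "Buffalo_Bill")]
)

matches_doc1 = query_forms & doc1  # one abstract form in common
matches_doc2 = query_forms & doc2  # nothing in common
```

Only the first document shares an abstract form with the query, which is the effect the text describes: documents in which the keyword is a sentence subject or a title are favored over documents that merely mention it.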

【０１２８】In addition, even if a document does not have a title containing the keyword, the sentences in the document can be analyzed to determine the meta-structure of the document. For example, the subject of a sentence, particularly of a sentence whose main verb is "be", is often the topic of that sentence. By preferentially matching a keyword query against documents containing sentences about the keyword, precision can be improved even for keyword queries. For example, suppose the query is "Buffalo Bill", a first document contains the sentence
Buffalo Bill was a showman, usually acting as the part of himself in one of Buntline's melodramas.
and a second document contains the sentence
One of the most active performers in American cinema, Keitel demonstrated his versatile talents in the 1970's in drama, Alice Doesn't Live Here Anymore (1974); an artful western, Buffalo Bill and the Indians, or Sitting Bull's History Lesson (1976); and a black comedy, Mother, Jugs, and Speed (1976).
In that case, the abstract logical forms generated for the documents at indexing time, and those generated for the keyword query at query time, cause the keyword query to match the first document in preference to the second. This is because the first document contains the keyword as the subject of a sentence, whereas the second does not.

【０１２９】Additional examples of abstract logical forms are those created from definitional sentences. An example of a definitional sentence is:
Lava, molten rock which flows from volcanoes
Definitional sentences of this kind can be identified by testing for cues that include linguistic structure and formatting structure. Most often, such a sentence is parsed as a noun phrase containing a single noun or multi-word expression, followed by a comma, followed by an appositive noun phrase. This yields an abstract logical form of the form:
article_title_or_heading;Dsub;lava

【０１３０】This indicates the meta-structure (or overall content) of the document and can be used to match keyword queries that call for such sentences. Obtaining abstract logical forms that indicate the meta-structure of the document, and obtaining abstract logical forms for the keyword query, are indicated by blocks 1950 and 1952 in FIG. 19. (Suppression of certain logical forms) According to another aspect of the present invention, the logical form modifiers 1594 and 1698 also suppress certain classes of logical forms. For example, some logical forms are not good discriminators of relevant documents and produce false positive matches. Typically, such logical forms correspond to high-frequency logical forms such as "be;Locn;there". This class of logical forms can be thought of as the syntactic/semantic analogue of the "stopwords" found in Boolean retrieval systems. Additional examples of this class include the following.
Certain verb/particle combinations: come;Ptcl;to (I came to a decision, John came to a stop.)
High-frequency verbs: be;Dsub;John (John is tired, John is the largest elephant in the world)
Pronouns: eat;Dsub;he (he ate at home)
Common logical forms: tell;Dobj;me (tell me about dogs)
These and other such logical forms can be identified and compiled empirically or by any other suitable means, and typically correspond to logical forms that lead to incorrect matches. According to one aspect of the present invention, this class of logical forms is identified and suppressed in the query, in the document record, or in both. This is indicated by blocks 1954 and 1956 in FIG. 19.
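The suppression step can be pictured as a stop list applied to triples rather than to words. The following Python sketch is illustrative only: the contents of `SUPPRESSED_TRIPLES`, `HIGH_FREQUENCY_VERBS`, and `PRONOUNS` are assumptions seeded from the examples in the text, and `suppress` is a hypothetical function name.

```python
# Hypothetical stop list seeded with the examples given in the text.
SUPPRESSED_TRIPLES = {
    ("be", "Locn", "there"),
    ("come", "Ptcl", "to"),
    ("tell", "Dobj", "me"),
}
HIGH_FREQUENCY_VERBS = {"be"}
PRONOUNS = {"i", "me", "you", "he", "she", "it", "we", "they"}


def suppress(triples):
    """Drop logical forms that are poor discriminators of relevance."""
    kept = []
    for head, relation, tail in triples:
        if (head, relation, tail) in SUPPRESSED_TRIPLES:
            continue  # explicitly listed high-frequency form
        if head in HIGH_FREQUENCY_VERBS and relation == "Dsub":
            continue  # e.g. be;Dsub;John
        if relation == "Dsub" and tail.lower() in PRONOUNS:
            continue  # e.g. eat;Dsub;he
        kept.append((head, relation, tail))
    return kept


filtered = suppress([
    ("be", "Locn", "there"),
    ("be", "Dsub", "John"),
    ("eat", "Dsub", "he"),
    ("eat", "Dobj", "prey"),
])
```

In this sketch only eat;Dobj;prey survives, since the other three triples fall into the suppressed classes. The same function could be applied to query triples, to document triples, or to both.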

【０１３１】In addition, some such logical forms can be suppressed during generation of the logical forms based on the query. For example, a logical form of the form "give;Dobj;information", if not suppressed during document indexing, can be useful in matching a query such as "What databases give information on cancer?". In that case the user is asking for the identity of particular databases, and the query is quite specific. On the other hand, a logical form of the form "give;Dobj;information" is suppressed during processing of a query of the form "give me information on X". Such a query is identified as a keyword query, and the identified logical form is suppressed.
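The query-dependent suppression described above can be sketched as a filter that is applied only when the query has been identified as a keyword query. The names `QUERY_FRAME_TRIPLES` and `query_side_forms` are hypothetical, and the triples shown for the two example queries are assumed for illustration rather than taken from an actual parse.

```python
# Hypothetical set of frame triples that carry no content in keyword queries.
QUERY_FRAME_TRIPLES = {("give", "Dobj", "information")}


def query_side_forms(triples, is_keyword_query):
    """Suppress frame triples only for queries identified as keyword queries."""
    if not is_keyword_query:
        return list(triples)
    return [t for t in triples if t not in QUERY_FRAME_TRIPLES]


# "What databases give information on cancer?" (a specific query)
specific = query_side_forms(
    [("give", "Dsub", "database"),
     ("give", "Dobj", "information"),
     ("information", "on", "cancer")],
    is_keyword_query=False,
)

# "Give me information on X" (a keyword query wrapped in a frame)
framed = query_side_forms(
    [("give", "Dobj", "information"), ("information", "on", "X")],
    is_keyword_query=True,
)
```

The give;Dobj;information triple is kept for the specific query, where it helps discriminate, and dropped for the framed keyword query, where it would only add spurious matches.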

【０１３２】After all of the logical forms and modified logical forms have been obtained through lexical and semantic paraphrasing, structural paraphrasing, generation of abstract logical forms, and suppression of logical forms, the set of modified logical forms is supplied to the filter 1602 for further processing. This is indicated by block 1958 in FIG. 19. As discussed above, the filter 1602 looks for matches between the modified logical forms based on the query and those based on the documents. (Conclusion) Thus, it can be seen that the present invention provides a system for determining the similarity between two or more text inputs. Further, one aspect of the present invention is suited to greatly improving precision in information retrieval applications by identifying, within the set of documents returned by a search engine, documents that are more relevant than those identified by previous techniques. The present invention also increases recall by reducing the number of relevant documents discarded during filtering.

【０１３３】One aspect of the present invention illustratively creates and compares logical forms based on two text inputs, and creates paraphrase logical forms by lexically or semantically expanding the original words, by structurally expanding the original structural connections, and/or by creating abstract logical forms that indicate the meta-structure of either or both of the text inputs (e.g., the document, the query, or both). The present invention also illustratively suppresses certain logical forms. Of course, the paraphrasing and suppression need not be the same for both sets of logical forms and can differ for each.

【０１３４】It should also be noted that hashing techniques are currently employed to reduce the index contained in the statistical data store 86 to a smaller size. Of course, any suitable hashing technique can be used. The present invention can be used equally well with a hashed representation of the index or with a full representation of the index.
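As one way to picture a hashed index, the sketch below maps each logical form triple to a fixed-width integer key and stores document ids under that key. The text does not specify a hashing technique; the use of SHA-1 from Python's `hashlib`, the 32-bit key width, and the function names are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict


def triple_key(triple, bits=32):
    """Map a logical form triple to a fixed-width integer key."""
    text = ";".join(triple).encode("utf-8")
    digest = hashlib.sha1(text).digest()
    return int.from_bytes(digest[:8], "big") % (1 << bits)


def build_hashed_index(doc_triples):
    """Build key -> document ids; doc_triples maps a document id to its triples."""
    index = defaultdict(set)
    for doc_id, triples in doc_triples.items():
        for triple in triples:
            index[triple_key(triple)].add(doc_id)
    return index


def lookup(index, triple):
    """Return ids of documents whose hashed index contains the triple's key."""
    return index.get(triple_key(triple), set())


index = build_hashed_index({
    "doc1": [("moon", "PossBy", "Jupiter")],
    "doc2": [("crystallize", "Dsub", "virus")],
})
hits = lookup(index, ("moon", "PossBy", "Jupiter"))
```

Storing integer keys instead of full triples keeps the index small, at the cost that two different triples can occasionally share a key and produce a spurious match; a wider key reduces that risk.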

【０１３５】Although the present invention has been described with reference to preferred embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.

【０１３６】 to 【０１４０】 [Table 5]

[Brief description of the drawings]

FIG. 1 shows a top-level block diagram of an information retrieval system 5 according to our present invention.

FIG. 2 shows a high-level embodiment of an information retrieval system 200, of the type shown in FIG. 1, that utilizes the teachings of our present invention.

FIG. 3 shows a block diagram of computer system 300, specifically a client personal computer, contained within the system 200 shown in FIG. 2.

FIG. 4 shows a top-level block diagram of application program 400 that executes within the computer 300 shown in FIG. 3.

FIGS. 5A to 5D show different corresponding examples of English sentences of varying complexity and their corresponding logical form elements.

FIG. 6 shows the correct alignment of the drawing sheets for FIGS. 6A and 6B.

FIGS. 6A and 6B show a flowchart of search process 600 of our invention.

FIG. 7 shows a flowchart of NLP routine 700 that is executed within process 600.

FIG. 8A shows an illustrative matching logical form triple weighting table 800.

FIG. 8B graphically depicts the logical form triple comparison and the document scoring, ranking and selection processes, in accordance with the teachings of our invention, that occur in blocks 650, 660, 665 and 670, all shown in FIGS. 6A and 6B, for an illustrative query and an illustrative set of three statistically retrieved documents.

FIGS. 9A, 9B and 9C show three different embodiments of an information retrieval system that incorporates the teachings of our present invention.

FIG. 9D shows an alternative embodiment of the remote computer (server) 930 shown in FIG. 9C, for use in implementing yet another embodiment of our present invention.

FIG. 10 shows the correct alignment of the drawing sheets for FIGS. 10A and 10B.

FIGS. 10A and 10B show yet another embodiment of our present invention, in which the logical form triples for each document are precomputed, stored with the document record, and accessed during subsequent document retrieval.

FIG. 11 shows triple generation process 1100, which is performed by the document indexing engine 1015 shown in FIGS. 10A and 10B.

FIG. 12 shows the correct alignment of the drawing sheets for FIGS. 12A and 12B.

FIGS. 12A and 12B show a flowchart of search process 1200 of our invention, which executes within the computer system 300 shown in FIGS. 10A and 10B.

FIG. 13A shows a flowchart of NLP routine 1300 that is executed within triple generation process 1100.

FIG. 13B shows a flowchart of NLP routine 1350 that is executed within search process 1200.

FIG. 14 is a functional block diagram showing one embodiment of the present invention.

FIG. 15 is a functional block diagram illustrating the indexing of documents according to one aspect of the present invention.

FIG. 16 is a more detailed block diagram of a search engine according to one aspect of the present invention.

FIG. 17 is a flow diagram showing the operation of the system shown in FIG. 16.

FIG. 18 is a flow diagram illustrating logical form modification by a natural language processor according to one aspect of the present invention.

FIG. 19 is a more detailed block diagram illustrating logical form modification by a natural language processor according to one aspect of the present invention.

[Procedural amendment]

[Filing date] March 10, 2000

[Amendment 1]

[Document to be amended] Drawings

[Item to be amended] All figures

[Method of amendment] Change

[Content of amendment]

[Procedural amendment]

[Filing date] October 27, 2000

[Amendment 1]

[Document to be amended] Drawings

[Item to be amended] All figures

[Method of amendment] Change

[Content of amendment]

FIGS. 1 to 19

Continuation of front page
(81) Designated states: EP (AT, BE, CH, CY, DE, DK, ES, FI, FR, GB, GR, IE, IT, LU, MC, NL, PT, SE), CN, JP
(72) Inventor: Dolan, William B., 7412 One Hundred Fifty-Third Court Northeast, Redmond, Washington 98052, USA
(72) Inventor: Vanderwende, Lucy H., 16415 Northeast Thirtieth Street, Bellevue, Washington 98008, USA
(72) Inventor: Braden-Harder, Lisa, 12003 Creek Bend Drive, Reston, Virginia 20194, USA
F-terms (reference): 5B075 ND03 NR12 PQ74 PR04 PR06 QM05 QP03; 5B091 AA11 AA15 AB17 BA03

Claims

[Claims]

1. A method for determining a similarity between a first and a second text input, comprising: obtaining a first set of logical forms based on the first text input; obtaining a second set of logical forms based on the second text input; comparing the first and second sets of logical forms; and determining a similarity between the first and second text inputs based on the comparing step.

2. The method of claim 1, wherein the comparing step comprises determining whether any logical form in the first set matches any logical form in the second set.

3. The method of claim 2, wherein determining similarity comprises assigning a score that reflects a degree of similarity between the first and second text inputs based on matches between the first and second sets of logical forms.

4. The method of claim 1, further comprising obtaining a first set of paraphrase logical forms based on the first set of logical forms.

5. The method of claim 4, wherein the comparing step comprises: comparing the first set of paraphrase logical forms with the second set of logical forms; and determining whether any paraphrase logical form in the first set of paraphrase logical forms matches any logical form in the second set of logical forms.

6. The method of claim 5, further comprising obtaining a second set of paraphrase logical forms based on the second set of logical forms.

7. The method of claim 6, wherein the comparing step further comprises: comparing the first set of paraphrase logical forms with the second set of paraphrase logical forms; and determining whether any paraphrase logical form in the first set of paraphrase logical forms matches any paraphrase logical form in the second set of paraphrase logical forms.

8. The method of claim 1, wherein the first text input comprises an information retrieval query and the second text input comprises at least one document retrieved based on the query.

9. The method of claim 1, wherein the second text input comprises an information retrieval query and the first text input comprises at least one document retrieved based on the query.

10. The method of claim 5, wherein obtaining the first set of logical forms comprises: obtaining a prototype word and a prototype structural relationship between the prototype words based on the first text input. A method characterized by comprising:

11. The method of claim 10, wherein the prototype structural relationship comprises a prototype structural relationship between the prototype words, and wherein obtaining a first paraphrase logical form set comprises: Obtaining additional logical forms, including expanded words, connected in said archetypal structural relationship.

12. The method of claim 11, wherein the prototype words include a first prototype word and a second prototype word connected by the prototype structure relationship.
Obtaining an additional logical form includes expanding the first prototype word with respect to vocabulary and including a first related word that is semantically related to the first prototype word; expanding the second prototype word with respect to vocabulary Including a second related word semantically related to the second prototype word; connecting different ones of the first and second related words to each other according to the prototype structural relationship to obtain the additional logical form. A method comprising at least one of:

13. The method of claim 12, wherein expanding the first prototype word with respect to vocabulary, or expanding the second prototype word with respect to vocabulary, comprises obtaining synonyms for the first and second prototype words.

14. The method of claim 12, wherein expanding the first prototype word with respect to vocabulary, or expanding the second prototype word with respect to vocabulary, comprises obtaining hypernyms (broader terms) for the first and second prototype words.

15. The method of claim 12, wherein expanding the first prototype word with respect to vocabulary, or expanding the second prototype word with respect to vocabulary, comprises obtaining hyponyms (narrower terms) for the first and second prototype words.
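
The lexical expansion recited in claims 12 to 15 can be pictured with a minimal sketch. It is illustrative only and assumes several things the claims do not specify: a logical form is modeled as a (head, relation, dependent) triple, the relation label "Dobj" is a placeholder, and `RELATED_WORDS` is a hypothetical toy lexicon standing in for whatever lexical resource an implementation would use.

```python
from itertools import product
from typing import NamedTuple


class LogicalForm(NamedTuple):
    """Assumed representation: two words joined by a structural relationship."""
    head: str
    relation: str
    dependent: str


# Hypothetical toy lexicon; a real system would consult a lexical database.
RELATED_WORDS = {
    "buy": {"synonyms": ["purchase"], "hypernyms": ["acquire"], "hyponyms": []},
    "car": {"synonyms": ["automobile"], "hypernyms": ["vehicle"], "hyponyms": ["sedan"]},
}


def expand_word(word):
    """Return the prototype word followed by its synonyms, hypernyms and hyponyms."""
    entry = RELATED_WORDS.get(word, {})
    related = [word]
    for kind in ("synonyms", "hypernyms", "hyponyms"):
        related.extend(entry.get(kind, []))
    return related


def paraphrase(form):
    """Connect every pair of related words with the unchanged prototype relationship."""
    heads = expand_word(form.head)
    dependents = expand_word(form.dependent)
    return {LogicalForm(h, form.relation, d) for h, d in product(heads, dependents)}


paraphrases = paraphrase(LogicalForm("buy", "Dobj", "car"))
```

In this sketch the prototype logical form itself remains a member of the resulting set, so an exact match against the second text input is still possible alongside matches on related words.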

16. The method of claim 10, wherein obtaining a first paraphrase logical form set comprises: obtaining an expanded structural relationship related to the prototype structural relationship; and connecting the prototype words with the expanded structural relationship to obtain the paraphrased logical forms.

17. The method of claim 16, wherein obtaining a first set of logical forms further comprises: obtaining expanded words semantically related to the prototype words; and connecting the expanded words.

18. The method of claim 17, wherein obtaining the first set of paraphrased logical forms further comprises connecting the expanded words with the expanded structural relationship.

19. The method of claim 10, wherein the first set of logical forms includes at least one content word, and wherein obtaining a first set of paraphrased logical forms comprises obtaining a first set of abstract logical forms based on the content word.

20. The method of claim 19, wherein the first text input comprises a document search query, and wherein obtaining a first set of abstract logical forms comprises: identifying the query as a keyword query prior to generating the first set of abstract logical forms; and causing the content word to be modified by other content words based on the structure of the query.

21. The method of claim 10, wherein the second text input comprises a document, the method further comprising obtaining a second paraphrase logical form set based on the second logical form set.

22. The method of claim 21, wherein obtaining the second set of logical forms comprises obtaining an abstract set of logical forms indicating a meta-structure of the document.

23. The method of claim 22, wherein the meta-structure of the document indicates a general subject of the document.

24. The method of claim 23, wherein the step of obtaining the set of abstract logical forms indicating a meta-structure of the document comprises obtaining the set of abstract logical forms based on format information corresponding to the document.

25. The method of claim 23, wherein the step of obtaining the set of abstract logical forms indicating a meta-structure of the document comprises obtaining the set of abstract logical forms based on a title of a sentence in the document.

26. The method of claim 23, wherein the step of obtaining the set of abstract logical forms indicating a meta-structure of the document comprises obtaining the set of abstract logical forms based on the subject of a sentence in the document.

27. The method of claim 21, further comprising the step of suppressing logical forms other than the first and second paraphrase logical form sets based on the content words.

28. A method for filtering documents in a document set retrieved from a document storage device in response to a query, the method comprising: obtaining a first set of logical forms based on the query and a selected one of the documents in the document set; obtaining a second set of logical forms based on the query and another of the documents in the document set; obtaining a first set of paraphrased logical forms indicating at least a paraphrase of the first set of logical forms; and filtering the documents in the document set based on a predetermined relationship between the first set of paraphrased logical forms and the second set of logical forms.

29. The method of claim 28, wherein the step of filtering includes the step of providing an output indicating a ranking order of the documents in the document set based on the predetermined relationship.
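
The filtering and ranking of claims 28 and 29 might be sketched as follows. This is a simplified illustration, not the claimed method itself: it assumes the "predetermined relationship" is the number of logical forms shared between the paraphrased query-side set and each document's set, that logical forms are plain (head, relation, dependent) tuples, and that documents with no shared form are filtered out. The function name `filter_and_rank` and the relation label "Dobj" are hypothetical.

```python
def filter_and_rank(paraphrased_forms, document_forms):
    """Keep documents sharing at least one logical form; rank by match count.

    paraphrased_forms: set of (head, relation, dependent) tuples.
    document_forms: dict mapping a document id to its set of such tuples.
    """
    scored = []
    for doc_id, forms in document_forms.items():
        matches = len(paraphrased_forms & forms)  # assumed relationship: overlap size
        if matches > 0:
            scored.append((doc_id, matches))
    # Highest match count first; ties are broken by document id for stable output.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in scored]


query_side = {("buy", "Dobj", "car"), ("purchase", "Dobj", "car")}
documents = {
    "d1": {("purchase", "Dobj", "car"), ("buy", "Dobj", "car")},
    "d2": {("sell", "Dobj", "car")},
    "d3": {("buy", "Dobj", "car")},
}
ranking = filter_and_rank(query_side, documents)
```

Here "d2" shares the words "car" and "Dobj" with the query side but no complete logical form, so it is dropped, which is the behavior that distinguishes logical form matching from keyword matching.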

30. A method for filtering documents in a document set retrieved from a document storage device in response to a query, the method comprising: obtaining a first set of logical forms based on the query and a selected one of the documents in the document set; obtaining a second set of logical forms based on the query and another one of the documents in the document set; suppressing at least a first predetermined class of logical forms in the first set of logical forms to obtain a first set of suppressed logical forms; and filtering the documents in the document set based on a predetermined relationship between the first set of suppressed logical forms and the second set of logical forms.

31. The method of claim 30, wherein the step of suppressing comprises the step of suppressing logical forms having a predetermined structure.

32. The method of claim 30, wherein the step of suppressing comprises the step of suppressing logical forms that occur at a frequency above a threshold frequency level.

33. The method of claim 30, further comprising the step of suppressing a second predetermined class of logical forms in the second set of logical forms, wherein the second predetermined class is different from the first predetermined class.

34. The method of claim 30, wherein the step of suppressing is performed before the step of obtaining the first set of logical forms.

35. The method of claim 30, wherein the step of suppressing is performed substantially simultaneously with the step of obtaining the first set of logical forms.

36. The method of claim 30, wherein the step of suppressing is performed after the step of obtaining the first set of logical forms.
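
The suppression of claims 31 and 32 can be illustrated with a short sketch that runs after the logical forms have been obtained, as in claim 36. The assumptions are labeled here and in the comments: "frequency" is taken to be the fraction of documents in which a logical form occurs, "predetermined structure" is approximated by a set of relation labels to discard, and the names `suppress_forms`, "Dsub", "Dobj" and "Mod" are hypothetical.

```python
def suppress_forms(forms, frequency, threshold, suppressed_relations=frozenset()):
    """Drop logical forms that are too frequent or that have a suppressed structure.

    forms: set of (head, relation, dependent) tuples.
    frequency: dict mapping a form to its assumed document frequency in [0, 1];
               forms missing from the dict are treated as frequency 0.
    threshold: forms with frequency strictly above this value are suppressed.
    suppressed_relations: relation labels standing in for a "predetermined structure".
    """
    kept = set()
    for form in forms:
        too_frequent = frequency.get(form, 0.0) > threshold
        bad_structure = form[1] in suppressed_relations
        if not (too_frequent or bad_structure):
            kept.add(form)
    return kept


forms = {("be", "Dsub", "it"), ("buy", "Dobj", "car"), ("car", "Mod", "red")}
frequency = {("be", "Dsub", "it"): 0.9, ("buy", "Dobj", "car"): 0.1}
kept = suppress_forms(forms, frequency, threshold=0.5, suppressed_relations={"Mod"})
```

A form that appears in nearly every document discriminates poorly between documents, which is one plausible motivation for a frequency threshold of this kind.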

37. A computer readable medium comprising computer readable data stored thereon, the computer readable data comprising: index data indicative of the content of documents in a document collection; and a set of abstract logical forms indicating a meta-structure of each of the documents in the collection.

38. The computer readable medium of claim 37, wherein the meta-structure of each document indicates an overall subject of the document.

39. The computer readable medium of claim 38, wherein the set of abstract logical forms is based on format information corresponding to each document.

40. The computer readable medium of claim 38, wherein the set of abstract logical forms is based on the title of a sentence in each document.

41. The computer readable medium of claim 38, wherein the set of abstract logical forms is based on the subject of a sentence in each document.
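
One way to picture the stored data of claims 37 to 41 is a per-document index entry that holds both content terms and abstract logical forms. The sketch below is an assumption-laden illustration: the `IndexEntry` layout, the `build_entry` helper, the placeholder head word "document" and the relation label "Topic" are all hypothetical, and deriving abstract forms from title words is only one of the sources the claims mention.

```python
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    """Assumed index record for one document in the collection."""
    doc_id: str
    content_terms: set = field(default_factory=set)   # index data on document content
    abstract_forms: set = field(default_factory=set)  # forms describing meta-structure


def build_entry(doc_id, title, body_terms):
    """Build an entry whose abstract forms mark each title word as a document topic."""
    entry = IndexEntry(doc_id=doc_id, content_terms=set(body_terms))
    for word in title.lower().split():
        # Hypothetical abstract logical form: (placeholder head, relation, topic word).
        entry.abstract_forms.add(("document", "Topic", word))
    return entry


entry = build_entry("d1", "Dog Training", ["leash", "puppy", "reward"])
```

Keeping the abstract forms in a separate field lets a matcher treat a hit on the document's overall subject differently from a hit on an ordinary sentence-level logical form.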

42. A computer readable medium including computer readable data stored thereon, wherein the computer readable data, when executed by a computer, causes the computer to filter documents by performing the following steps: obtaining a first set of logical forms based on the query and a selected one of the documents in the document set; obtaining a second set of logical forms based on the query and another one of the documents in the document set; modifying at least the first set of logical forms using natural language processing to obtain a first set of modified logical forms; and filtering the documents in the document set based on a predetermined relationship between the first set of modified logical forms and the second set of logical forms.

43. A method for determining similarity between first and second text inputs, comprising: obtaining a first set of logical forms based on the first text input; obtaining a second set of logical forms based on the second text input; suppressing at least a first predetermined class of logical forms in the first set of logical forms to obtain a first set of suppressed logical forms; and determining a similarity between the first and second text inputs by comparing the first set of suppressed logical forms to the second set of logical forms.

44. The method of claim 6, wherein the step of obtaining the first paraphrase logical form set is performed using a first paraphrase technique, and the step of obtaining the second paraphrase logical form set is performed using a second paraphrase technique different from the first paraphrase technique.