JPH1145266A

JPH1145266A - Document retrieval device and computer readable recording medium recorded with program for functioning computer as the device

Info

Publication number: JPH1145266A
Application number: JP9201983A
Authority: JP
Inventors: Atsushi Takato; 淳高藤
Original assignee: JustSystems Corp
Current assignee: JustSystems Corp
Priority date: 1997-07-28
Filing date: 1997-07-28
Publication date: 1999-02-16

Abstract

PROBLEM TO BE SOLVED: To automatically extract, from documents obtained by a search, noun phrases related to the search condition used in that search, and to register them in a thesaurus dictionary for searching. SOLUTION: The device selects synonyms corresponding to an input search condition 206 from a thesaurus dictionary 203 for searching, expands the search condition 206, and retrieves matching documents. The device includes a natural language processing module 200 that receives the documents retrieved by a search engine 209 on the basis of the input search condition 206 and generates a document set 204 containing a list of the noun phrases in those documents, and a related word extraction engine 211 that receives the document set 204, assigns to each noun phrase a score according to statistical information, such as frequency of appearance and distribution, in the documents in a document DB 101 and in the documents retrieved by the search engine 209, extracts the noun phrases whose scores satisfy a preset selection condition, and registers them in the thesaurus dictionary 203.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a document search device that automatically extracts, from documents obtained by a search, noun phrases related to the search condition used in that search and registers them in a thesaurus dictionary for searching, and to a computer-readable recording medium on which a program for causing a computer to function as the device is recorded.

[0002]

[Prior Art] A document search device that retrieves specific documents from a document DB (database) storing a plurality of documents generally accepts a search condition, such as a search expression or a search sentence, and retrieves from the document DB the documents that match the input search condition.

[0003] Because such a document search device searches on the basis of the input search condition, a document written with vocabulary that is related to, but not identical to, the vocabulary in the search condition does not match the input search condition, and omissions could therefore occur in the search results.

[0004] To address this, a document search device has been proposed that prevents omissions in the search results by preparing a thesaurus dictionary for searching in advance, extracting synonyms corresponding to the input search condition from the thesaurus dictionary, and performing the search with the extracted synonyms added to the input search condition.

[0005]

[Problems to Be Solved by the Invention] In the above conventional technique, however, the thesaurus dictionary for searching is created manually, so it is difficult to maintain the thesaurus dictionary in a state in which the latest synonyms are always registered. In advanced technical fields in particular, new terms appear one after another, and it has been difficult to keep collecting new terms and registering them in the thesaurus dictionary. If registration in the thesaurus dictionary is neglected, highly accurate search results cannot always be obtained, no matter how much the thesaurus dictionary is used in searching.

[0006] The present invention has been made in view of the above, and an object thereof is to reduce the labor required to maintain a thesaurus dictionary by making it possible to automatically extract, from documents obtained by a search, noun phrases related to the search condition used in that search and to register them in the thesaurus dictionary.

[0007] A further object of the present invention, also made in view of the above, is to make it possible to obtain highly accurate search results by keeping the thesaurus dictionary in a state in which the latest synonyms are always registered.

[0008]

[Means for Solving the Problems] To achieve the above objects, the document search device of claim 1 is a document search device comprising search means for selecting synonyms corresponding to an input search condition from a thesaurus dictionary for searching and retrieving matching documents from the documents to be searched on the basis of the search condition and the synonyms, the device further comprising: noun phrase list generation means for receiving the documents retrieved by the search means and extracting the noun phrases in the received documents to generate a noun phrase list; score assignment means for receiving the noun phrase list generated by the noun phrase list generation means and assigning to each noun phrase in the list a score according to statistical information, such as frequency of appearance and distribution, in the documents retrieved by the search means and in the documents to be searched; noun phrase extraction means for extracting, on the basis of the scores assigned by the score assignment means, the noun phrases whose scores satisfy a preset extraction condition; and dictionary registration means for registering the noun phrases extracted by the noun phrase extraction means in the thesaurus dictionary.

[0009] The document search device of claim 2 is the document search device of claim 1, further comprising document selection means for selecting, from the documents retrieved by the search means, the documents from which the noun phrases are to be extracted, wherein the noun phrase list generation means extracts the noun phrases from the documents selected by the document selection means to generate the noun phrase list.

[0010] The document search device of claim 3 is the document search device of claim 1 or 2, further comprising noun phrase selection means for selecting, from the noun phrases extracted by the noun phrase extraction means, the noun phrases to be registered in the thesaurus dictionary, wherein the dictionary registration means registers the noun phrases selected by the noun phrase selection means in the thesaurus dictionary.

[0011] Further, the computer-readable recording medium of claim 4 has recorded thereon a program for causing a computer to function as each means of the document search device of any one of claims 1 to 3.

[0012]

[Embodiments of the Invention] Embodiments of the document search device of the present invention, and of the computer-readable recording medium on which a program for causing a computer to function as the device is recorded, are described in detail below with reference to the accompanying drawings.

[0013] FIG. 1 is a system configuration diagram of the document search device of the present embodiment. The document search device shown in FIG. 1 comprises a plurality of clients 100 that output search conditions for retrieving desired documents, a search server 103, and a network 104 that connects the clients 100, the search server 103, and other components. The search server 103 generates an inverted file 102 from the documents in a document DB (database) 101 and includes a search engine that searches using the vector space method (for example, CLARIT by CLARITECH). It receives a search condition from a client 100, extracts synonyms corresponding to the search condition from a thesaurus dictionary prepared in advance, retrieves matching documents using the inverted file 102 on the basis of the search condition and the synonyms, and then extracts noun phrases related to the input search condition from the retrieved documents and registers them in the thesaurus dictionary.

[0014] In FIG. 1, the document DB 101 stores a plurality of documents created on the clients 100 or elsewhere. The stored documents may be of any type, such as word processor documents or structured documents in SGML, HTML, or the like. In the present embodiment the documents stored in the document DB 101 are the search targets, but the search targets are not limited to the documents in the document DB 101.

[0015] The inverted file 102 defines the relationship between the plurality of documents in the document DB 101 and a plurality of index terms extracted from each of those documents by the method described below. It thereby indicates, using a vector representation, how important a given index term is in each document, so that matching documents can be retrieved using the index terms.

[0016] Specifically, each document is divided into sub-documents, each consisting of a predetermined plurality of sentences, and the noun phrases that serve as the index terms are extracted from each sub-document. For each extracted noun phrase, statistical information such as its frequency of appearance in the sub-document and its distribution across the entire document DB 101 is obtained, and each sub-document is converted into a vector representation using the statistical information obtained for each noun phrase. A vector representation of the document is then generated from the vector representations of its sub-documents. The inverted file 102 stores the documents in the document DB 101 in this vector representation.
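
The conversion described above can be illustrated with a minimal Python sketch. The text does not fix the sub-document size, the weighting formula, or the rule for combining sub-document vectors, so the size of three sentences, the frequency times inverse-document-frequency weight, and the summation used below are all assumptions, as are the function names. Noun phrase extraction itself is assumed to have been done already.

```python
import math
from collections import Counter


def split_into_subdocuments(sentences, size=3):
    """Group consecutive sentences into sub-documents of `size` sentences.

    The value of `size` is an assumption; the text only says "predetermined".
    """
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]


def subdocument_vector(noun_phrases, doc_freq, n_docs):
    """Weight each noun phrase by its frequency in the sub-document times an
    inverse-document-frequency factor (an assumed weighting scheme)."""
    tf = Counter(noun_phrases)
    return {
        phrase: count * (math.log(n_docs / doc_freq.get(phrase, 1)) + 1.0)
        for phrase, count in tf.items()
    }


def document_vector(subdoc_vectors):
    """Sum the sub-document vectors into one document vector (assumed rule)."""
    total = Counter()
    for vec in subdoc_vectors:
        total.update(vec)  # adds the weights of matching noun phrases
    return dict(total)
```

Here `doc_freq` stands for the number of documents in the document DB that contain each noun phrase, which is one possible reading of "distribution across the entire document DB 101".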

[0017] Each index term can be weighted according to its importance in the corresponding document. The vector representation of a document may also be generated from the vector representations of its sub-documents at the time the actual search is performed.

[0018] The clients 100 and the search server 103 are each implemented by a personal computer, a workstation, or the like. FIG. 2 is a schematic block diagram showing the processing of the search server 103. The search server 103 performs processing for registering the documents in the document DB 101 in the inverted file 102, document search processing using the vector space method, and processing for extracting noun phrases related to the search condition from the retrieved documents and registering them in the thesaurus dictionary 203.

[0019] In the search server 103, the processing for registering the documents in the document DB 101 in the inverted file 102 is performed by the natural language processing module 200 and the database build component 205.

[0020] Specifically, the natural language processing module 200 receives a document from the document DB 101 and performs format recognition on it, followed by analysis such as morphological analysis, syntactic analysis, and noun phrase extraction, using a dictionary 201 that stores part-of-speech information and the like and a grammar dictionary 202 for analyzing dependency relations between words. It then generates a document set 204 containing the noun phrase list for each of the sub-documents described above.

[0021] The database build component 205 receives the document set 204 generated by the natural language processing module 200, converts each sub-document in the document set 204 into a vector representation, generates a vector representation of the document from the vector representations of the sub-documents, and registers it in the inverted file 102.

[0022] The document search processing is performed by the natural language processing module 200, the query build component 207, and the search engine 209.

[0023] Specifically, the natural language processing module 200 receives a search condition 206 from a client 100 and performs analysis such as morphological analysis, syntactic analysis, and noun phrase extraction, using the dictionary 201 that stores part-of-speech information and the like and the grammar dictionary 202 for analyzing dependency relations between words. It also extracts synonyms of the extracted noun phrases from the thesaurus dictionary 203, adds them to the noun phrases of the search condition 206, and generates a document set 204 containing the list of noun phrases of the search condition 206.
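
The synonym expansion step can be sketched as follows. This is only an illustration under stated assumptions: the thesaurus dictionary is modeled as a plain Python mapping from a noun phrase to a list of synonyms, and the function name `expand_search_condition` is hypothetical. The actual storage format of the thesaurus dictionary 203 is not described in the text.

```python
def expand_search_condition(noun_phrases, thesaurus):
    """Return the query noun phrases followed by any synonyms found in the thesaurus.

    `thesaurus` is a hypothetical mapping: noun phrase -> list of synonyms.
    """
    expanded = list(noun_phrases)
    seen = set(expanded)
    for phrase in noun_phrases:
        for synonym in thesaurus.get(phrase, []):
            if synonym not in seen:  # avoid adding the same phrase twice
                expanded.append(synonym)
                seen.add(synonym)
    return expanded
```

Noun phrases that have no entry in the thesaurus are passed through unchanged, so the search still runs on the original search condition.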

[0024] The query build component 207 receives the document set 204, obtains for each noun phrase constituting the search condition 206 statistical information such as its frequency of appearance in the search condition 206 and its distribution across the entire document DB 101, and uses that statistical information to generate a query document 208, which is the search condition 206 converted into a vector representation.

[0025] The search engine 209 receives the query document 208 generated by the query build component 207, compares the vector representation of each document in the inverted file 102 with the query document 208 (the vector representation of the search condition 206), assigns each document a score according to its similarity to the query document 208, and outputs as the search result a document list 210 of the documents whose scores exceed a predetermined threshold.

[0026] Further, in the search server 103, the processing for registration in the thesaurus dictionary 203 is performed by the natural language processing module 200 and the related word extraction engine 211.

[0027] Specifically, the natural language processing module 200 receives the document list 210 from the search engine 209 and, on the basis of the document list 210, reads the corresponding documents from the document DB 101. For all of the documents read, it performs format recognition and analysis such as morphological analysis, syntactic analysis, and noun phrase extraction, using the dictionary 201 that stores part-of-speech information and the like and the grammar dictionary 202 for analyzing dependency relations between words, and generates a document set 204 containing the noun phrase list for each sub-document.

[0028] The related word extraction engine 211 receives the document set 204 generated by the natural language processing module 200 and, for each noun phrase in the document set 204, computes statistical data such as its frequency of appearance in each document (the document set 204) and its distribution in the document DB 101 (the inverted file 102), and assigns a score to each noun phrase on the basis of the computed statistical data. It then extracts the noun phrases whose scores exceed a preset threshold as noun phrases related to the search condition 206 and registers the extracted noun phrases in the thesaurus dictionary 203 as one group.

[0029] Although FIG. 1 shows a configuration in which the document DB 101 and the inverted file 102 are each connected independently to the network 104, they may instead be connected directly to the search server 103. Likewise, although FIG. 1 shows the document search device of the present embodiment as a system built over the network 104, the processing of the client 100 and the search server 103 may be performed on a single computer.

[0030] Next, the operation of the document search device configured as described above is described in detail in the following order: (1) inverted file generation processing, (2) document search processing, and (3) registration processing for the thesaurus dictionary.

[0031] (1) Inverted file generation processing

FIG. 3 is a flowchart showing the inverted file generation processing. When a new document is registered in the document DB 101 (S301), the search server 103 reads the document and starts the processing for registering it in the inverted file 102 (S302).

[0032] In the search server 103, the natural language processing module 200 analyzes the document read in step S302 (S303). Specifically, it first determines the format of the input document, for example whether it is a word processor document or a structured document such as HTML. It then performs morphological analysis and syntactic analysis such as dependency analysis using the dictionary 201 and the grammar dictionary 202, divides the document into a plurality of sub-documents, and extracts noun phrases from each sub-document.

[0033] On the basis of the result of the processing in step S303, the natural language processing module 200 then generates a noun phrase list for each sub-document and generates a document set 204 containing the generated noun phrase lists (S304).

[0034] The database build component 205 then receives the document set 204 generated by the natural language processing module 200, generates a vector representation of the document, and registers it in the inverted file 102 (S305).

[0035] Specifically, each noun phrase of a sub-document in the document set 204 is treated as an index term of the inverted file 102, statistical information such as its frequency of appearance in the sub-document and its distribution across the entire document DB 101 is obtained, and the sub-document is converted into a vector representation using the statistical information for each noun phrase. This processing is performed for every sub-document in the document set 204, and a vector representation of the document is generated from the vector representations of the sub-documents and registered in the inverted file 102. As a result, the document newly registered in the document DB 101 is converted into a vector representation and registered in the inverted file 102.
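
One possible layout for the resulting inverted file is a mapping from each index term to the documents that contain it, together with the term's weight in each document. The sketch below builds such a mapping. The function name `build_inverted_file`, the input shape, and the weighting (frequency times a logarithmic inverse document frequency) are assumptions made for illustration and are not taken from the text.

```python
import math
from collections import Counter, defaultdict


def build_inverted_file(documents):
    """Build {index term: {doc_id: weight}} from per-document noun phrase lists.

    `documents` maps a document ID to a list of sub-documents, each of which
    is a list of noun phrases (hypothetical input shape).
    """
    n_docs = len(documents)

    # Number of documents in which each noun phrase appears at least once.
    doc_freq = Counter()
    for subdocs in documents.values():
        doc_freq.update({phrase for sub in subdocs for phrase in sub})

    inverted = defaultdict(dict)
    for doc_id, subdocs in documents.items():
        doc_vec = Counter()
        for sub in subdocs:
            # Sub-document vector, accumulated into the document vector.
            for phrase, tf in Counter(sub).items():
                doc_vec[phrase] += tf * (math.log(n_docs / doc_freq[phrase]) + 1.0)
        for phrase, weight in doc_vec.items():
            inverted[phrase][doc_id] = weight
    return dict(inverted)
```

Storing the weights by index term lets a search look up only the documents that share at least one term with the query.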

[0036] (2) Document search processing

Next, the processing for retrieving documents on the basis of the inverted file 102 generated as described above is described. FIG. 4 is a flowchart showing the document search processing.

[0037] When the search server 103 receives a search condition 206 from a client 100 (S401), the natural language processing module 200 analyzes the search condition 206 (S402). The search condition 206 is a search sentence written in natural language, but depending on the configuration of the document search device it may instead be a search expression, a set of keywords, or the like. Specifically, the natural language processing module 200 performs morphological analysis and syntactic analysis such as dependency analysis on the search condition 206 using the dictionary 201 and the grammar dictionary 202, and extracts noun phrases from the search condition 206. If synonyms of an extracted noun phrase exist in the thesaurus dictionary 203, it extracts those synonyms and adds them as noun phrases of the search condition 206.

[0038] The natural language processing module 200 then generates a document set 204 consisting of the noun phrases extracted by the analysis in step S402 (S403).

[0039] Next, the query build component 207 receives the document set 204 from the natural language processing module 200, obtains for each noun phrase constituting the document set 204 statistical information such as its frequency of appearance in the search condition 206 and its distribution across the entire document DB 101, and uses that statistical information to generate a query document 208, which is the document set 204 converted into a vector representation (S404).

[0040] The search engine 209 receives the query document 208 generated by the query build component 207, compares the vector representation of each document in the inverted file 102 with the query document 208 (the vector representation of the search condition 206), and assigns each document a score according to its similarity to the query document 208 (S405). In other words, search processing using the vector space method is performed.

[0041] The score expresses the similarity between each document and the query document 208 on the basis of the cosine distance; a document with a higher score is more similar to the query document 208.
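
Because a higher score means greater similarity, the score behaves like the cosine of the angle between the two vectors. The following minimal sketch computes that value for two sparse vectors held as Python dictionaries. The dictionary representation and the function name `cosine_similarity` are assumptions, and the scoring actually used by the search engine 209 may differ in detail.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors {term: weight}."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

With non-negative weights the result lies between 0 (no shared terms) and 1 (same direction), so a single threshold can be applied to all documents.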

[0042] The search engine 209 then selects, on the basis of a preset score threshold, the documents whose scores exceed the threshold, generates a document list 210 from the selected documents, and outputs it as the search result (S406).

[0043] The client 100 receives the document list 210 from the search server 103 and displays a list of the retrieved documents on the screen on the basis of the document list 210 (S407). By selecting a desired document from the displayed list, the user of the client 100 can display that document from the document DB 101 on the screen.

[0044] On the client 100, the documents are listed in descending order of rank. The documents are therefore displayed starting from the one most similar to the search condition 206, which gives the user a criterion for selecting documents.
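
The threshold selection of step S406 and the ranked display of step S407 can be sketched together as below. The function name `build_document_list` and the representation of the document list as (document ID, score) pairs are assumptions made for illustration.

```python
def build_document_list(scores, threshold):
    """Keep documents whose score exceeds `threshold`, highest score first.

    `scores` is a hypothetical mapping: document ID -> similarity score.
    """
    selected = [(doc_id, score) for doc_id, score in scores.items() if score > threshold]
    return sorted(selected, key=lambda item: item[1], reverse=True)
```

A document whose score equals the threshold is excluded here, matching the wording "exceed the threshold".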

[0045] (3) Registration processing for the thesaurus dictionary

Next, the processing for extracting noun phrases related to the search condition 206 from the documents retrieved in the search processing described above and registering the extracted noun phrases in the thesaurus dictionary 203 is described. FIG. 5 is a flowchart showing the processing for registering noun phrases in the thesaurus dictionary.

[0046] When the search processing described above is complete, the natural language processing module 200 receives the document list 210 from the search engine 209 (S501). On the basis of the document list 210, it then reads the corresponding documents from the document DB 101 (S502).

[0047] Having read the documents from the document DB 101, it performs, for all of the documents read, format recognition and analysis such as morphological analysis, syntactic analysis, and noun phrase extraction, using the dictionary 201 that stores part-of-speech information and the like and the grammar dictionary 202 for analyzing dependency relations between words (S503).

[0048] Then, on the basis of the result of the analysis in step S503, it generates a document set 204 containing the noun phrase list for each sub-document (S504).

[0049] The related word extraction engine 211 receives the document set 204 generated by the natural language processing module 200 and, for each noun phrase in the document set 204, computes statistical data such as its frequency of appearance in each document (the document set 204) and its distribution in the document DB 101 (the inverted file 102) (S505).

[0050] After computing the statistical data in step S505, the related word extraction engine 211 scores each noun phrase on the basis of that statistical data (S506). The score represents the importance of each noun phrase in the document 200 and its relevance to the noun phrases in the search condition 206; the higher the score, the higher the importance and relevance.

[0051] On the basis of the scoring performed in step S506, the related word extraction engine 211 extracts the noun phrases whose scores exceed a preset threshold as noun phrases related to the noun phrases in the search condition 206 (S507). Although a threshold is used here as the condition for selecting noun phrases, the noun phrases with, for example, the five highest scores may be extracted instead.
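
Steps S505 to S507 can be illustrated with the sketch below. The text does not give the scoring formula, so the one used here (frequency in the retrieved documents multiplied by the logarithm of the inverse document frequency in the document DB) is purely an assumption, chosen so that phrases frequent in the search result but rare in the DB as a whole score highly. The function names and parameters are likewise hypothetical.

```python
import math
from collections import Counter


def score_noun_phrases(retrieved_docs, db_doc_freq, n_db_docs):
    """Score each noun phrase found in the retrieved documents.

    retrieved_docs: list of noun phrase lists, one per retrieved document.
    db_doc_freq:    number of DB documents containing each noun phrase.
    The formula is an assumed example, not the one used in the source.
    """
    freq = Counter(phrase for doc in retrieved_docs for phrase in doc)
    return {
        phrase: count * math.log(n_db_docs / db_doc_freq.get(phrase, 1))
        for phrase, count in freq.items()
    }


def select_related(scores, threshold=None, top_n=None):
    """Select noun phrases by score threshold, by top-N rank, or both."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        ranked = [(phrase, s) for phrase, s in ranked if s > threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [phrase for phrase, _ in ranked]
```

Calling `select_related(scores, top_n=5)` corresponds to the alternative of taking the five highest-scoring noun phrases instead of applying a threshold.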

[0052] The related word extraction engine 211 then registers the noun phrases extracted in step S507 in the thesaurus dictionary 203 as one group (S508). As described above, the noun phrases registered in the thesaurus dictionary 203 are used, when document search processing is performed, to expand the search condition 206 as synonyms of the noun phrases in the search condition 206.
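
The registration of step S508 can be sketched as follows, assuming that the thesaurus dictionary is a mapping from a noun phrase to a list of synonyms and that "one group" means the extracted noun phrases are attached as synonyms to every noun phrase of the search condition. Both points, and the function name `register_group`, are assumptions, since the internal structure of the thesaurus dictionary 203 is not described.

```python
def register_group(thesaurus, query_phrases, related_phrases):
    """Register `related_phrases` as one synonym group for each query noun phrase.

    `thesaurus` is a hypothetical mapping: noun phrase -> list of synonyms.
    It is updated in place and also returned.
    """
    for phrase in query_phrases:
        entry = thesaurus.setdefault(phrase, [])
        for related in related_phrases:
            # Skip the phrase itself and anything already registered.
            if related != phrase and related not in entry:
                entry.append(related)
    return thesaurus
```

After registration, a later search on the same noun phrase picks up the new entries during synonym expansion, which is how the dictionary stays current without manual editing.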

【００５３】In step S508, when the extracted noun phrases are registered in the thesaurus dictionary 203, a comment, such as the intended use of the extracted noun phrases, can be registered at the same time.
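Registration as one group with an optional comment (S508), and the later use of that group to expand a search condition, can be sketched as below. The class `ThesaurusDictionary`, its fields, and the idea of keying each group by the noun phrases of the original search condition are assumptions made for illustration; the source does not describe the storage layout of the thesaurus dictionary 203.

```python
class ThesaurusDictionary:
    """Toy stand-in for the thesaurus dictionary 203 (hypothetical layout)."""

    def __init__(self):
        self.groups = []

    def register(self, head_phrases, related_phrases, comment=""):
        # One group per registration; the comment is optional.
        self.groups.append({
            "heads": set(head_phrases),
            "related": list(related_phrases),
            "comment": comment,
        })

    def expand(self, query_phrases):
        # Append the related phrases of every group that matches the query.
        expanded = list(query_phrases)
        for group in self.groups:
            if group["heads"].intersection(query_phrases):
                for phrase in group["related"]:
                    if phrase not in expanded:
                        expanded.append(phrase)
        return expanded

thesaurus = ThesaurusDictionary()
thesaurus.register(
    ["word processor"],
    ["kana-kanji conversion", "printer"],
    comment="office software searches",
)
expanded = thesaurus.expand(["word processor"])
unrelated = thesaurus.expand(["network"])
```

A query that shares no noun phrase with a registered group is returned unchanged, so registration only affects searches similar to the one that produced the group.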

【００５４】In the description of the document search process, synonyms are automatically extracted from the thesaurus dictionary 203 and added as noun phrases of the search condition 206; however, synonyms may instead be added through an operation by the user of the client 100.

【００５５】In the registration process for the thesaurus dictionary 203 described above, noun phrases are extracted from the documents in the document list 210, which is the search result, and registered in the thesaurus dictionary 203. However, the documents from which the noun phrases to be registered in the thesaurus dictionary 203 are extracted may instead be selected at the client 100. That is, from the document list displayed on the screen of the client 100 in step S407 of FIG. 4, the user selects the documents to be used for registration in the thesaurus dictionary 203, and noun phrases are extracted from the selected documents and registered in the thesaurus dictionary 203. As a result, noun phrases are extracted from search-result documents with which the user is satisfied, which prevents unnecessary noun phrases from being registered in the thesaurus dictionary 203.

【００５６】Furthermore, although the noun phrases selected in step S507 of FIG. 5 are registered in the thesaurus dictionary 203 as they are, a list of the selected noun phrases may instead be displayed on the screen of the client 100 so that the user can choose which noun phrases to register in the thesaurus dictionary 203. As a result, as in the case above, unnecessary noun phrases can be prevented from being registered in the thesaurus dictionary.

【００５７】As described above, the document search device of the present embodiment automatically extracts, from the documents obtained by a search, noun phrases related to the search condition 206 used in that search and registers them in the thesaurus dictionary 203, which reduces the labor required to manage the thesaurus dictionary 203. In addition, because the thesaurus dictionary 203 can always be kept in a state in which the latest synonyms are registered, highly accurate search results can always be obtained. In particular, search accuracy improves when a search is performed in a manner similar to a previous search.

【００５８】For the documents listed as search results, the user can feed back to the search server 103 which documents the user considers appropriate or inappropriate as search results. That is, the user can specify a positive weight, for example "＋", for a document considered appropriate as a search result, and a negative weight, for example "−", for a document considered inappropriate. As a result, when the input weight is positive, the weight of the corresponding document in the transposed file 102 is strengthened, and when the input weight is negative, the weight of the document is weakened.
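The feedback described above can be illustrated with a small sketch. The source says only that a document's weight is strengthened or weakened; the multiplicative update, the `step` parameter, the ASCII marks `"+"` and `"-"`, and the function name `apply_feedback` are all assumptions.

```python
def apply_feedback(doc_weights, doc_id, mark, step=0.1):
    """Strengthen or weaken a document weight according to the user's mark."""
    if mark == "+":
        doc_weights[doc_id] *= 1 + step   # document judged appropriate
    elif mark == "-":
        doc_weights[doc_id] *= 1 - step   # document judged inappropriate
    else:
        raise ValueError("mark must be '+' or '-'")
    return doc_weights[doc_id]

weights = {"doc1": 1.0, "doc2": 1.0}
apply_feedback(weights, "doc1", "+")
apply_feedback(weights, "doc2", "-")
```

With these assumed values, `doc1` ends up weighted above `doc2`, so it would rank higher in a later search that uses the stored weights.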

【００５９】Although the present embodiment has been described using a search based on the vector space method as an example, the search process may instead be performed by a Boolean search.
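To make the contrast between the two search styles concrete, the following sketch runs a Boolean search and a simple vector space ranking over the same toy index. The documents, the unweighted cosine similarity, and all function names are illustrative assumptions rather than details from the source.

```python
import math
from collections import Counter

docs = {
    "d1": ["word processor", "printer"],
    "d2": ["printer", "network"],
    "d3": ["word processor", "kana-kanji conversion"],
}

# Inverted index: noun phrase -> set of ids of documents containing it.
index = {}
for doc_id, phrases in docs.items():
    for phrase in phrases:
        index.setdefault(phrase, set()).add(doc_id)

def boolean_and(terms):
    """Documents containing every term."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(terms):
    """Documents containing at least one term."""
    return set().union(*(index.get(term, set()) for term in terms))

def cosine_rank(terms):
    """Rank documents by cosine similarity between query and document vectors."""
    query = Counter(terms)
    query_norm = math.sqrt(sum(v * v for v in query.values()))
    results = []
    for doc_id, phrases in docs.items():
        vec = Counter(phrases)
        dot = sum(query[t] * vec[t] for t in query)
        if dot:
            doc_norm = math.sqrt(sum(v * v for v in vec.values()))
            results.append((doc_id, dot / (query_norm * doc_norm)))
    return sorted(results, key=lambda item: item[1], reverse=True)

query = ["word processor", "printer"]
and_hits = boolean_and(query)
or_hits = boolean_or(query)
ranking = cosine_rank(query)
```

A Boolean search returns an unordered set of matches, whereas the vector space ranking orders partially matching documents by similarity, so query expansion with synonyms affects the two differently.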

【００６０】The document search device described in the present embodiment is realized by executing a program prepared in advance on a computer or workstation. This program is recorded on a computer-readable recording medium such as a hard disk, floppy disk, CD-ROM, MO, or DVD, and is executed by being read from the recording medium by the computer. The program can also be distributed via such a recording medium or via a network.

【００６１】[0061]

[Effects of the Invention] As described above, the document search device of the present invention (claim 1) is a document search device including search means that selects a synonym corresponding to an input search condition from a thesaurus dictionary for search and retrieves the corresponding documents from the search target documents based on the search condition and the synonym. The device comprises: noun phrase list generation means that receives the documents retrieved by the search means and extracts the noun phrases in those documents to generate a noun phrase list; score assignment means that receives the noun phrase list generated by the noun phrase list generation means and assigns, to each noun phrase in the list, a score according to statistical information such as the appearance frequency and distribution in the documents retrieved by the search means and in the search target documents; noun phrase extraction means that extracts, based on the scores assigned by the score assignment means, the noun phrases whose scores satisfy a preset extraction condition; and dictionary registration means that registers the noun phrases extracted by the noun phrase extraction means in the thesaurus dictionary. The labor required to manage the thesaurus dictionary can therefore be reduced. In addition, because the thesaurus dictionary can always be kept in a state in which the latest synonyms are registered, highly accurate search results can always be obtained. In particular, search accuracy improves when a search is performed in a manner similar to a previous search.

【００６２】According to the document search device of the present invention (claim 2), the document search device according to claim 1 further comprises document selection means that selects, from the documents retrieved by the search means, the documents from which noun phrases are to be extracted, and the noun phrase list generation means extracts noun phrases from the documents selected by the document selection means to generate the noun phrase list. Noun phrases are therefore extracted from search-result documents with which the user is satisfied, which prevents unnecessary noun phrases from being registered in the thesaurus dictionary.

【００６３】According to the document search device of the present invention (claim 3), the document search device according to claim 1 or 2 further comprises noun phrase selection means that selects, from the noun phrases extracted by the noun phrase extraction means, the noun phrases to be registered in the thesaurus dictionary, and the dictionary registration means registers the noun phrases selected by the noun phrase selection means in the thesaurus dictionary. Unnecessary noun phrases can therefore be prevented from being registered in the thesaurus dictionary.

【００６４】Furthermore, the computer-readable recording medium of the present invention (claim 4) records a program for causing a computer to function as each means of the document search device according to any one of claims 1 to 3. By having a computer execute this program, it is possible to realize a document search device that reduces the labor required to manage the thesaurus dictionary and that, because the thesaurus dictionary can always be kept in a state in which the latest synonyms are registered, can always obtain highly accurate search results.

[Brief description of the drawings]

FIG. 1 is a system configuration diagram of the document search device according to the present embodiment.

FIG. 2 is a schematic block diagram showing the processing of the search server in the document search device according to the present embodiment.

FIG. 3 is a flowchart showing the process of generating the transposed file in the document search device according to the present embodiment.

FIG. 4 is a flowchart showing the document search process in the document search device according to the present embodiment.

FIG. 5 is a flowchart showing the process of registering noun phrases in the thesaurus dictionary in the document search device according to the present embodiment.

[Explanation of symbols]

100 Client
101 Document DB
102 Transposed file
103 Search server
200 Natural language processing module
201 Dictionary
202 Grammar dictionary
203 Thesaurus dictionary
204 Document set
205 Database build component
206 Search condition
207 Query build component
208 Query document
209 Search engine
210 Document list
211 Related word extraction engine

Claims

[Claims]

1. A document search device including search means for selecting a synonym corresponding to an input search condition from a thesaurus dictionary for search and for retrieving a corresponding document from search target documents based on the search condition and the synonym, the document search device comprising: noun phrase list generation means for receiving the documents retrieved by the search means and extracting the noun phrases in the received documents to generate a noun phrase list; score assignment means for receiving the noun phrase list generated by the noun phrase list generation means and assigning, to each noun phrase in the received noun phrase list, a score according to statistical information such as appearance frequency and distribution in the documents retrieved by the search means and in the search target documents; noun phrase extraction means for extracting, based on the scores assigned by the score assignment means, the noun phrases whose scores satisfy a preset extraction condition; and dictionary registration means for registering the noun phrases extracted by the noun phrase extraction means in the thesaurus dictionary.

2. The document search device according to claim 1, further comprising document selection means for selecting, from the documents retrieved by the search means, the documents from which the noun phrases are to be extracted, wherein the noun phrase list generation means extracts the noun phrases from the documents selected by the document selection means to generate the noun phrase list.

3. The document search device according to claim 1 or 2, further comprising noun phrase selection means for selecting, from the noun phrases extracted by the noun phrase extraction means, the noun phrases to be registered in the thesaurus dictionary, wherein the dictionary registration means registers the noun phrases selected by the noun phrase selection means in the thesaurus dictionary.

4. A computer-readable recording medium having recorded thereon a program for causing a computer to function as each means of the document search device according to any one of claims 1 to 3.