JP2002259385A

JP2002259385A - Device, method and program for retrieving document and recording medium

Info

Publication number: JP2002259385A
Application number: JP2001054539A
Authority: JP
Inventors: Yasutsugu Ogawa; 泰嗣小川; Hiroko Mano; 博子真野
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2001-02-28
Filing date: 2001-02-28
Publication date: 2002-09-13
Anticipated expiration: 2021-02-28
Also published as: JP4049543B2

Abstract

PROBLEM TO BE SOLVED: To provide a document retrieval device that can expand retrieval conditions even when an n-gram index is used. SOLUTION: For a retrieval condition 10, a retrieval character string entered by the user through a retrieval condition input part 21, a document ranking part 22 selects a set of relevant documents from a document file 23a in a document database 23. A word extracting part 24 extracts the words in the relevant documents by morphological analysis or the like. A word ranking part 25 refers to an n-gram index 23b of the document database 23, selects words in the relevant documents according to their degree of relevance, and creates a new retrieval condition by adding those words to the original retrieval condition as related words. Under the new retrieval condition, the document ranking part 22 again selects relevant documents 30 from the document database 23, and a document output part 26 outputs them.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

TECHNICAL FIELD

The present invention relates to a document search apparatus, a document search method, a document search program, and a recording medium therefor. More specifically, it relates to a document search apparatus, a document search method, a document search program, and a computer-readable recording medium on which the program is recorded, which perform a search that selects documents relevant to a given search condition, expand the search condition with words or index units that are extracted from the relevant documents and related to the search condition, and search again with the expanded search condition.

[0002]

PRIOR ART

In a document search apparatus, to find documents that match a search condition entered by the user, it is common practice to assign a weight to each word in the search condition and, based on those weights, to compute the degree to which each document in the collection matches the search condition.

[0003] One formula for computing word weights is described in the specification of Japanese Patent Application No. H11-314442 by the present applicant (hereinafter, prior art 1). With D the number of documents to be searched (the total number of documents), d the number of documents in which each word appears (the document frequency), and k₄′ an adjustment parameter based on probability estimation (a real number greater than 0), the formula is given by equation (1) below.

[0004]

(Equation 1)

[0005] Once the weight of each word is determined, the relevance score of each document is computed from how much of each word the document contains. With tf the number of occurrences of a word in a document (the within-document frequency) and k₁ an adjustment parameter, the document relevance score is obtained by equation (2) below.

[0006]

(Equation 2)

[0007] It is also known that, after searching with the search condition entered by the user, selecting words related to that search condition from the words appearing in the relevant documents, adding them to the original search condition, and searching again makes it easier to obtain results closer to what the user wants. When related words are added in this way, the weights used for the second search are obtained by equation (3) below, using feedback information such as the frequency of occurrence in relevant and non-relevant documents.

[0008]

(Equation 3)

[0009] In equation (3), R is the number of relevant documents, r is the number of documents in the relevant document set in which the word appears, S is the number of non-relevant documents, s is the number of documents in the non-relevant document set in which the word appears, and k₅ and k₆ are adjustment parameters.

[0010] To select words related to the search condition, a relevance evaluation value TSV, used to choose which related words to take from the relevant documents, is obtained by equation (4) below, using feedback information such as within-document frequencies in relevant and non-relevant documents, with α and β as adjustment parameters.

[0011]

(Equation 4)

[0012] When searching Japanese documents, on the other hand, the question is how to build the search index. In English an index is generally built with words as the index units, but in Japanese words are not delimited by spaces, commas, periods, and the like as they are in English. Using words as index units as in English therefore requires morphological analysis or a similar technique to segment the text into words, which brings problems of analysis errors and dictionary maintenance. For this reason, a method that uses n-grams (sequences of n consecutive characters) as index units (hereinafter, an n-gram index) is used. An example of this method is the document search apparatus and recording medium previously proposed by the present applicant (hereinafter, prior art 2).
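The idea of an n-gram index can be illustrated with a minimal Python sketch. It assumes a simple in-memory layout in which each n-gram maps to the documents containing it and the character positions where it starts; the function name `build_ngram_index`, the layout, and the sample documents are hypothetical and are not taken from prior art 2.

```python
from collections import defaultdict


def build_ngram_index(docs, n=2):
    """Build a positional n-gram index: gram -> {doc_id: [start positions]}.

    Hypothetical layout for illustration only.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Slide a window of n characters over the text; no word segmentation needed.
        for pos in range(len(text) - n + 1):
            gram = text[pos:pos + n]
            index[gram].setdefault(doc_id, []).append(pos)
    return index


# Two tiny sample documents (invented for this sketch).
docs = {1: "熱帯雨林", 2: "雨林の雨"}
index = build_ngram_index(docs)

# The document frequency of a bigram is the number of documents in its entry.
print(index["雨林"])       # postings with positions
print(len(index["雨林"]))  # document frequency of "雨林"
```

Because the index is built from raw character windows, it needs no dictionary, which is the property the paragraph above relies on.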

[0013]

PROBLEMS TO BE SOLVED BY THE INVENTION

When an n-gram index is used, however, the document search apparatus has no means of segmenting text into words, so search condition expansion cannot simply be applied as described above.

[0014] The present invention has been made in view of the circumstances described above, and its object is to provide a document search apparatus, a document search method, a document search program, and a computer-readable recording medium on which the program is recorded, that make search condition expansion possible even when an n-gram index is used.

[0015] Another object of the present invention is to provide a document search apparatus, a document search method, a document search program, and a computer-readable recording medium on which the program is recorded, that use approximately computed frequencies in order to solve the problem that computing values such as the TSV takes time when words are the unit of search condition expansion.

[0016] A further object of the present invention is to provide a document search apparatus, a document search method, a document search program, and a computer-readable recording medium on which the program is recorded, that make search condition expansion possible by using n-grams, rather than words, as the elements with which the search condition is expanded.

[0017]

MEANS FOR SOLVING THE PROBLEMS

The invention of claim 1 comprises: a document database that stores documents registered by the user and includes, for searching, an n-gram index whose index units are n-grams, an n-gram being a string of n consecutive characters; a search condition input unit that obtains a search condition from the user; a document ranking unit that ranks documents according to the search condition; a word extraction unit that extracts words from the relevant and non-relevant documents output by the document ranking unit; a word ranking unit that ranks the words in the relevant documents using their statistics over the relevant documents, non-relevant documents, and registered documents, and creates a new search condition by adding some or all of the ranked words to the user's search condition; and a document output unit that outputs documents.

[0018] The invention of claim 2 is the invention of claim 1, wherein, when words are selected, the document frequency of a word, which is its number of occurrences in the registered documents, is substituted by the number of documents containing the n-grams that constitute the word.

[0019] The invention of claim 3 is the invention of claim 1, wherein, when words are selected, the document frequency of a word, which is its number of occurrences in the registered documents, is substituted by the minimum of the document counts of the n-grams that constitute the word.

[0020] The invention of claim 4 comprises: a document database that stores documents registered by the user and includes, for searching, an n-gram index whose index units are n-grams, an n-gram being a string of n consecutive characters; a search condition input unit that obtains a search condition from the user; a document ranking unit that ranks documents according to the search condition; an index unit ranking unit that ranks the n-grams in the relevant documents using their statistics over the relevant documents, non-relevant documents, and registered documents, and creates a new search condition by adding some or all of the ranked n-grams to the user's search condition; and a document output unit that outputs documents.

[0021] In the invention of claim 5, documents are ranked according to a search condition obtained from the user, by means of a document database that stores documents registered by the user and includes, for searching, an n-gram index whose index units are n-grams, an n-gram being a string of n consecutive characters, and are classified into relevant and non-relevant documents; words are extracted from the relevant and non-relevant documents classified by the ranking; the words in the relevant documents are ranked using their statistics over the relevant documents, non-relevant documents, and registered documents; a new search condition is created by adding some or all of the ranked words to the user's search condition; documents are ranked by the document database according to the new search condition; and the relevant documents are output.

[0022] The invention of claim 6 is the invention of claim 5, wherein, when words are selected, the document frequency of a word, which is its number of occurrences in the registered documents, is substituted by the number of documents containing the n-grams that constitute the word.

[0023] The invention of claim 7 is the invention of claim 5, wherein, when words are selected, the document frequency of a word, which is its number of occurrences in the registered documents, is substituted by the minimum of the document counts of the n-grams that constitute the word.

[0024] In the invention of claim 8, documents are ranked according to a search condition obtained from the user, by means of a document database that stores documents registered by the user and includes, for searching, an n-gram index whose index units are n-grams, an n-gram being a string of n consecutive characters, and are classified into relevant and non-relevant documents; the n-grams in the relevant documents are ranked using their statistics over the relevant documents, non-relevant documents, and registered documents; a new search condition is created by adding some or all of the ranked n-grams to the user's search condition; documents are ranked by the document database according to the new search condition; and a document output unit outputs the relevant documents.

[0025] The invention of claim 9 is a document search program for executing the document search method according to any one of claims 5 to 8.

[0026] The invention of claim 10 is a computer-readable storage medium on which the document search program of claim 9 is recorded.

[0027]

EMBODIMENTS OF THE INVENTION

FIG. 1 is a block diagram showing the configuration of a document search apparatus according to one embodiment of the present invention. The document search apparatus 20 of this embodiment comprises a search condition input unit 21, a document ranking unit 22, a word extraction unit 24, a word ranking unit 25, a document output unit 26, and a document database 23. The document database 23 consists of a document file 23a, which stores the documents themselves, and an n-gram index 23b, which is used for searching. At the search condition input unit 21, the user can enter a search condition 10, which serves as the search string, with a keyboard or the like. The document ranking unit 22 selects the set of documents that match the search condition 10 entered at the search condition input unit 21 (the relevant documents) from the document file 23a of the document database 23 while referring to the n-gram index. The word extraction unit 24 extracts the words in the relevant documents by morphological analysis or the like. The word ranking unit 25 refers to the n-gram index 23b of the document database 23 and ranks the words in the relevant documents using their statistics over the relevant documents, non-relevant documents, and registered documents; that is, it selects words from the relevant documents according to their relevance and creates a new search condition in which they are added to the original search condition as related words. Under this new search condition, the document ranking unit 22 again selects relevant documents 30 from the document database 23. The document output unit 26 outputs the selected relevant documents 30.

[0028] Documents are registered when a document input unit, not shown in FIG. 1, updates the document database. That is, the document is added to the document file, and the n-gram index is updated according to the document's content.

[0029] FIG. 2 is a flowchart for explaining the operation of the document search apparatus according to one embodiment of the present invention. The document search processing in the apparatus of FIG. 1 follows the flow of FIG. 2. When the user enters the search condition 10 at the search condition input unit 21 (step S1), the document ranking unit 22 weights the words in the search condition 10, ranks the documents, and selects the relevant documents (step S2). Next, the word extraction unit 24 segments those relevant documents into words (step S3). The word ranking unit 25 then ranks and weights the words in the relevant documents, selects the related words, and creates a new search condition (step S4). Based on the new search condition, the document ranking unit 22 ranks the documents again (step S5), and the relevant documents 30 are output (step S6). The document ranking with the n-gram index in steps S2 and S5 may use the method of prior art 2. In step S5, however, the word weights have already been computed in step S4, so they need not be computed again. The word segmentation of step S3 may be done by splitting the documents into words through morphological analysis. Step S4 is described in detail below.

[0030] FIG. 3 is a diagram for explaining how the n-gram index is referred to in the document search processing of the present invention. In step S4, for every word in the relevant documents, the weight of the word is obtained by referring to the n-gram index 23b and reflecting the word's occurrence in relevant and non-relevant documents, that is, the feedback information. From this weight and the feedback information, the word ranking unit then obtains, for each word in the relevant documents, its relevance TSV to the search condition. The formulas used below are those described in prior art 1 above. As shown in FIG. 3, assume that the index unit length of the n-gram index is n = 2, that there are two relevant documents and no non-relevant documents, that 雨林 ("rainforest") and アマゾン ("Amazon") each occur in only one relevant document, three times and twice respectively, and that the total number of documents is 1000. The computation of the weights and TSVs of these words is described next.

[0031] First, the following values are obtained for 雨林 ("rainforest"): D = 1000, d = 50, R = 2, r = 1, S = 0, s = 0, tf = 3. Since 雨林 coincides with an index unit, d is obtained by reading the document frequency of 雨林 from the n-gram index. Substituting these values into equation (3) gives the weight, and substituting the weight into equation (4) gives the TSV. If the adjustment parameters (k₁, k₄′, k₅, k₆, α, β) are all 1, the weight is 2.99 and the TSV is 2.24.

[0032] For アマゾン ("Amazon"), on the other hand, the following values are obtained: D = 1000, d = 75, R = 2, r = 1, S = 0, s = 0, tf = 2. The troublesome part here is obtaining d. Unlike 雨林, アマゾン is split into several index units, so d cannot be read directly from the n-gram index. The number of documents in which アマゾン appears must be determined from the occurrence information (document ID, within-document frequency, positions within the document) of アマ and ゾン. To do so, the documents in which both アマ and ゾン appear with positions offset by two characters are identified. In the example of FIG. 3, the document with ID = 1 contains アマゾン because the two units appear two characters apart, whereas the document with ID = 2 contains both units but at unrelated positions, so it is judged not to contain アマゾン. Continuing in this way yields d. The weight and TSV are computed as for 雨林: the weight is 2.57 and the TSV is 1.72.
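The position check described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the index layout (gram → {document ID: start positions}), the helper names `split_into_units` and `exact_document_frequency`, and the concrete positions in the sample postings are hypothetical; only the outcome (document 1 matches, document 2 does not) mirrors the description of FIG. 3.

```python
def split_into_units(word, n=2):
    """Cover a word with n-grams: [(offset, gram), ...].

    Units are taken every n characters; if the word length is not a
    multiple of n, a final overlapping unit covers the tail.
    """
    if len(word) < n:
        raise ValueError("word is shorter than the index unit length")
    offsets = list(range(0, len(word) - n + 1, n))
    if offsets[-1] != len(word) - n:
        offsets.append(len(word) - n)
    return [(off, word[off:off + n]) for off in offsets]


def exact_document_frequency(word, index, n=2):
    """Count documents where all units occur at the right relative offsets."""
    units = split_into_units(word, n)
    _, first_gram = units[0]
    count = 0
    for doc_id, starts in index.get(first_gram, {}).items():
        for start in starts:
            # Every later unit must begin exactly `off` characters after `start`.
            if all(start + off in index.get(gram, {}).get(doc_id, ())
                   for off, gram in units[1:]):
                count += 1
                break  # one full match is enough for this document
    return count


# Hypothetical postings: in document 1 the units are 2 characters apart,
# in document 2 they occur at unrelated positions.
index = {
    "アマ": {1: [10], 2: [3]},
    "ゾン": {1: [12], 2: [40]},
}
print(split_into_units("アマゾン"))
print(exact_document_frequency("アマゾン", index))
```

The nested loop over positions is what makes this exact variant comparatively slow: every candidate document requires a position comparison rather than a single field lookup.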

[0033] At the end of step S4, words with a high TSV are selected and added to the input search condition to generate the new search condition. Suppose the input search condition is 熱帯 ("tropical") and its weight is 4.21. With #OR as the OR operator and #WEIGHT as the operator that specifies a weight, the new search condition is then as follows.

[0034] #OR(#WEIGHT[4.21](熱帯), #WEIGHT[2.99](雨林), #WEIGHT[2.57](アマゾン))
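As a rough illustration of this last part of step S4, the Python sketch below selects the candidates with the highest TSV and renders the expanded condition with the #OR and #WEIGHT operators. The function name `build_expanded_query`, the tuple layout, and the `top_k` cutoff are assumptions made for this sketch; the text does not specify how many words are selected.

```python
def build_expanded_query(original, candidates, top_k):
    """Render an expanded search condition.

    original:   list of (term, weight) from the user's search condition
    candidates: list of (term, weight, tsv) for words from relevant documents
    top_k:      hypothetical cutoff on the number of related words to add
    """
    # Rank candidate words by TSV, highest first, and keep the top_k.
    chosen = sorted(candidates, key=lambda c: c[2], reverse=True)[:top_k]
    terms = list(original) + [(term, weight) for term, weight, _ in chosen]
    parts = ", ".join(f"#WEIGHT[{weight:.2f}]({term})" for term, weight in terms)
    return f"#OR({parts})"


# Values taken from the worked example in the text.
query = build_expanded_query(
    original=[("熱帯", 4.21)],
    candidates=[("アマゾン", 2.57, 1.72), ("雨林", 2.99, 2.24)],
    top_k=2,
)
print(query)
```

Note that the TSV is used only to choose the words; the value written into #WEIGHT is the word weight, which is why step S5 does not need to recompute weights.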

[0035] In the embodiment described above, obtaining the document frequency d requires checking positions within documents, which takes time. In another embodiment of the present invention, therefore, for words that are split into multiple index units in step S4, d is obtained without checking positions within documents, and the number of documents containing the index units is used instead.

[0036] For アマゾン ("Amazon"), for example, the document with ID = 2 contains both アマ and ゾン, so it is judged to contain アマゾン regardless of the positions at which they occur. Continuing this process yields d.
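A minimal Python sketch of this approximation is given below, assuming the index maps each n-gram to the documents containing it. The function name `approx_df_all_units` and the sample postings are hypothetical; positions are kept in the sample data only to show that they are ignored.

```python
def approx_df_all_units(units, index):
    """Approximate a word's document frequency by the number of documents
    that contain every one of its index units, ignoring positions."""
    doc_sets = [set(index.get(unit, {})) for unit in units]
    if not doc_sets:
        return 0
    return len(set.intersection(*doc_sets))


# Hypothetical postings: gram -> {doc_id: [start positions]}.
index = {
    "アマ": {1: [10], 2: [3]},
    "ゾン": {1: [12], 2: [40]},
}
# Document 2 is counted even though its units are not adjacent.
print(approx_df_all_units(["アマ", "ゾン"], index))
```

The result can overestimate the true document frequency, since documents containing the units at unrelated positions are counted as well.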

[0037] In yet another embodiment of the present invention, as a way of making the document search processing of the embodiment with the modified step S4 even faster, for words that are split into multiple index units in step S4, the minimum of the document frequencies of the index units is used in place of d.

[0038] For アマゾン ("Amazon"), for example, d is taken to be 100, the minimum of the document frequency of アマ (200) and that of ゾン (100).
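This variant reduces to a single lookup per index unit, as the short Python sketch below shows. The function name `approx_df_min` and the dictionary holding document frequencies are assumptions for illustration; the numbers are the ones given in the text.

```python
def approx_df_min(units, doc_freq):
    """Approximate a word's document frequency by the smallest document
    frequency among its index units."""
    return min(doc_freq[unit] for unit in units)


# Document frequencies from the example in the text.
doc_freq = {"アマ": 200, "ゾン": 100}
print(approx_df_min(["アマ", "ゾン"], doc_freq))
```

The minimum is a natural upper bound here: a document containing the whole word must contain each of its units, so the word cannot appear in more documents than its rarest unit does.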

[0039] FIG. 4 is a block diagram showing the configuration of a document search apparatus according to another embodiment of the present invention. The document search apparatus 20′ of this embodiment differs from the document search apparatus 20 of the embodiment described with FIG. 1 in that it has no word extraction unit 24 and in that the word ranking unit 25 is replaced by an index unit ranking unit 27.

[0040] That is, the document search apparatus 20′ comprises a search condition input unit 21, a document ranking unit 22, an index unit ranking unit 27, a document output unit 26, and a document database 23. The document database 23 consists of a document file 23a, which stores the documents themselves, and an n-gram index 23b, which is used for searching. At the search condition input unit 21, the user can enter a search condition 10, which serves as the search string, with a keyboard or the like. The document ranking unit 22 selects the set of documents that match the search condition 10 entered at the search condition input unit 21 (the relevant documents) from the document file 23a of the document database 23 while referring to the n-gram index. The index unit ranking unit 27 uses the n-gram index 23b of the document database 23 to rank the n-grams in the relevant documents using their statistics over the relevant documents, non-relevant documents, and registered documents; that is, it ranks and weights the index units in the relevant documents, selects the related index units, and creates a new search condition. Under this new search condition, the document ranking unit 22 again selects relevant documents 30′ from the document database 23. The document output unit 26 outputs the selected relevant documents 30′. Documents are registered when a document input unit, not shown in FIG. 4, updates the document database. That is, the document is added to the document file, and the n-gram index is updated according to the document's content.

[0041] FIG. 5 is a flowchart for explaining the operation of the document search apparatus according to another embodiment of the present invention. The document search processing in the apparatus of FIG. 4 follows the flow of FIG. 5. Compared with the processing described with FIG. 2, there is no word extraction corresponding to step S3 of FIG. 2, and the operation of step S13 (which corresponds to step S4 of FIG. 2) is different. That is, when the user enters the search condition 10 at the search condition input unit 21 (step S11), the document ranking unit 22 weights the words in the search condition 10, ranks the documents, and selects the relevant documents (step S12). Next, the index unit ranking unit 27 ranks and weights the index units in the relevant documents, selects the related index units, and creates a new search condition (step S13). Based on the new search condition, the document ranking unit 22 ranks the documents again (step S14), and the relevant documents 30′ are output (step S15). The document ranking with the n-gram index in steps S12 and S14 may use the method of prior art 2. Step S13 is described in detail below.

[0042] In step S13, for every index unit in the relevant documents, the weight of the index unit (that is, of the n-gram) is obtained by referring to the n-gram index and reflecting its occurrence in relevant and non-relevant documents, that is, the feedback information. From this weight and the feedback information, the index unit ranking unit 27 then obtains, for each index unit in the relevant documents, its relevance TSV to the search condition.

[0043] The difference from the embodiment described with reference to FIG. 1 is that the objects being ranked are now index units. Consequently, as in the embodiment of FIG. 1, 「雨林」 ("rainforest") is a ranking target, whereas 「アマゾン」 ("Amazon") is not; instead, the index units it contains, 「アマ」, 「マゾ」, and 「ゾン」, become the targets. The weight and TSV of an index unit may be calculated in exactly the same way as for 「雨林」 in the embodiment of FIG. 1.
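The example above corresponds to splitting text into overlapping n-grams. The following sketch assumes n = 2, which matches the two-character units in the example; the function name `index_units` is hypothetical.

```python
def index_units(text, n=2):
    """Return the overlapping n-character index units of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

With n = 2, `index_units("アマゾン")` yields the three units 「アマ」, 「マゾ」, and 「ゾン」, while `index_units("雨林")` yields the single unit 「雨林」, which is why that word is a ranking target in both embodiments.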

[0044] According to this embodiment, the weights and TSVs calculated when creating a new search condition are all for index units, so each document frequency is obtained simply by reading the document frequency field of the n-gram. Creation of the search condition is therefore faster.
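The speed-up comes from the document frequency being stored with each n-gram entry, so no counting is needed at query time. The sketch below assumes a simple in-memory index; the `Posting` layout, the function names, and n = 2 are illustrative assumptions, not the index format of this document.

```python
from dataclasses import dataclass, field

@dataclass
class Posting:
    document_frequency: int = 0                   # number of documents containing the n-gram
    doc_ids: list = field(default_factory=list)   # identifiers of those documents

def build_index(docs, n=2):
    """Build a toy n-gram index that keeps a document frequency field per entry."""
    index = {}
    for doc_id, text in enumerate(docs):
        # Use a set so each n-gram is counted once per document.
        for gram in {text[i:i + n] for i in range(len(text) - n + 1)}:
            entry = index.setdefault(gram, Posting())
            entry.doc_ids.append(doc_id)
            entry.document_frequency += 1
    return index

def document_frequency(index, gram):
    """Read the stored field; n-grams absent from the index have frequency 0."""
    entry = index.get(gram)
    return entry.document_frequency if entry else 0
```

Here `document_frequency` is a single dictionary lookup, whereas obtaining the frequency of an arbitrary word from an n-gram index would require combining the entries of several n-grams.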

[0045] Each embodiment of the present invention has been described above as a document search device. As is apparent from the description of the operation of the document search device, however, the present invention can also take the form of a document search method. Furthermore, the present invention can take the form of a document search program for causing a computer to execute the document search method, and of a computer-readable recording medium on which that document search program is recorded.

[0046]

[Effects of the Invention] According to the present invention, providing a word extraction unit (word segmentation unit) makes it possible to expand the search condition even when an n-gram index is used.

[0047] According to the present invention, using approximately calculated frequencies simplifies the calculation of the weight and TSV of each candidate word and speeds up expansion of the search condition.
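The approximation can be illustrated with the variant recited in claim 3, in which the document frequency of a word is replaced by the minimum document frequency among the n-grams that constitute it. The sketch assumes n = 2 and a plain mapping from n-gram to document frequency; the function name and the mapping `ngram_df` are hypothetical.

```python
def approx_word_document_frequency(word, ngram_df, n=2):
    """Approximate a word's document frequency by the minimum over its n-grams."""
    grams = [word[i:i + n] for i in range(len(word) - n + 1)]
    if not grams:
        return 0  # word shorter than n characters: no index unit to look up
    # N-grams missing from the index are treated as occurring in no document.
    return min(ngram_df.get(g, 0) for g in grams)
```

Every document that contains the word also contains each of its n-grams, so the minimum is an upper bound on the true document frequency, and it is obtained without scanning any document.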

[0048] According to the present invention, the elements added in search condition expansion are n-grams rather than words, so search condition expansion becomes possible at even higher speed and without a word extraction unit.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a document search device according to an embodiment of the present invention.

FIG. 2 is a flowchart explaining the operation of the document search device according to the embodiment of the present invention.

FIG. 3 is a diagram explaining how the n-gram index is referenced in document search processing according to the present invention.

FIG. 4 is a block diagram showing the configuration of a document search device according to another embodiment of the present invention.

FIG. 5 is a flowchart explaining the operation of the document search device according to the other embodiment of the present invention.

[Explanation of symbols]

10: search condition; 20, 20': document search device; 21: search condition input unit; 22: document ranking unit; 23: document database; 23a: document file; 23b: n-gram index; 24: word extraction unit; 25: word ranking unit; 26: document output unit; 27: index unit ranking unit; 30, 30': relevant documents.

Claims

[Claims]

1. A document search device comprising: a document database that stores documents registered by a user and includes an n-gram index for searching whose index unit is an n-gram, a sequence of n consecutive characters; a search condition input unit that obtains a search condition from the user; a document ranking unit that ranks documents according to the search condition; a word extraction unit that extracts words from the relevant documents / non-relevant documents output by the document ranking unit; a word ranking unit that ranks the words using statistical information on the words in the relevant documents / non-relevant documents / registered documents, and creates a new search condition by adding some or all of the ranked words to the user's search condition; and a document output unit that outputs documents.

2. The document search device according to claim 1, wherein, when words are selected, the document frequency of a word, which is the number of occurrences of the word in the registered documents, is substituted by the number of documents containing the n-grams that constitute the word.

3. The document search device according to claim 1, wherein, when words are selected, the document frequency of a word, which is the number of occurrences of the word in the registered documents, is substituted by the minimum of the document counts of the n-grams that constitute the word.

4. A document search device comprising: a document database that stores documents registered by a user and includes an n-gram index for searching whose index unit is an n-gram, a sequence of n consecutive characters; a search condition input unit that obtains a search condition from the user; a document ranking unit that ranks documents according to the search condition; an index unit ranking unit that ranks the n-grams in the relevant documents using statistical information on those n-grams in the relevant documents / non-relevant documents / registered documents, and creates a new search condition by adding some or all of the ranked n-grams to the user's search condition; and a document output unit that outputs documents.

5. A document search method using a document database that stores documents registered by a user and includes an n-gram index for searching whose index unit is an n-gram, a sequence of n consecutive characters, the method comprising: ranking documents according to a search condition obtained from the user and classifying them into relevant documents and non-relevant documents; extracting words from the non-relevant documents; ranking the words in the relevant documents using statistical information on those words in the relevant documents / non-relevant documents / registered documents; creating a new search condition by adding some or all of the ranked words to the user's search condition; ranking the documents in the document database according to the new search condition; and outputting the relevant documents.

6. The document search method according to claim 5, wherein, when words are selected, the document frequency of a word, which is the number of occurrences of the word in the registered documents, is substituted by the number of documents containing the n-grams that constitute the word.

7. The document search method according to claim 5, wherein, when words are selected, the document frequency of a word, which is the number of occurrences of the word in the registered documents, is substituted by the minimum of the document counts of the n-grams that constitute the word.

8. A document search method using a document database that stores documents registered by a user and includes an n-gram index for searching whose index unit is an n-gram, a sequence of n consecutive characters, the method comprising: ranking documents according to a search condition obtained from the user and classifying them into relevant documents and non-relevant documents; ranking the n-grams in the relevant documents using statistical information on those n-grams in the relevant documents / non-relevant documents / registered documents; creating a new search condition by adding some or all of the ranked n-grams to the user's search condition; ranking the documents in the document database according to the new search condition; and outputting the relevant documents.

9. A document search program for executing the document search method according to claim 5.

10. A computer-readable storage medium on which the document search program according to claim 9 is recorded.