JP2003150636A

JP2003150636A - Document retrieval device, document retrieval method, program, and recording medium

Info

Publication number: JP2003150636A
Application number: JP2001347048A
Authority: JP
Inventors: Hideo Ito; 秀夫伊東
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2001-11-13
Filing date: 2001-11-13
Publication date: 2003-05-23

Abstract

PROBLEM TO BE SOLVED: To provide a document retrieval device that obtains appropriate results for a retrieval request and is less prone than conventional devices to both retrieval omissions and excessive retrieval. SOLUTION: For a plurality of documents consisting of character strings written in natural language, the document retrieval device obtains the occurrence set of character strings identical to the notation of each word extracted from the search condition and, at the same time, the occurrence set in which the same word as the extracted word occurs as a word. From these occurrence sets it obtains an occurrence degree for each word. It then obtains a satisfaction degree from the occurrence degrees of all words extracted from the search condition, and selects and outputs the result documents according to that satisfaction degree, so that appropriate results are obtained for the search condition.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a document retrieval device, a document retrieval method, a program, and a recording medium for retrieving a desired document from digitized, accumulated document information. More specifically, it relates to a document retrieval device and a document retrieval method that, for each word in a search condition and each document in a document group, select documents based on a satisfaction degree calculated from the degree to which the word occurs in the document as a character string and the degree to which it occurs as a word, and to a program for executing the functions of the document retrieval device and a computer-readable recording medium on which that program is recorded.

[0002]

[Prior Art] The following known techniques retrieve similar documents from an input document. The "Similar document retrieval system and recording medium used therefor" of Japanese Unexamined Patent Application Publication No. H11-73422 extracts and updates word groups and their frequency information from all document texts and holds them in an index. By referring to and processing the maximum information obtainable from document text alone, it obtains search results useful to the user while taking into account the inclusion relations of important concepts between documents. It also provides an efficient similar-document retrieval system that omits redundant computation by considering the characteristics of the input document and the search results when calculating inter-document similarity.

[0003] The "Document retrieval device" of Japanese Unexamined Patent Application Publication No. H11-73429 guarantees fast full-text search with no omissions for any search-condition character string. Even when the user inputs a search condition containing a character string not registered in the dictionary (an unknown word), it retrieves every document containing that string. It also uses accurate string frequency information to calculate the similarity between each document and the search condition and to rank documents without loss of precision. It can therefore be used for efficient, high-precision retrieval of document information.

[0004]

[Problem to Be Solved by the Invention] Document retrieval systems have been proposed that, for a search condition or document expressed in natural language text, determine the degree to which each document in a document group satisfies the search condition (hereinafter, the satisfaction degree) and return as the search result either the documents with a high satisfaction degree or the documents ranked in descending order of satisfaction degree. Examples of retrieval methods using such a satisfaction degree can be found in the above-mentioned publications No. H11-73422 and No. H11-73429. Using these examples, the conventional methods of calculating the satisfaction degree R(Q, D) from a request sentence or document expressing the search condition (hereinafter, search condition Q) and a document to be searched (hereinafter, document D) are described below.

[0005] Method 1: A word group W is extracted from the search condition Q. For each word w in W, the occurrence frequency SF(w, D) is obtained, that is, the total number of times a character string identical to the notation of w occurs in document D. The larger the sum ΣSF(w, D), the larger the satisfaction degree R(Q, D) is taken to be.

[0006] Method 2: A word group W is extracted from the search condition Q. For each word w in W, the occurrence frequency WF(w, D) is obtained, that is, the total number of times w occurs in document D as a word. The larger the sum ΣWF(w, D), the larger the satisfaction degree R(Q, D) is taken to be.

[0007] These methods are explained using the following example. Suppose there are two documents (D1, D2):
D1: 知識と演繹 ("knowledge and deduction")
D2: 東京都 ("Tokyo Metropolis")
and two search conditions (Q1, Q2):
Q1: 機械の知 ("machine intelligence")
Q2: 京都 ("Kyoto")

[0008] Extracting word groups from these documents (D1, D2) and search conditions (Q1, Q2) gives the following (constituent words are separated by "／"):
D1: 知識／と／演繹
D2: 東京／都
Q1: 機械／の／知
Q2: 京都

[0009] Calculating the occurrence frequencies SF(w, D) and WF(w, D) for each word w in the word groups extracted from the search conditions (Q1, Q2) gives the following.

[0010] For search condition Q1:
SF(機械, D1) = 0, SF(の, D1) = 0, SF(知, D1) = 1
WF(機械, D1) = 0, WF(の, D1) = 0, WF(知, D1) = 0
SF(機械, D2) = 0, SF(の, D2) = 0, SF(知, D2) = 0
WF(機械, D2) = 0, WF(の, D2) = 0, WF(知, D2) = 0

[0011] For search condition Q2:
SF(京都, D1) = 0
WF(京都, D1) = 0
SF(京都, D2) = 1
WF(京都, D2) = 0

[0012] Here the satisfaction degree R(Q, D) is taken to be the sum of occurrence frequencies ΣSF (Method 1) or ΣWF (Method 2), and the documents for which R(Q, D) is greater than 0 are returned as the search result. In the example above, the search results for search condition Q1 are:
Method 1: document D1
Method 2: none
and the search results for search condition Q2 are:
Method 1: document D2
Method 2: none
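The two conventional methods can be sketched in a few lines of Python to reproduce the results above. This is only an illustration under stated assumptions: the helper names `sf`, `wf`, and `retrieve` are hypothetical, and the documents are supplied already segmented, using the word boundaries given in paragraph [0008].

```python
# Sketch of conventional Method 1 (SF) and Method 2 (WF).
# Assumption: documents arrive pre-segmented as in paragraph [0008].

def sf(word, text):
    """Count occurrences of `word` as a character string (overlaps allowed)."""
    count, start = 0, text.find(word)
    while start != -1:
        count += 1
        start = text.find(word, start + 1)
    return count


def wf(word, tokens):
    """Count occurrences of `word` as a whole word in a segmented document."""
    return sum(1 for token in tokens if token == word)


DOCS = {"D1": ["知識", "と", "演繹"], "D2": ["東京", "都"]}
QUERIES = {"Q1": ["機械", "の", "知"], "Q2": ["京都"]}


def retrieve(query_words, method):
    """Return IDs of documents whose summed frequency is greater than 0."""
    hits = []
    for doc_id, tokens in DOCS.items():
        if method == "SF":
            score = sum(sf(w, "".join(tokens)) for w in query_words)
        else:
            score = sum(wf(w, tokens) for w in query_words)
        if score > 0:
            hits.append(doc_id)
    return hits
```

Under these assumptions, `retrieve(QUERIES["Q2"], "SF")` returns D2 (excessive retrieval) while `retrieve(QUERIES["Q1"], "WF")` returns nothing (retrieval omission).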

[0013] That is, calculation Method 1, which bases the satisfaction degree on how often the character string of a word's notation occurs in a document, retrieves document D1 (知識と演繹) for search condition Q1 (機械の知), so there is no retrieval omission. However, it also retrieves document D2 (東京都) for search condition Q2 (京都), which is a case of excessive retrieval.

[0014] In contrast, calculation Method 2, which bases the satisfaction degree on how often a word occurs in a document as a word, does not retrieve document D1 (知識と演繹) for search condition Q1 (機械の知), which is a retrieval omission. However, it does not retrieve document D2 (東京都) for search condition Q2 (京都), so there is no excessive retrieval.

[0015] Consequently, conventional Method 1 and Method 2 are each prone to one of the two problems, retrieval omission or excessive retrieval, and often fail to produce appropriate search results for a search condition.

[0016] The present invention has been made in view of the above circumstances. Its object is to provide a document retrieval device, a document retrieval method, a program for executing the functions of the document retrieval device, and a computer-readable recording medium on which that program is recorded, which are less prone than the conventional methods to both retrieval omission and excessive retrieval and which obtain appropriate search results for a search condition.

[0017]

[Means for Solving the Problem] To solve the above problem, the document retrieval device of claim 1 of the present invention comprises: an input unit for inputting a request such as a search condition; a document group storage unit that stores a plurality of documents consisting of character strings written in natural language; a word extraction unit that extracts words from the search condition and from the documents; a character string search unit that obtains, for each document, the occurrence set of character strings identical to the notation of a word extracted from the search condition; a word search unit that obtains, among the words extracted from the document, the occurrence set in which the same word as the word extracted from the search condition occurs; an occurrence degree estimation unit that obtains, for each word extracted from the search condition, an occurrence degree based on the occurrence sets obtained from the character string search unit and the word search unit; a satisfaction degree calculation unit that obtains a satisfaction degree based on the occurrence degrees obtained for all words extracted from the search condition; and an output unit that selects and outputs the result documents according to the satisfaction degrees obtained for all documents stored in the document group storage unit.

Claim 2 of the present invention is characterized in that, in the document retrieval device of claim 1, each element of the occurrence set of the character string search unit, obtained for each word extracted from the search condition, is assigned the value 1 if the element is present in the occurrence set of the word search unit and a predetermined value otherwise, and the sum over all elements is taken as the occurrence degree.

Claim 3 of the present invention is characterized in that, in the document retrieval device of claim 1 or 2, the sum of the occurrence degrees obtained for each word extracted from the search condition is taken as the satisfaction degree.

The document retrieval method of claim 4 of the present invention is characterized in that, for each word present in a search condition or the like, the occurrence set in which the word occurs as a character string and the occurrence set in which it occurs as a word are obtained for each document in the document group, an occurrence degree for the word is obtained from these occurrence sets, and the documents to be returned as the search result are selected based on the satisfaction degree of each document obtained from the occurrence degrees.

The program of claim 5 of the present invention is a program for causing a computer to function as the document retrieval device of claim 1, 2, or 3. The recording medium of claim 6 of the present invention is a computer-readable recording medium on which the document retrieval program of claim 5 is recorded.

[0018] With this configuration, for each word in the search condition and each document in the document group, the device does not use the raw frequency with which the character string or the word occurs in the document. Instead, it obtains the occurrence degree of the word in the document from the occurrence set in which the word occurs as a character string and the occurrence set in which it occurs as a word, and selects the result documents based on the satisfaction degree obtained from those occurrence degrees. As a result, both retrieval omission and excessive retrieval are less likely than with the conventional methods, and appropriate search results are obtained for the search condition.

[0019]

[Embodiments of the Invention] A document retrieval device according to one embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a functional block diagram of the document retrieval device of this embodiment. As shown in FIG. 1, the device includes a control unit 10, an input unit 20, a document group storage unit 30, a word extraction unit 40, a character string search unit 50, a word search unit 60, an occurrence degree estimation unit 70, a satisfaction degree calculation unit 80, and an output unit 90.

[0020] The control unit 10 controls the overall process from input of the search condition to output of the search result. The input unit 20 accepts a search condition sentence, a document, or the like from the user and stores it temporarily. The document group storage unit 30 stores a plurality of documents to be searched, each consisting of text written in natural language. Processing can be accelerated by building an index of the document group in advance for character string search and word search. This index stores each character string or word in association with the positions at which it occurs in the documents (the position of its first character).
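As a rough illustration of such an index, the sketch below maps each word to the positions of its first character. The patent does not specify the index layout, so the function name `build_word_index` and the (document ID, position) pairs are assumptions made for this example; positions are 1-based offsets, and documents are assumed to be supplied already segmented into words.

```python
from collections import defaultdict


def build_word_index(doc_id, tokens, index=None):
    """Add one segmented document to a word -> [(doc_id, position)] index.

    `position` is the 1-based offset of the word's first character.
    """
    if index is None:
        index = defaultdict(list)
    position = 1
    for token in tokens:
        index[token].append((doc_id, position))
        position += len(token)  # next word starts right after this one
    return index


index = build_word_index("D1", ["知識", "と", "演繹"])
index = build_word_index("D2", ["東京", "都"], index)
```

A lookup such as `index["と"]` then yields the word's occurrences without rescanning the document text. An index for character string search could be built along similar lines, for example over character n-grams, but that design is likewise an assumption.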

[0021] The word extraction unit 40 extracts words from the search condition input through the input unit 20 and from the documents in the document group storage unit 30. Words can be extracted either by splitting the text at boundaries between character types or by morphological analysis using a word dictionary. When the document group storage unit 30 has an index, word extraction from the documents is unnecessary.
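Splitting at character-type boundaries can be sketched as follows. The Unicode ranges used to classify characters and the names `char_type` and `split_by_char_type` are assumptions for illustration; the patent does not say how character types are determined.

```python
from itertools import groupby


def char_type(ch):
    """Classify a character by script, using common Unicode block ranges."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alpha"
    return "other"


def split_by_char_type(text):
    """Split text wherever the character type changes."""
    return ["".join(group) for _, group in groupby(text, key=char_type)]
```

With this sketch, 機械の知 splits into 機械／の／知. A run of a single character type such as 東京都 stays whole, so the segmentation 東京／都 used in the example would need the dictionary-based morphological analysis that the text names as the alternative.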

[0022] For the documents stored in the document group storage unit 30, the control unit 10 calls the character string search unit 50 for each word extracted from the search condition by the word extraction unit 40 and obtains the occurrence set of the character string. At the same time, it calls the word search unit 60 for each word extracted from the search condition and obtains the occurrence set of the word. Processing is faster if the character string search unit 50 and the word search unit 60 are called simultaneously rather than separately. Next, the control unit 10 calls the occurrence degree estimation unit 70 for each word to obtain the occurrence degree, and calls the satisfaction degree calculation unit 80 to obtain the satisfaction degree of the document for the search condition. This satisfaction degree is calculated for every document in the document group storage unit 30. The control unit 10 then calls the output unit 90, which outputs the search result with the documents ordered by satisfaction degree.

[0023] For one word extracted from the search condition, the character string search unit 50 checks whether a character string identical to the character string of that word occurs in the document. If it does, the occurrence set is formed from the start positions at which the string occurs (offsets from the beginning of the document, counting from 1). If it does not, the occurrence set is the empty set. This processing can be accelerated by storing an index for character string search in the document group storage unit 30.
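A minimal sketch of the character string search, assuming a plain linear scan with Python's `str.find` rather than an index. The function name `string_occurrences` is hypothetical.

```python
def string_occurrences(word, text):
    """Return the set of 1-based start positions where `word` occurs in `text`."""
    positions = set()
    if not word:
        return positions  # an empty search string matches nothing here
    start = text.find(word)
    while start != -1:
        positions.add(start + 1)            # convert 0-based index to 1-based offset
        start = text.find(word, start + 1)  # step by one so overlapping matches count
    return positions
```

For the running example, 知 in 知識と演繹 gives {1}, 知 in 東京都 gives the empty set, and 京都 in 東京都 gives {2}.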

[0024] For one word extracted from the search condition, the word search unit 60 checks whether that word occurs in the document as a word. For this purpose, the document is split into words by the word extraction unit 40 using character types or morphological analysis. If the word occurs, the occurrence set is formed from the start positions at which the character string of the word occurs (offsets from the beginning of the document, counting from 1). If it does not, the occurrence set is the empty set. This processing can be accelerated by storing an index for word search in the document group storage unit 30.
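The word search can be sketched in the same style. The example assumes the document has already been segmented into a list of words, and the function name `word_occurrences` is hypothetical.

```python
def word_occurrences(word, tokens):
    """Return the set of 1-based start positions where `word` occurs as a whole word.

    `tokens` is the document already split into words, in order.
    """
    positions = set()
    offset = 1  # 1-based offset of the current token's first character
    for token in tokens:
        if token == word:
            positions.add(offset)
        offset += len(token)
    return positions
```

Because 知 is only part of the word 知識 in D1, `word_occurrences("知", ["知識", "と", "演繹"])` is empty even though the character string search finds a match at position 1.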

[0025] The occurrence degree estimation unit 70 calculates the occurrence degree of one word from the character string occurrence set and the word occurrence set obtained for each word extracted from the search condition, as follows. Let the occurrence set S of character strings in document D for one word w be
S = {s(1), s(2), ..., s(n)}
where s(i) is the start position of the i-th occurrence of w in D as a character string. Let the occurrence set W of words in document D for the word w be
W = {w(1), w(2), ..., w(m)}
where w(i) is the start position of the i-th occurrence of w in D as a word. In general, m ≤ n, since every occurrence as a word is also an occurrence as a character string. The occurrence degree E(w, D) of word w in document D is defined as follows.

[0026] When the character string occurrence set S is empty, E(w, D) = 0. When S is not empty, the unit occurrence degree u(s(i)) is obtained for each s(i) (i = 1, ..., n) as follows:
u(s(i)) = 1 if s(i) is an element of the word occurrence set W
u(s(i)) = e otherwise, where 0 < e < 1
(e is a predetermined value, for example 0.1). The occurrence degree E(w, D) of word w in document D is then defined as the sum of the unit occurrence degrees:
E(w, D) = Σ u(s(i))
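The definition of E(w, D) translates almost directly into code. In this sketch the function name `occurrence_degree` is an assumption, and the default e = 0.1 is the example value given in the text.

```python
def occurrence_degree(string_set, word_set, e=0.1):
    """Compute E(w, D) from the string occurrence set S and word occurrence set W.

    Each string occurrence contributes 1 if it is also a word occurrence,
    and the predetermined value e (0 < e < 1) otherwise.
    """
    if not 0 < e < 1:
        raise ValueError("e must satisfy 0 < e < 1")
    # sum() over an empty set is 0, which covers the case where S is empty
    return sum(1.0 if pos in word_set else e for pos in string_set)
```

For 知 in D1, S = {1} and W is empty, so the result is 0.1. A string that matches at three positions, one of which is a word occurrence, would score 1 + 0.1 + 0.1.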

[0027] The satisfaction degree calculation unit 80 calculates the satisfaction degree of a document from the occurrence degrees of all words extracted from the search condition. For example, the satisfaction degree R(Q, D) of document D is obtained as the sum of the occurrence degrees E(w, D) calculated for all words w extracted from the search condition Q:
R(Q, D) = Σ E(w, D)

[0028] The output unit 90 selects the documents to be returned as the search result according to the satisfaction degree calculated for each document and outputs them to a display device, a printer, or the like. The selection is either the set of documents with a nonzero satisfaction degree or the set of documents arranged in descending order of satisfaction degree. For display, the retrieved set may be cut off at a fixed number of documents, or only the documents whose satisfaction degree is at or above a fixed value may be selected.

[0029] FIG. 2 is a flowchart explaining the procedure of the document retrieval process of this embodiment. For this explanation, the document group storage unit 30 is assumed to store the documents D1 and D2 introduced above:
D1: 知識と演繹
D2: 東京都

[0030] First, the control unit 10 uses the input unit 20 to accept a search condition from the searcher (step S01). The search condition is stored temporarily in a search condition buffer or the like. For the explanation below, the search condition Q is "機械の知".

[0031] Next, the control unit 10 uses the word extraction unit 40 to extract words from the search condition Q (step S02). Here the search condition is split into words at character-type boundaries. The search condition Q is split as
機械／の／知
which yields the three words {機械, の, 知}.

[0032] Next, for each document D stored in the document group storage unit 30, the control unit 10 performs the following processing on each word w to obtain the occurrence degree of w.

[0033] One document is taken from the document group storage unit 30 (step S03). This document is called D. Then one of the words extracted earlier from the search condition by the word extraction unit 40 is taken (step S04). This word is called w.

[0034] The character string search unit 50 obtains the occurrence set in which word w occurs in document D as a character string (step S05). For example, the character string of w is compared against D one character at a time from the beginning of D to the end, and whenever the two strings match, the start position of the match (the offset from the beginning, counting from 1) is registered as an element of the occurrence set. For example, word w = "知" and document D1 = "知識と演繹" yield the occurrence set {1}. Word w = "知" and document D2 = "東京都" yield the empty set {}.

[0035] Next, the word search unit 60 obtains the occurrence set in which word w occurs in document D as a word (step S06). For example, the word extraction unit 40 is used to split document D into words by character type or morphological analysis. Document D is thereby split into words, as in
知識／と／演繹
The segmented document D is then compared against word w one character at a time from beginning to end, and whenever the two strings match and word boundaries lie immediately before and after the match, the start position (the offset from the beginning, counting from 1) is registered in the occurrence set. For example, word w = "知" and document D1 = "知識と演繹" yield the empty set {} as the occurrence set. Likewise, word w = "知" and document D2 = "東京都" yield the empty set {}.

[0036] Next, the occurrence degree estimation unit 70 is used to obtain the occurrence degree of word w in document D (step S07). The character string search unit 50 described above has produced the occurrence set S of character strings in document D for word w, written as
S = {s(1), s(2), ..., s(n)}
where s(i) is the start position of the i-th occurrence of w in D as a character string. The word search unit 60 described above has produced the occurrence set W of words in document D for word w, written as
W = {w(1), w(2), ..., w(m)}
where w(i) is the start position of the i-th occurrence of w in D as a word. From these occurrence sets, the occurrence degree E(w, D) of word w in document D is calculated as the sum of the unit occurrence degrees u(s(i)) defined below.

[0037] When the character string occurrence set S is empty, E(w, D) = 0. When S is not empty, the unit occurrence degree u(s(i)) is obtained for each s(i) (i = 1, ..., n) as follows:
u(s(i)) = 1 if s(i) is an element of the word occurrence set W
u(s(i)) = e otherwise, where 0 < e < 1
(e is a predetermined value, for example 0.1).

[0038] From the above, the occurrence degree E(知, D1) = 0.1 is obtained. The occurrence degree E(w, D) of word w in document D obtained here is stored temporarily.

[0039] Steps S04 through S07 above are executed for every word extracted from the search condition Q (step S08). As a result, the occurrence degrees for document D1 are
E(機械, D1) = 0, E(の, D1) = 0, E(知, D1) = 0.1

[0040] Once the occurrence degrees have been calculated for all words extracted from the search condition Q, the satisfaction degree R(Q, D) of document D for the search condition Q is obtained from them (step S09). For example, if R(Q, D) is taken to be the sum of the occurrence degrees E(w, D) over all words extracted from the search condition Q, then
R(Q, D1) = E(機械, D1) + E(の, D1) + E(知, D1) = 0 + 0 + 0.1 = 0.1
The satisfaction degree R(Q, D) of document D for the search condition Q obtained here is stored temporarily in association with document D (step S07). Steps S03 through S09 above are executed for every document in the document group storage unit 30 (step S10).
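The loop over documents and words (steps S03 through S10) can be sketched end to end as below. All function names are hypothetical, documents are assumed to be pre-segmented as in paragraph [0008], and e = 0.1 is the example value from the text; this is an illustration, not the patented implementation.

```python
def string_occurrences(word, text):
    """Step S05: 1-based start positions of `word` as a character string."""
    positions, start = set(), text.find(word)
    while start != -1:
        positions.add(start + 1)
        start = text.find(word, start + 1)
    return positions


def word_occurrences(word, tokens):
    """Step S06: 1-based start positions of `word` as a whole word."""
    positions, offset = set(), 1
    for token in tokens:
        if token == word:
            positions.add(offset)
        offset += len(token)
    return positions


def occurrence_degree(s_set, w_set, e=0.1):
    """Step S07: E(w, D) as the sum of unit occurrence degrees."""
    return sum(1.0 if pos in w_set else e for pos in s_set)


def satisfaction_degree(query_words, tokens, e=0.1):
    """Steps S04 to S09: R(Q, D) as the sum of E(w, D) over the query words."""
    text = "".join(tokens)
    return sum(
        occurrence_degree(string_occurrences(w, text), word_occurrences(w, tokens), e)
        for w in query_words
    )


DOCS = {"D1": ["知識", "と", "演繹"], "D2": ["東京", "都"]}
Q = ["機械", "の", "知"]

# Steps S03 and S10: repeat for every stored document
scores = {doc_id: satisfaction_degree(Q, tokens) for doc_id, tokens in DOCS.items()}
```

Here `scores["D1"]` comes out at 0.1 and `scores["D2"]` at 0, matching the worked calculation.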

[0041] Finally, the control unit 10 calls the output unit 90, which selects the documents to be returned as the search result according to the sufficiency degree of each document and outputs them to a display device or a printer (step S11). For example, if the search result is the set of documents whose sufficiency degree is non-zero, the search result is {D1}. Alternatively, the documents (or their identifiers) may be arranged in descending order of sufficiency degree and returned as the search result; in this case, the search result is {D1, D2}. When there are many search results, only a fixed number of them may be output in descending order of sufficiency degree, or only those whose sufficiency degree is at or above a fixed value may be output.
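The selection strategies described for step S11 can be sketched as follows. The function names and the parameters `k` and `threshold` are hypothetical, and the scores are the sufficiency degrees of the running example (R(Q, D1) = 0.1, with D2 scoring zero).

```python
def select_nonzero(scores):
    """Documents whose sufficiency degree is non-zero."""
    return {doc for doc, r in scores.items() if r > 0}


def rank(scores):
    """Document identifiers in descending order of sufficiency degree."""
    return sorted(scores, key=lambda doc: scores[doc], reverse=True)


def top_k(scores, k):
    """Only the k highest-ranked documents."""
    return rank(scores)[:k]


def at_least(scores, threshold):
    """Ranked documents whose sufficiency degree is at or above a fixed value."""
    return [doc for doc in rank(scores) if scores[doc] >= threshold]


scores = {"D1": 0.1, "D2": 0.0}
print(select_nonzero(scores))  # {'D1'}
print(rank(scores))            # ['D1', 'D2']
print(top_k(scores, 1))        # ['D1']
print(at_least(scores, 0.05))  # ['D1']
```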

[0042] With the embodiment configured as described above, suppose that the documents D1 and D2

D1: 知識と演繹 ("knowledge and deduction")
D2: 東京都 ("Tokyo Metropolis")

are searched with the search conditions Q1 and Q2

Q1: 機械の知 ("machine knowledge")
Q2: 京都 ("Kyoto")

Then R(Q1, D1) is non-zero, so D1 becomes a search result. R(Q2, D2) is also non-zero, but if the document group contains other documents that include "京都" as a word, the sufficiency degree of D2 is lower than that of those documents. Therefore, if a fixed number of documents are selected in descending order of sufficiency degree as the search result, the document D2 is less likely to appear in the search result than with the conventional technique.
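The effect on the 京都 / 東京都 case can be sketched end to end as follows. Several things here are assumptions made only for illustration: documents are supplied already segmented into words (standing in for the word extraction unit), D2 is segmented as 東京 / 都, Q2 yields the single word 京都, and D3 is a hypothetical extra document that contains 京都 as a word. All function names are hypothetical.

```python
def find_all(text, w):
    """Offsets of every occurrence of the string w in text (appearance set S)."""
    positions = []
    i = text.find(w)
    while i != -1:
        positions.append(i)
        i = text.find(w, i + 1)
    return positions


def word_starts(tokens, w):
    """Offsets at which a token equal to w begins (word appearance set W)."""
    starts, offset = [], 0
    for token in tokens:
        if token == w:
            starts.append(offset)
        offset += len(token)
    return starts


def appearance_degree(tokens, w, e=0.1):
    """E(w, D): 1 per word occurrence, e per string-only occurrence."""
    word_set = set(word_starts(tokens, w))
    return sum(1.0 if s in word_set else e for s in find_all("".join(tokens), w))


def sufficiency(query_words, tokens, e=0.1):
    """R(Q, D): sum of the appearance degrees of the query words."""
    return sum(appearance_degree(tokens, w, e) for w in query_words)


docs = {
    "D2": ["東京", "都"],      # contains "京都" only as a substring of 東京都
    "D3": ["京都", "の", "寺"],  # hypothetical document with 京都 as a word
}
Q2 = ["京都"]

scores = {doc: sufficiency(Q2, tokens) for doc, tokens in docs.items()}
ranking = sorted(scores, key=lambda doc: scores[doc], reverse=True)
print(scores)   # {'D2': 0.1, 'D3': 1.0}
print(ranking)  # ['D3', 'D2']
```

D2 still receives a non-zero score, so it is not lost outright, but it ranks below any document that uses 京都 as a word and drops out when only the top-ranked documents are returned.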

[0043] Furthermore, the present invention is not limited to the embodiment described above. Needless to say, the object of the present invention is also achieved by implementing each function of the above embodiment as a program, writing the programs in advance to a recording medium such as a CD-ROM, loading the CD-ROM into a computer equipped with a medium drive such as a CD-ROM drive, storing the programs in the memory or storage device of the computer, and executing them. In this case, the programs read from the recording medium themselves realize the functions of the above embodiment, and the programs and the recording medium on which they are recorded also constitute the present invention.

[0044] The recording medium may be any of a semiconductor medium (for example, a ROM or a non-volatile memory card), an optical medium (for example, a DVD, MO, MD, or CD-R), or a magnetic medium (for example, a magnetic tape or a flexible disk).

[0045] The invention covers not only the case where the functions of the above embodiment are realized by executing the loaded program, but also the case where an operating system or the like performs part or all of the actual processing based on the instructions of the program, and that processing realizes the functions of the above embodiment.

[0046] Furthermore, when the above program is stored in a storage device such as a magnetic disk of a server computer and distributed, for example by download, to users' computers connected via a communication network, the storage device of the server computer is also included in the recording medium of the present invention.

[0047]

[Effects of the Invention] As described above, according to the present invention, for each word in the search condition and each document in the document group, the occurrence frequency of the character string or of the word in the document is not used as is. Instead, the appearance degree of the word in the document is obtained from the set of appearances of the word as a character string and the set of its appearances as a word, and the documents of the search result are selected based on the sufficiency degree obtained from those appearance degrees. Consequently, both retrieval omissions and excessive retrieval are less likely to occur than with the conventional technique, and an appropriate search result can be obtained for the search condition.

[Brief description of drawings]

FIG. 1 is a functional configuration diagram of a document retrieval device according to an embodiment of the present invention.

FIG. 2 is a flowchart illustrating the document retrieval procedure of the embodiment.

[Explanation of symbols]

10: control unit; 20: input unit; 30: document group storage unit; 40: word extraction unit; 50: character string search unit; 60: word search unit; 70: appearance degree estimation unit; 80: sufficiency degree calculation unit; 90: output unit.

Claims

[Claims]

1. A document retrieval device comprising: an input unit for inputting requests such as a search condition; a document group storage unit for storing a plurality of documents composed of character strings expressed in a natural language; a word extraction unit for extracting words from the search condition and from the documents; a character string search unit for obtaining, for each document, an appearance set of character strings identical to the notation of a word extracted from the search condition; a word search unit for obtaining, from the words extracted from the document, an appearance set in which the same word as the word extracted from the search condition appears; an appearance degree estimation unit for obtaining, for each word extracted from the search condition, an appearance degree based on the appearance sets obtained from the character string search unit and the word search unit; a sufficiency degree calculation unit for obtaining a sufficiency degree based on the appearance degrees obtained for all the words extracted from the search condition; and an output unit for selecting and outputting documents as the search result according to the sufficiency degrees obtained for all the documents stored in the document group storage unit.

2. The document retrieval device according to claim 1, wherein, for each element of the appearance set obtained by the character string search unit for each word extracted from the search condition, a value of 1 is assigned if the element is present in the appearance set of the word search unit and a default value is assigned otherwise, and the sum of these values over all elements is taken as the appearance degree.

3. The document retrieval device according to claim 1, wherein the sum of the appearance degrees obtained for the words extracted from the search condition is taken as the sufficiency degree.

4. A document retrieval method comprising: obtaining, for each word present in a search condition or the like, an appearance set in which the word appears as a character string and an appearance set in which it appears as a word in each document of a document group; obtaining an appearance degree for the word from these appearance sets; and selecting documents as the search result based on the sufficiency degree of each document obtained from the appearance degrees.

5. A program for causing a computer to function as the document retrieval device according to claim 1, 2, or 3.

6. A computer-readable recording medium on which the document retrieval program according to claim 5 is recorded.