JP2002024280A

JP2002024280A - Device and method for document retrieval

Info

Publication number: JP2002024280A
Application number: JP2000202561A
Authority: JP
Inventors: Yosuke Kunishi; 洋介国司
Original assignee: Shin Etsu Polymer Co Ltd; Shin Etsu Chemical Co Ltd
Current assignee: Shin Etsu Polymer Co Ltd; Shin Etsu Chemical Co Ltd
Priority date: 2000-07-04
Filing date: 2000-07-04
Publication date: 2002-01-25
Anticipated expiration: 2020-07-04
Also published as: JP3441703B2

Abstract

PROBLEM TO BE SOLVED: To provide a document retrieving device which retrieves a document having contents close to a reference document from plural object documents to be retrieved. SOLUTION: This document retrieving device 10 is equipped with an evaluated value DB 43 stored with keywords included in the reference document and evaluated values showing how often the keywords are included uniquely in the reference document and obtains a total value by totalizing the evaluated values of the keywords included in the object documents to be retrieved according to the evaluated value DB 43 and divides the total value by the number of the keywords included in the documents to be retrieved to find the document evaluated values of the documents to be retrieved. The document evaluated values are compared with a previously set reference value and an object document having a document evaluated value larger than the reference value is extracted as a document having contents close to the reference document.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

Technical Field of the Invention: The present invention relates to a document search apparatus and a document search method for searching a plurality of search target documents for documents whose contents are close to those of a reference document.

[0002]

Description of the Related Art: Keyword search has conventionally been used to find a desired document among a plurality of search target documents. A keyword search retrieves, from the search target documents, the documents that contain an input keyword. With this method, the user enters a keyword that the desired document is expected to contain and thereby retrieves the desired document from the search target documents.

[0003]

Problems to Be Solved by the Invention: However, keyword search has the following problems. When expressions are not used consistently across the documents to be searched, a keyword search can miss relevant documents because of variation in how a term is written. In addition, although the user selects the keyword, it is difficult to predict which keywords a desired document contains, so a keyword search sometimes fails to retrieve the desired document.

[0004] Accordingly, an object of the present invention is to solve the above problems and to provide a document search apparatus and a document search method that retrieve, from a plurality of search target documents, documents whose contents are close to those of a reference document.

[0005]

Means for Solving the Problems: A document search apparatus according to the present invention searches a plurality of search target documents for documents whose contents are close to those of a reference document, and comprises: evaluation value storage means that stores a plurality of keywords included in the reference document together with, for each keyword, an evaluation value indicating the degree to which the keyword is uniquely included in the reference document; search word extraction means that extracts all words included in the plurality of search target documents as search words; search word storage means that stores each search word extracted by the search word extraction means in association with a search target document code identifying the search target document from which it was extracted; evaluation value totaling means that compares the search words stored in the search word storage means with the keywords stored in the evaluation value storage means, assigns the keyword's evaluation value to each search word that matches a keyword, and totals the assigned evaluation values by search target document code to calculate a total value for each search target document; document evaluation value calculation means that divides the total value calculated for each search target document by the number of keywords included in that search target document to calculate a document evaluation value for each search target document; and search document extraction means that compares each document evaluation value with a preset reference value and extracts the search target documents whose document evaluation value is larger than the reference value.

[0006] In this way, the search word extraction means extracts all search words from the search target documents, and comparing the extracted search words with the keywords included in the reference document picks out, from the search target documents, every search word that matches a keyword of the reference document. Then, based on the evaluation values stored in the evaluation value storage means, the evaluation values of the keywords included in each search target document are totaled, and the total is divided by the number of keywords included in that search target document to calculate its document evaluation value. Comparing the document evaluation value with a preset reference value then makes it possible to extract the search target documents that contain keywords unique to the reference document. Search target documents whose contents are close to those of the reference document can thus be retrieved.
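The scoring described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation disclosed in the patent: the function name `score_documents` is hypothetical, a list of (document code, word) pairs stands in for the search word storage means, and a dictionary stands in for the evaluation value storage means. The sketch also assumes that "the number of keywords included in the search target document" means the number of that document's search words that matched a keyword, and that a document with no matching keyword is simply not extracted.

```python
from collections import defaultdict


def score_documents(search_words, evaluation_values, reference_value):
    """Return {document code: document evaluation value} for extracted documents.

    search_words      -- iterable of (doc_code, word) pairs, assumed to hold each
                         word at most once per document
    evaluation_values -- dict mapping keyword -> evaluation value
    reference_value   -- preset threshold; only documents scoring above it are kept
    """
    totals = defaultdict(float)   # total of evaluation values per document
    matched = defaultdict(int)    # number of matched keywords per document

    for doc_code, word in search_words:
        if word in evaluation_values:          # search word matches a keyword
            totals[doc_code] += evaluation_values[word]
            matched[doc_code] += 1

    # Document evaluation value = total / number of matched keywords.
    # Documents without any match never enter `totals` (assumption: not extracted).
    scores = {doc: totals[doc] / matched[doc] for doc in totals}

    # Keep only documents whose score is larger than the reference value.
    return {doc: score for doc, score in scores.items() if score > reference_value}
```

Dividing by the number of matched keywords keeps a long document from scoring highly merely because it contains many words; how to treat a document with no matching keyword is not stated in this passage, so the sketch leaves such documents out.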

[0007] In the above document search apparatus, the search word extraction means preferably does not extract the same search word more than once from a single search target document.

[0008] In the above document search apparatus, the evaluation value storage means may be characterized in that no keyword is stored in it more than once. Avoiding duplicate keywords in the evaluation value storage means in this way optimizes the time required for comparison with the search words stored in the search word storage means.

[0009] In the above document search apparatus, the keywords may be extracted by treating the hiragana, punctuation marks, special symbols, and spaces in the reference document as delimiters, and the search words may be extracted by treating the hiragana, punctuation marks, special symbols, and spaces in the search target documents as delimiters.

[0010] Hiragana, punctuation marks, special symbols, and spaces appear in documents regardless of their contents, so excluding them from the keywords and search words improves the accuracy of the document search.
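The delimiter-based word extraction can be sketched as below. This is only an illustration under assumptions: the function name `extract_words` is hypothetical, "special symbols" and punctuation are approximated by every character that Python's `re` module treats as a non-word character (`\W`, which also covers ordinary and full-width spaces), and hiragana is taken to be the Unicode block U+3040 to U+309F. The sketch also drops repeated words so that the same word is not extracted twice from one document.

```python
import re

# Delimiters: hiragana (U+3040-U+309F) plus any non-word character
# (punctuation, symbols, and whitespace including the full-width space).
# Kanji, katakana, Latin letters, and digits are word characters and survive.
_DELIMITERS = re.compile(r"[\u3040-\u309F\W]+")


def extract_words(text):
    """Return the distinct words of `text` in order of first appearance."""
    seen = {}  # dict preserves insertion order, so it doubles as an ordered set
    for token in _DELIMITERS.split(text):
        if token:  # split() yields empty strings at leading/trailing delimiters
            seen.setdefault(token, None)
    return list(seen)
```

For example, `extract_words("キーワード検索、キーワード検索 と 文書")` yields the two words `キーワード検索` and `文書`: the comma, the spaces, and the hiragana `と` act only as separators, and the repeated word is kept once.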

[0011] The above document search apparatus may further comprise: keyword extraction means that extracts all words from at least one reference document (基準文書) as keywords; keyword storage means that stores the keywords extracted by the keyword extraction means for each reference document; comparison word extraction means that extracts all words, as comparison words (参照ワード), from a plurality of comparison documents (参照文書) that are consulted to determine the evaluation values of the keywords; all-word storage means that stores the keywords extracted for each reference document by the keyword extraction means and the comparison words extracted for each comparison document by the comparison word extraction means; reference-document keyword appearance rate calculation means that, for each keyword, divides the number of entries of that keyword in the keyword storage means by the number of reference documents to calculate a reference-document keyword appearance rate; all-document keyword appearance rate calculation means that, for each keyword, divides the sum of the number of entries of that keyword in the all-word storage means and the number of comparison words identical to the keyword by the total number of reference documents and comparison documents to calculate an all-document keyword appearance rate; and evaluation value calculation means that divides the reference-document keyword appearance rate by the all-document keyword appearance rate to calculate the evaluation value of the keyword, the evaluation value calculated by the evaluation value calculation means being stored in the evaluation value storage means.

[0012] The evaluation value obtained in this way, by dividing the appearance rate of a keyword in the reference documents by its appearance rate in all documents (the reference documents and the comparison documents (参照文書) combined), indicates the degree to which the keyword is uniquely included in the reference documents. That is, the appearance rate in the reference documents shows how frequently each keyword appears in the reference documents, and the appearance rate in all documents shows how frequently each keyword appears across all documents. A keyword that appears frequently in the reference documents but seldom in the documents as a whole can therefore be judged to be unique to the reference documents. Conversely, a keyword that appears frequently in the reference documents and also frequently in all documents can be judged to be a word that is common in documents generally rather than unique to the reference documents. Accordingly, the ratio (reference-document keyword appearance rate) / (all-document keyword appearance rate) quantifies the degree to which a keyword is uniquely included in the reference documents and can be used as the evaluation value.
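The ratio of the two appearance rates can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the patent's implementation: the function name `evaluation_values_all_docs` is hypothetical, each document is represented as a set of its distinct words (so a word counts at most once per document), and "comparison documents" denotes the 参照文書, the documents consulted when setting the evaluation values.

```python
def evaluation_values_all_docs(reference_docs, comparison_docs):
    """Return {keyword: evaluation value}.

    reference_docs  -- list of sets of words, one set per reference document
    comparison_docs -- list of sets of words, one set per comparison document
    """
    n_ref = len(reference_docs)
    n_all = n_ref + len(comparison_docs)

    values = {}
    # Every word found in any reference document is a keyword.
    for keyword in set().union(*reference_docs):
        in_ref = sum(keyword in doc for doc in reference_docs)
        in_all = in_ref + sum(keyword in doc for doc in comparison_docs)

        ref_rate = in_ref / n_ref   # appearance rate in the reference documents
        all_rate = in_all / n_all   # appearance rate in all documents
        # all_rate is never 0 here: a keyword appears in at least one reference document.
        values[keyword] = ref_rate / all_rate
    return values
```

With two reference documents `{"silicone", "report"}` and `{"silicone", "rubber"}` and two comparison documents that both contain `"report"`, the word `"silicone"` gets 1.0 / 0.5 = 2.0 while `"report"` gets 0.5 / 0.75, about 0.67, reflecting that `"report"` is common everywhere.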

[0013] The above document search apparatus may further comprise: keyword extraction means that extracts all words from at least one reference document (基準文書) as keywords; keyword storage means that stores the keywords extracted by the keyword extraction means for each reference document; comparison word extraction means that extracts all words, as comparison words (参照ワード), from a plurality of comparison documents (参照文書) that are consulted to determine the evaluation values of the keywords; comparison word storage means that stores the comparison words extracted for each comparison document by the comparison word extraction means; reference-document keyword appearance rate calculation means that, for each keyword, divides the number of entries of that keyword in the keyword storage means by the number of reference documents to calculate a reference-document keyword appearance rate; comparison-document keyword appearance rate calculation means that, for each keyword, divides the number of comparison words in the comparison word storage means that are identical to the keyword by the number of comparison documents to calculate a comparison-document keyword appearance rate; and evaluation value calculation means that divides the reference-document keyword appearance rate by the sum of the comparison-document keyword appearance rate and a predetermined constant to calculate the evaluation value of the keyword, the evaluation value calculated by the evaluation value calculation means being stored in the evaluation value storage means.

[0014] The evaluation value obtained in this way, by dividing the appearance rate of a keyword in the reference documents by the sum of its appearance rate in the comparison documents (参照文書) and a predetermined constant, indicates the degree to which the keyword is uniquely included in the reference documents. That is, the appearance rate in the reference documents shows how frequently each keyword appears in the reference documents, and the appearance rate in the comparison documents shows how frequently each keyword appears in the comparison documents. A keyword that appears frequently in the reference documents but seldom in the comparison documents can therefore be judged to be unique to the reference documents. Conversely, a keyword that appears frequently in the reference documents and also frequently in the comparison documents can be judged to be a word that is common in documents generally rather than unique to the reference documents. Accordingly, the ratio (reference-document keyword appearance rate) / (comparison-document keyword appearance rate + constant) quantifies the degree to which a keyword is uniquely included in the reference documents and can be used as the evaluation value. Moreover, because a predetermined constant is added to the comparison-document keyword appearance rate, the denominator never becomes 0 even when the keyword does not appear in any comparison document, so the evaluation value can always be calculated.
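The variant with the added constant can be sketched in the same way. Again this is a hedged illustration: the function name `evaluation_values_with_constant` and the default constant `0.01` are assumptions (this passage does not state a value), each document is represented as a set of its distinct words, and "comparison documents" denotes the 参照文書 consulted when setting the evaluation values.

```python
def evaluation_values_with_constant(reference_docs, comparison_docs, constant=0.01):
    """Return {keyword: evaluation value} using rate_ref / (rate_cmp + constant).

    reference_docs  -- list of sets of words, one set per reference document
    comparison_docs -- list of sets of words, one set per comparison document
    constant        -- positive value added to the denominator (illustrative default)
    """
    if not comparison_docs:
        raise ValueError("at least one comparison document is required")
    if constant <= 0:
        raise ValueError("constant must be positive to keep the denominator non-zero")

    n_ref = len(reference_docs)
    n_cmp = len(comparison_docs)

    values = {}
    for keyword in set().union(*reference_docs):
        ref_rate = sum(keyword in doc for doc in reference_docs) / n_ref
        cmp_rate = sum(keyword in doc for doc in comparison_docs) / n_cmp
        # cmp_rate may be 0 when no comparison document contains the keyword;
        # the constant keeps the division defined in that case.
        values[keyword] = ref_rate / (cmp_rate + constant)
    return values
```

A smaller constant rewards keywords that are absent from the comparison documents more strongly, so its value effectively sets the upper bound of the evaluation value (1 / constant for a keyword present in every reference document and in no comparison document).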

[0015] In the above document search apparatus, the keyword extraction means preferably does not extract the same keyword more than once from a single reference document.

[0016] In the above document search apparatus, the comparison word extraction means preferably does not extract the same comparison word (参照ワード) more than once from a single comparison document (参照文書).

[0017] A document search method according to the present invention searches a plurality of search target documents for documents whose contents are close to those of a reference document, and comprises: an evaluation value storing step of storing a plurality of keywords included in the reference document together with, for each keyword, an evaluation value indicating the degree to which the keyword is uniquely included in the reference document; a search word extracting step of extracting all words included in the plurality of search target documents as search words; a search word storing step of storing each search word extracted in the search word extracting step in association with a search target document code identifying the search target document from which it was extracted; an evaluation value totaling step of comparing the search words stored in the search word storing step with the keywords stored in the evaluation value storing step, assigning the keyword's evaluation value to each search word that matches a keyword, and totaling the assigned evaluation values by search target document code to calculate a total value for each search target document; a document evaluation value calculating step of dividing the total value calculated for each search target document in the evaluation value totaling step by the number of keywords included in that search target document to calculate a document evaluation value for each search target document; and a search document extracting step of comparing each document evaluation value calculated in the document evaluation value calculating step with a preset reference value and extracting the search target documents whose document evaluation value is larger than the reference value.

[0018] In this way, all search words are extracted from the search target documents in the search word extracting step, and comparing the extracted search words with the keywords included in the reference document picks out, from the search target documents, every search word that matches a keyword of the reference document. Then, based on the evaluation values stored in the evaluation value storing step, the evaluation values of the keywords included in each search target document are totaled, and the total is divided by the number of keywords included in that search target document to calculate its document evaluation value. Comparing the document evaluation value with a preset reference value then makes it possible to extract the search target documents that contain keywords unique to the reference document. Search target documents whose contents are close to those of the reference document can thus be retrieved.

[0019] In the above document search method, the search word extracting step preferably does not extract the same search word more than once from a single search target document.

[0020] In the above document search method, the evaluation value storing step preferably does not store the same keyword more than once. Avoiding duplicate keywords in the evaluation value storing step in this way optimizes the time required for comparison with the search words.

[0021] In the above document search method, the keywords are preferably extracted by treating the hiragana, punctuation marks, special symbols, and spaces in the reference document as delimiters, and the search words are preferably extracted by treating the hiragana, punctuation marks, special symbols, and spaces in the search target documents as delimiters.

[0022] Hiragana, punctuation marks, special symbols, and spaces appear in documents regardless of their contents, so excluding them from the keywords and search words improves the accuracy of the document search.

[0023] The above document search method may further comprise: a keyword extracting step of extracting all words from at least one reference document (基準文書) as keywords; a keyword storing step of storing, in keyword storage means, the keywords extracted for each reference document in the keyword extracting step; a comparison word extracting step of extracting all words, as comparison words (参照ワード), from a plurality of comparison documents (参照文書) that are consulted to determine the evaluation values of the keywords; an all-word storing step of storing, in all-word storage means, the keywords extracted for each reference document in the keyword extracting step and the comparison words extracted for each comparison document in the comparison word extracting step; a reference-document keyword appearance rate calculating step of dividing, for each keyword, the number of entries of that keyword in the keyword storage means by the number of reference documents to calculate a reference-document keyword appearance rate; an all-document keyword appearance rate calculating step of dividing, for each keyword, the sum of the number of entries of that keyword in the all-word storage means and the number of comparison words identical to the keyword by the total number of reference documents and comparison documents to calculate an all-document keyword appearance rate; and an evaluation value calculating step of dividing the reference-document keyword appearance rate by the all-document keyword appearance rate to calculate the evaluation value of the keyword, the evaluation value calculated in the evaluation value calculating step being stored in the evaluation value storing step.

[0024] The evaluation value obtained in this way, by dividing the appearance rate of a keyword in the reference documents by its appearance rate in all documents (the reference documents and the comparison documents (参照文書) combined), indicates the degree to which the keyword is uniquely included in the reference documents. That is, the appearance rate in the reference documents shows how frequently each keyword appears in the reference documents, and the appearance rate in all documents shows how frequently each keyword appears across all documents. A keyword that appears frequently in the reference documents but seldom in the documents as a whole can therefore be judged to be unique to the reference documents. Conversely, a keyword that appears frequently in the reference documents and also frequently in all documents can be judged to be a word that is common in documents generally rather than unique to the reference documents. Accordingly, the ratio (reference-document keyword appearance rate) / (all-document keyword appearance rate) quantifies the degree to which a keyword is uniquely included in the reference documents and can be used as the evaluation value.

[0025] The above document search method may further comprise: a keyword extracting step of extracting all words from at least one reference document (基準文書) as keywords; a keyword storing step of storing, in keyword storage means, the keywords extracted for each reference document in the keyword extracting step; a comparison word extracting step of extracting all words, as comparison words (参照ワード), from a plurality of comparison documents (参照文書) that are consulted to determine the evaluation values of the keywords; a comparison word storing step of storing, in comparison word storage means, the comparison words extracted for each comparison document in the comparison word extracting step; a reference-document keyword appearance rate calculating step of dividing, for each keyword, the number of entries of that keyword in the keyword storage means by the number of reference documents to calculate a reference-document keyword appearance rate; a comparison-document keyword appearance rate calculating step of dividing, for each keyword, the number of comparison words in the comparison word storage means that are identical to the keyword by the number of comparison documents to calculate a comparison-document keyword appearance rate; and an evaluation value calculating step of dividing the reference-document keyword appearance rate by the sum of the comparison-document keyword appearance rate and a predetermined constant to calculate the evaluation value of the keyword, the evaluation value calculated in the evaluation value calculating step being stored in the evaluation value storing step.

[0026] The evaluation value obtained in this way, by dividing the appearance rate of a keyword in the reference documents by the sum of its appearance rate in the comparison documents (参照文書) and a predetermined constant, indicates the degree to which the keyword is uniquely included in the reference documents. That is, the appearance rate in the reference documents shows how frequently each keyword appears in the reference documents, and the appearance rate in the comparison documents shows how frequently each keyword appears in the comparison documents. A keyword that appears frequently in the reference documents but seldom in the comparison documents can therefore be judged to be unique to the reference documents. Conversely, a keyword that appears frequently in the reference documents and also frequently in the comparison documents can be judged to be a word that is common in documents generally rather than unique to the reference documents. Accordingly, the ratio (reference-document keyword appearance rate) / (comparison-document keyword appearance rate + constant) quantifies the degree to which a keyword is uniquely included in the reference documents and can be used as the evaluation value. Moreover, because a predetermined constant is added to the comparison-document keyword appearance rate, the denominator never becomes 0 even when the keyword does not appear in any comparison document, so the evaluation value can always be calculated.

[0027] In the above document search method, the keyword extracting step preferably does not extract the same keyword more than once from a single reference document.

[0028] In the above document search method, the comparison word extracting step preferably does not extract the same comparison word (参照ワード) more than once from a single comparison document (参照文書).

[0029]

Description of the Preferred Embodiments: Preferred embodiments of the document search apparatus according to the present invention are described in detail below with reference to the drawings. In the description of the drawings, identical elements are given identical reference symbols and redundant description is omitted.

[0030] First, an overview of the document search apparatus 10 according to this embodiment is given. The document search apparatus 10 retrieves, from a plurality of search target documents, documents whose contents are close to those of a reference document.

[0031] FIG. 1 is a block diagram showing the configuration of the document search device 10. The document search device 10 includes input means 20 for inputting documents, word extraction means 30 for extracting words from the documents, and databases 40 for storing the words extracted by the word extraction means 30. To search a plurality of search target documents for documents whose contents are close to the reference document, the document search device 10 also includes evaluation value totaling means 54, document evaluation value calculation means 55, and search document extraction means 56. To calculate the evaluation values, the document search device 10 further includes reference-document keyword appearance rate calculation means 51, all-document keyword appearance rate calculation means 52, and evaluation value calculation means 53.

[0032] The input means 20 includes reference document input means 21 for inputting reference documents, comparison document input means 22 for inputting comparison documents, and search target document input means 23 for inputting search target documents. Here, a comparison document is a document that is consulted when evaluation values are set for the keywords contained in the reference document. The document search device 10 according to the present embodiment is provided with separate reference document input means 21, comparison document input means 22, and search target document input means 23. However, these differ only in the documents they take as input and share the same function of inputting a document, so they may be integrated into a single input means. Specific examples of the input means include a scanner that reads documents printed on paper and a disk drive that reads documents saved as files.

[0033] The word extraction means 30 includes keyword extraction means 31 for extracting words from reference documents as keywords, comparison word extraction means 32 for extracting words from comparison documents as comparison words, and search word extraction means 33 for extracting words from search target documents as search words. The document search device 10 according to the present embodiment is provided with separate keyword extraction means 31, comparison word extraction means 32, and search word extraction means 33. However, these differ only in the documents from which the words are extracted and share the same function of extracting words from a document, so they may be integrated into a single word extraction means. Each word extraction means extracts the words in a document by treating hiragana, punctuation marks, special symbols, and spaces as delimiters. Each word extraction means also avoids extracting the same word twice from a single document: every word cut out of a document is checked against the words already cut out of the same document, and only words that do not match are extracted.
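The extraction rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `extract_words` is hypothetical, and the exact delimiter characters are a guess, since the description names the delimiter categories (hiragana, punctuation marks, special symbols, spaces) without listing the characters.

```python
import re

# Assumed delimiter set (not specified character by character in the source):
# hiragana, whitespace, CJK symbols and punctuation, full-width symbols,
# and ASCII punctuation.
_DELIMITERS = re.compile(
    r"[\u3041-\u309F\s\u3000-\u303F\uFF00-\uFF0F\uFF1A-\uFF20"
    r"\uFF3B-\uFF40\uFF5B-\uFF65!-/:-@\[-`{-~]+"
)


def extract_words(text: str) -> list[str]:
    """Return the words of one document in order of first appearance, without duplicates."""
    seen: set[str] = set()
    words: list[str] = []
    for word in _DELIMITERS.split(text):
        # Skip empty fragments and words already cut out of this document.
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return words


sample_words = extract_words("課題を解決する遮光性着色層の形成。遮光性着色層は")
```

In this sketch the second occurrence of "遮光性着色層" is dropped because it has already been extracted from the same document.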

[0034] Next, the databases 40 are described. The document search device 10 has a keyword database (hereinafter "keyword DB") 41, an all-word database (hereinafter "all-word DB") 42, an evaluation value database (hereinafter "evaluation value DB") 43, and a search word database (hereinafter "search word DB") 44.

[0035] The keyword DB 41 stores the keywords extracted from the reference documents. Each keyword is stored in association with a reference document code that identifies the reference document from which it was extracted. FIG. 4 shows an example of the keyword DB 41.

[0036] The all-word DB 42 stores the keywords extracted from the reference documents and the comparison words extracted from the comparison documents. The keywords and the comparison words are stored in association with, respectively, the reference document code identifying the source reference document and the comparison document code identifying the source comparison document. FIG. 5 shows an example of the all-word DB 42.

[0037] The evaluation value DB 43 stores evaluation values indicating the degree to which each keyword contained in the reference documents is specific to those reference documents. FIG. 7 shows an example of the evaluation value DB 43.

[0038] The search word DB 44 stores the search words extracted from the search target documents. Each search word is stored in association with a search target document code that identifies the search target document from which it was extracted. FIG. 8 shows an example of the search word DB 44.

[0039] Next, the means for retrieving the desired documents from the search target documents are described.

[0040] The evaluation value totaling means 54 has a function of adding up the evaluation values of all the keywords contained in one search target document. The total value obtained in this way is input to the document evaluation value calculation means 55.

[0041] The document evaluation value calculation means 55 has a function of calculating a document evaluation value by dividing the total value obtained by the evaluation value totaling means 54 by the number of keywords contained in that search target document. The calculated document evaluation value is input to the search document extraction means 56.

[0042] The search document extraction means 56 has a function of extracting the desired documents from the plurality of search target documents by comparing the document evaluation value calculated by the document evaluation value calculation means 55 with a preset reference value. Specifically, when a document evaluation value is larger than the reference value, the document having that document evaluation value is extracted.

[0043] Next, the means for calculating the evaluation values of the keywords are described.

[0044] The reference-document keyword appearance rate calculation means 51 has a function of calculating the rate at which a given keyword appears across the plurality of reference documents. When there are M reference documents and A of them contain the keyword, the keyword appearance rate in the reference document is calculated as A/M. In the document search device 10 according to the present embodiment, the keywords stored in the keyword DB 41 are searched to count how many times the same keyword occurs, and this count is divided by the number of reference documents to give the keyword appearance rate in the reference document. When there is only one reference document, the keywords are extracted without duplication, so the keyword appearance rate in the reference document is 1 for every keyword.
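As a minimal sketch of the A/M calculation, assuming the keyword DB is held as a mapping from reference document code to that document's de-duplicated keywords, the rate could be computed as below. The function name and the sample contents of `keyword_db` are hypothetical; only the fact that "problem" appears in all of B001 to B003 is taken from the description.

```python
def keyword_appearance_rate(keyword: str, documents: dict[str, set[str]]) -> float:
    """Return A / M: the share of the M documents that contain the keyword."""
    m = len(documents)
    a = sum(1 for words in documents.values() if keyword in words)
    return a / m


# Hypothetical keyword DB: reference document code -> de-duplicated keywords.
keyword_db = {
    "B001": {"problem", "light-shielding colored layer", "formation"},
    "B002": {"problem", "formation"},
    "B003": {"problem"},
}

rate_problem = keyword_appearance_rate("problem", keyword_db)      # A = 3, M = 3
rate_formation = keyword_appearance_rate("formation", keyword_db)  # A = 2, M = 3
```

Because each document contributes a keyword at most once, counting documents that contain the keyword is equivalent to counting identical entries in the keyword DB.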

[0045] The all-document keyword appearance rate calculation means 52 has a function of calculating the rate at which a given keyword appears across all documents, that is, the reference documents and the comparison documents combined. When there are M reference documents and N comparison documents, and B of them contain the keyword, the keyword appearance rate in all documents is calculated as B/(M+N). In the document search device 10 according to the present embodiment, the keywords and comparison words stored in the all-word DB 42 are searched to count the occurrences of each keyword together with the comparison words identical to that keyword. Here, "comparison word" is simply a name given for convenience to a word extracted from a comparison document, so "a comparison word identical to a keyword" means a keyword that is contained in a comparison document. The resulting count is divided by the total number of documents to give the keyword appearance rate in all documents.
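The B/(M+N) calculation can be sketched in the same way. The following Python example assumes the reference documents and comparison documents are each held as a mapping from document code to a set of de-duplicated words; the function name and the sample data are hypothetical.

```python
def all_document_rate(
    keyword: str,
    reference_docs: dict[str, set[str]],
    comparison_docs: dict[str, set[str]],
) -> float:
    """Return B / (M + N) for one keyword over reference and comparison documents."""
    # Pool both document sets, as the all-word DB does.
    all_docs = list(reference_docs.values()) + list(comparison_docs.values())
    b = sum(1 for words in all_docs if keyword in words)
    return b / len(all_docs)


# Hypothetical data: M = 2 reference documents, N = 3 comparison documents.
reference_docs = {
    "B001": {"problem", "semiconductor"},
    "B002": {"problem", "semiconductor"},
}
comparison_docs = {
    "R001": {"problem", "resin"},
    "R002": {"problem"},
    "R003": {"conventional"},
}

rate_generic = all_document_rate("problem", reference_docs, comparison_docs)         # B = 4
rate_specific = all_document_rate("semiconductor", reference_docs, comparison_docs)  # B = 2
```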

[0046] The evaluation value calculation means 53 has a function of calculating the evaluation value of a keyword by dividing the keyword appearance rate in the reference document by the keyword appearance rate in all documents.

[0047] Next, the operation of the document search device 10 of the present embodiment is described, together with the document search method according to the embodiment.

[0048] The description here takes as an example the case of searching a plurality of patent publications, which are the search target documents, for patent publications whose contents are close to reference patent publications. FIG. 2 is a flowchart showing the operation of the document search device 10.

[0049] As shown in FIG. 2, a document search by the document search device 10 can be divided into an evaluation value calculation stage (S10 to S18), in which evaluation values are calculated for the keywords contained in the reference documents, and a document search stage (S20 to S26), in which documents are searched on the basis of the evaluation values. When the keyword evaluation values have already been calculated, the process can proceed directly to the document search stage (S20 to S26) without going through the evaluation value calculation stage (S10 to S18).

[0050] First, the evaluation value calculation stage (S10 to S18) is described. To begin, the keyword extraction means 31 extracts keywords from the reference documents (S10). FIG. 3 shows an example of a reference document. When this reference document is read from the beginning, treating hiragana, punctuation marks, special symbols, and spaces as delimiters, words such as "problem" (課題), "light-shielding colored layer" (遮光性着色層), and "formation" (形成) are extracted one after another as keywords. The word "light-shielding colored layer" also appears in the second line of the reference document shown in FIG. 3, but because it has already been extracted as a keyword, it is not extracted again. In this way, all the words in the reference document are extracted as keywords, and the extracted keywords are stored in the keyword DB 41 in association with the reference document code identifying the reference document (S10). FIG. 4 shows an example of the keyword DB 41. Here, keywords have been extracted from three reference documents, B001 to B003.

[0051] Next, the comparison word extraction means 32 extracts comparison words from the comparison documents (S12). As with the keywords, the comparison words are extracted by treating hiragana, punctuation marks, special symbols, and spaces as delimiters. The extracted comparison words are stored in the all-word DB 42 in association with the comparison document code identifying the comparison document (S12). The keywords extracted by the keyword extraction means 31 are also stored in the all-word DB 42. FIG. 5 shows an example of the all-word DB 42. Here, comparison words have been extracted from 97 comparison documents, R001 to R097.

[0052] Next, the reference-document keyword appearance rate calculation means 51 calculates the keyword appearance rate in the reference document on the basis of the keyword DB 41 (S14). A specific example is described with reference to FIG. 6. In the keyword DB 41 shown in FIG. 4, the keyword "problem" is contained in every one of the reference documents B001 to B003, so its keyword count in the reference documents is 3. Since there are three reference documents, its keyword appearance rate in the reference document is 1. The keyword counts and keyword appearance rates in the reference documents for the other keywords are calculated in the same way, as shown in FIG. 6. The case of three reference documents is described here, but when there is only one reference document, the keyword count in the reference document and the number of reference documents are both 1, so the keyword count and the keyword appearance rate in the reference document are both always 1.

[0053] The all-document keyword appearance rate calculation means 52 calculates, by the same method as for the keyword appearance rate in the reference document, the keyword appearance rate across all documents, that is, the reference documents and the comparison documents combined (S16). Specifically, the number of occurrences of each keyword across all documents is divided by the total number of documents (100 in this case) to give the keyword appearance rates in all documents shown in FIG. 6.

[0054] Next, the evaluation value calculation means 53 calculates the evaluation value of each keyword by dividing the keyword appearance rate in the reference document by the keyword appearance rate in all documents (keyword appearance rate in the reference document / keyword appearance rate in all documents) (S18). FIG. 7 shows the calculated evaluation values. The evaluation value indicates the degree to which a keyword is specific to the reference documents. In more detail, the keyword appearance rate in the reference document, which is the numerator of the evaluation value formula, is the proportion of the reference documents that contain the keyword, and its magnitude can be said to depend on how specialized the keyword is. For example, if documents about semiconductors are chosen as the reference documents, semiconductor-related terms such as "semiconductor" and "P-type layer" can be expected to appear in every reference document, and they are important key words for the document search. However, very general terms such as "problem" and "conventional" also appear frequently in the reference documents, so an accurate document search cannot be performed if the keyword appearance rate in the reference document is used directly as the evaluation value. The keyword appearance rate in all documents, which is the denominator of the evaluation value formula, is the proportion of all documents that contain the keyword, and its magnitude depends on how general the keyword is. That is, keywords with a high appearance rate in all documents, such as "problem" and "conventional", are keywords used in publications in every technical field. Dividing by the keyword appearance rate in all documents therefore keeps the evaluation values of such general keywords from becoming large, and yields an evaluation value that balances the specificity and the generality of each keyword contained in the reference documents.
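The effect of the division can be illustrated with a short Python sketch. The document counts (3 reference documents, 100 documents in total) follow the example in the description, but the function name and the assumed distribution of each keyword across the documents are hypothetical.

```python
def evaluation_value(a: int, m: int, b: int, total: int) -> float:
    """(A / M) / (B / total): reference-document rate divided by all-document rate.

    a: reference documents containing the keyword, m: number of reference documents,
    b: documents of any kind containing the keyword, total: number of all documents.
    """
    # Every keyword comes from a reference document, and the reference documents
    # are part of "all documents", so b >= a >= 1 and the divisor is never zero.
    if a <= 0 or b < a:
        raise ValueError("expected 1 <= a <= b")
    return (a / m) / (b / total)


# A general word assumed to appear in all 100 documents.
generic = evaluation_value(a=3, m=3, b=100, total=100)
# A specialized word assumed to appear only in the 3 reference documents.
specific = evaluation_value(a=3, m=3, b=3, total=100)
```

Under these assumed counts the general word scores 1, while the specialized word scores roughly 33, even though both appear in every reference document.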

[0055] Next, the document search stage (S20 to S26) is described.

[0056] First, the search word extraction means 33 extracts search words from the search target documents and stores the extracted search words in the search word DB 44 in association with the search target document code identifying the source search target document (S20). FIG. 8 shows the search word DB 44.

[0057] Next, the search words stored in the search word DB 44 are matched against the keywords stored in the evaluation value DB 43, and the keyword evaluation values are totaled for each search target document identified by its search target document code (S22). That is, each search word in the search word DB 44 is compared with the keywords, each search word that matches a keyword is given that keyword's evaluation value, and the assigned evaluation values are added up for each search target document to obtain the total value. FIG. 9 shows an example of calculating the total of the evaluation values.

[0058] Subsequently, the document evaluation value calculation means 55 counts the number of keywords contained in each search target document. It then calculates the document evaluation value of each search target document by dividing the total value obtained by the evaluation value totaling means 54 by that number of keywords (S24). For example, as shown in FIG. 9, the search target document with code F001 contains 53 keywords whose evaluation values total 38.27, so its document evaluation value is 0.72.
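Steps S22 and S24 can be sketched together in Python. This is an illustration only: it assumes that "the number of keywords" means the number of distinct search words that match an entry in the evaluation value DB, and that a document with no matching keyword scores 0. The function name and the sample evaluation values are hypothetical; the figures 38.27 and 53 are the F001 values from the description.

```python
def document_evaluation_value(search_words: list[str], evaluation_db: dict[str, float]) -> float:
    """Total the evaluation values of matched keywords and divide by their count."""
    # S22: give each matching search word its keyword's evaluation value.
    matched = [evaluation_db[word] for word in set(search_words) if word in evaluation_db]
    if not matched:
        return 0.0  # assumption: a document containing no keywords scores 0
    # S24: divide the total by the number of keywords in the document.
    return sum(matched) / len(matched)


# Hypothetical evaluation value DB and search words for one document.
evaluation_db = {"semiconductor": 3.0, "problem": 1.0}
score = document_evaluation_value(["semiconductor", "problem", "unrelated"], evaluation_db)

# The F001 arithmetic from the description: 38.27 / 53.
f001_score = round(38.27 / 53, 2)
```

Dividing by the keyword count makes the score an average per keyword, so a long document is not favored merely because it contains more keywords.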

[0059] Next, the document extraction means compares each document evaluation value with the preset reference value and extracts, as retrieved documents, the documents whose document evaluation values are larger than the reference value (S26). In this way, the search target documents whose contents are close to the reference documents are retrieved from among the plurality of search target documents. In FIG. 9, the reference value is set to 2, so the two documents with search target document codes F007 and F008 are extracted as documents whose contents are close to the reference documents.
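The comparison in S26 amounts to a strict threshold filter, sketched below in Python. The function name is hypothetical, and apart from the F001 value of 0.72 the document evaluation values are invented for illustration; the description states only that F007 and F008 exceed the reference value of 2.

```python
def extract_similar(document_scores: dict[str, float], reference_value: float) -> list[str]:
    """Return the codes of documents whose evaluation value is strictly larger than the reference value."""
    return [code for code, score in document_scores.items() if score > reference_value]


# Hypothetical document evaluation values (only F001 = 0.72 comes from the description).
document_scores = {"F001": 0.72, "F007": 2.4, "F008": 3.1, "F009": 2.0}

retrieved = extract_similar(document_scores, reference_value=2.0)
```

Note that a document whose value equals the reference value, such as the hypothetical F009, is not extracted, because the description requires the value to be larger than the reference value.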

[0060] Next, the effects of the document search device 10 and the document search method according to the present embodiment are described.

[0061] The document search device 10 according to the present embodiment includes the evaluation value DB 43, which stores evaluation values indicating how specific each keyword contained in the reference documents is to those reference documents, and searches for documents on the basis of these evaluation values. It can therefore extract, from a plurality of search target documents, documents whose contents are close to the reference documents.

[0062] The document search device 10 according to the present embodiment also calculates each evaluation value by dividing the keyword appearance rate in the reference document, which is the rate at which the keyword appears in the reference documents, by the keyword appearance rate in all documents, which is the rate at which the keyword appears in the reference documents and comparison documents combined. The evaluation value therefore indicates the degree to which each keyword is specific to the reference documents, and using it enables a highly accurate document search.

[0063] The document search method according to the present embodiment calculates, for each keyword, an evaluation value indicating how specific the keyword contained in the reference documents is to those reference documents, and searches for documents on the basis of these evaluation values. It can therefore extract, from a plurality of search target documents, documents whose contents are close to the reference documents.

[0064] Next, a document search device 60 according to a second embodiment of the present invention is described. The document search device 60 according to the second embodiment has the same basic configuration as the document search device 10 according to the first embodiment in that it includes the input means 20 for inputting documents, the word extraction means 30 for extracting words from the documents, and the databases 40 for storing the words extracted by the word extraction means 30, and in that it includes the evaluation value totaling means 54, the document evaluation value calculation means 55, and the search document extraction means 56 for searching a plurality of search target documents for documents whose contents are close to the reference documents.

[0065] Next, the elements of the document search device 60 of the second embodiment whose functions differ from those of the document search device 10 of the first embodiment are described.

[0066] A comparison word database (hereinafter "comparison word DB") 46 stores the comparison words extracted from the comparison documents by the comparison word extraction means 32, in association with the comparison documents from which they were extracted. FIG. 11 shows an example of the comparison word DB 46. As shown in FIG. 11, the comparison word DB 46 differs from the all-word DB 42 in that it does not store the keywords extracted from the reference documents.

[0067] Comparison-document keyword appearance rate calculation means 61 has a function of calculating the rate at which a given keyword appears across the comparison documents. When there are N comparison documents and B of them contain the keyword, the keyword appearance rate in the comparison document is calculated as B/N.

[0068] Evaluation value calculation means 62 has a function of calculating the evaluation value of each keyword by dividing the keyword appearance rate in the reference document by the sum of the keyword appearance rate in the comparison document and a predetermined constant.

[0069] Next, the operation of the document search device 60 of the second embodiment is described, together with the document search method according to the second embodiment.

[0070] FIG. 12 is a flowchart showing the operation of the document search device 60.

[0071] First, keywords are extracted from the reference documents and stored in the keyword DB 41 (S30). Next, comparison words are extracted from the comparison documents and stored in the comparison word DB 46 (S32). The reference-document keyword appearance rate calculation means 51 then calculates the keyword appearance rate in the reference document on the basis of the keyword DB 41 (S34), and the comparison-document keyword appearance rate calculation means 61 calculates the keyword appearance rate in the comparison document on the basis of the comparison word DB 46 (S36). FIG. 13 shows an example of the calculated keyword appearance rates in the reference document and in the comparison document. Here, the keyword "polydome switch" is contained in the reference documents but in none of the comparison documents, so its keyword appearance rate in the comparison document is 0.

[0072] Next, the evaluation value calculation means 62 calculates the evaluation value of each keyword by dividing the keyword appearance rate in the reference document by the sum of the keyword appearance rate in the comparison document and a predetermined constant (0.01 here), and stores the calculated evaluation values in the evaluation value DB 43 as shown in FIG. 14 (S38). The predetermined constant is added so that, even when the keyword appearance rate in the comparison document is 0, as for the keyword "polydome switch" (see FIG. 13), the evaluation value can still be calculated by dividing by the constant. The value of the predetermined constant should be set with regard to its balance with the other evaluation values.
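The role of the constant can be seen in a small Python sketch. The constant 0.01 and the comparison-document rate of 0 for "polydome switch" follow the description; the function name and the reference-document rates used in the example are assumptions for illustration.

```python
def evaluation_value_with_constant(
    reference_rate: float, comparison_rate: float, constant: float = 0.01
) -> float:
    """reference_rate / (comparison_rate + constant), defined even when comparison_rate is 0."""
    if constant <= 0:
        raise ValueError("the constant must be positive to keep the divisor non-zero")
    return reference_rate / (comparison_rate + constant)


# "polydome switch": absent from every comparison document (rate 0); the
# reference-document rate of 1.0 is an assumed value.
polydome = evaluation_value_with_constant(reference_rate=1.0, comparison_rate=0.0)
# A general word assumed to appear in every reference and comparison document.
general = evaluation_value_with_constant(reference_rate=1.0, comparison_rate=1.0)
```

The constant also caps the largest possible evaluation value at reference_rate / constant, so a smaller constant widens the gap between keywords that are absent from the comparison documents and all other keywords. This is one way to read the remark that the constant should be balanced against the other evaluation values.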

[0073] Next, by the same method as described for the first embodiment, the search target documents are searched for documents whose contents are close to the reference documents on the basis of the evaluation values stored in the evaluation value DB 43 (S20 to S26).

[0074] In this way, an accurate document search can be performed by searching for documents on the basis of evaluation values obtained by dividing the keyword appearance rate in the reference document by the sum of the keyword appearance rate in the comparison document and a predetermined constant.

[0075] Embodiments of the present invention have been described in detail above, but the present invention is not limited to these embodiments.

[0076] In the above embodiments, when words are extracted from a comparison document or a search target document, the words extracted from a single document are kept free of duplicates. However, the same word may instead be extracted more than once from a single document. This makes it possible to weight words that are used repeatedly differently from words that are not.

[0077] In the above embodiments, the search target documents and the comparison documents are described as being different, but the search target documents and the comparison documents may be the same.

[0078] Furthermore, a general word that is known in advance to be included in almost all search target documents (in the above embodiments, for example, the keyword "problem" (課題)) may be excluded from the evaluation. That is, the apparatus may further include a general word database storing words that are included in almost all search target documents. In that case, the evaluation value aggregation means aggregates the evaluation values while excluding the words stored in the general word database, and the document evaluation value calculation means divides the total value by the number of keywords in each search target document other than the words stored in the general word database. With this configuration, the document evaluation value is no longer diluted by general words, so the accuracy of document retrieval can be further improved.
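The effect of excluding general words can be illustrated with a short Python sketch. The function name, the sample words, and the evaluation values are hypothetical. Returning 0.0 for a document with no counted keywords is also an assumption, since the text does not cover that case.

```python
def document_evaluation_value(search_words, evaluation_values, general_words=frozenset()):
    """Average evaluation value over the keywords counted in one document."""
    # Assumption: each search word is counted once per document.
    counted = [word for word in set(search_words)
               if word in evaluation_values and word not in general_words]
    if not counted:
        return 0.0  # assumption: no counted keyword means no similarity
    return sum(evaluation_values[word] for word in counted) / len(counted)


# Hypothetical evaluation values; "problem" stands for a word found in almost every document.
evaluation_values = {"problem": 1.0, "polymer": 3.0, "resin": 2.0}
words = ["problem", "polymer", "resin", "other"]
diluted = document_evaluation_value(words, evaluation_values)
filtered = document_evaluation_value(words, evaluation_values, {"problem"})
print(diluted, filtered)
```

Because the general word is removed from both the total and the divisor, the remaining keywords are averaged on their own, which is what prevents the dilution described above.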

[0079]

[Effects of the Invention] According to the present invention, evaluation value storage means is provided that stores evaluation values indicating the degree to which each keyword is uniquely included in the reference document, and the document evaluation value of each search target document is calculated on the basis of the evaluation values stored in the evaluation value storage means and the keywords included in that search target document. Documents with contents close to the reference document can therefore be retrieved from the search target documents.
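The overall retrieval flow summarized above (collate search words with keywords, total the evaluation values per document, divide by the number of matched keywords, and compare with a reference value) can be sketched as follows. This is a minimal sketch, not the patented implementation: the function name, the document codes, the sample data, and the sorted return value are all assumptions made for illustration.

```python
from collections import defaultdict


def extract_close_documents(search_word_rows, evaluation_values, reference_value):
    """Return the codes of documents whose document evaluation value exceeds reference_value."""
    totals = defaultdict(float)      # total evaluation value per document code
    keyword_counts = defaultdict(int)  # number of matched keywords per document code
    for document_code, word in search_word_rows:
        if word in evaluation_values:  # the search word matches a stored keyword
            totals[document_code] += evaluation_values[word]
            keyword_counts[document_code] += 1
    scores = {code: totals[code] / keyword_counts[code] for code in totals}
    return sorted(code for code, score in scores.items() if score > reference_value)


# Hypothetical search word table: (search target document code, search word).
rows = [("D1", "polymer"), ("D1", "resin"), ("D1", "other"),
        ("D2", "problem"), ("D2", "resin"),
        ("D3", "other")]
keyword_values = {"polymer": 3.0, "resin": 2.0, "problem": 1.0}
close_documents = extract_close_documents(rows, keyword_values, reference_value=2.0)
print(close_documents)
```

In this sample, D1 averages 2.5, D2 averages 1.5, and D3 contains no keyword, so only D1 exceeds the reference value of 2.0.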

[0080] Further, according to the present invention, the evaluation value of each keyword included in the reference document is calculated by dividing the keyword appearance rate in the reference document, which is the proportion of reference documents in which the keyword appears, by the keyword appearance rate in all documents, which is the proportion in which the keyword appears among all documents, that is, the reference documents and the comparison documents (the documents referred to when determining the evaluation values) combined. The degree to which a keyword is uniquely included in the reference document can therefore be used as its evaluation value.
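The ratio described above can be worked through in Python. The function and parameter names and the document counts are hypothetical. The sketch assumes that each word is stored at most once per document, so every count is a number of documents, and that "comparison documents" means the documents referred to when determining the evaluation values.

```python
def evaluation_value_all_documents(keyword_count, num_reference_docs,
                                   matching_reference_word_count, num_comparison_docs):
    """Reference-document appearance rate divided by all-document appearance rate."""
    if keyword_count < 1 or num_reference_docs < 1 or num_comparison_docs < 0:
        # A keyword is taken from a reference document, so it appears at least once there.
        raise ValueError("a keyword must appear in at least one reference document")
    rate_in_reference = keyword_count / num_reference_docs
    rate_in_all = ((keyword_count + matching_reference_word_count)
                   / (num_reference_docs + num_comparison_docs))
    return rate_in_reference / rate_in_all


# Hypothetical counts: 2 reference documents and 6 comparison documents.
unique_keyword = evaluation_value_all_documents(2, 2, 0, 6)  # absent from comparison documents
common_keyword = evaluation_value_all_documents(2, 2, 6, 6)  # present in every document
print(unique_keyword, common_keyword)
```

A keyword that appears everywhere scores 1.0, while a keyword confined to the reference documents scores higher, which matches the idea of measuring how uniquely the keyword belongs to the reference document.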

[Brief description of the drawings]

FIG. 1 is a block diagram showing a document search apparatus according to an embodiment.

FIG. 2 is a flowchart showing the operation of the document search apparatus.

FIG. 3 is a diagram showing an example of a reference document.

FIG. 4 is a diagram showing an example of the keyword DB.

FIG. 5 is a diagram showing an example of the all-word DB.

FIG. 6 is a diagram showing an example of calculated keyword appearance rates.

FIG. 7 is a diagram showing an example of the evaluation value DB.

FIG. 8 is a diagram showing an example of the search word DB.

FIG. 9 is a diagram showing an example of calculated document evaluation values.

FIG. 10 is a block diagram showing a document search apparatus according to a second embodiment.

FIG. 11 is a diagram showing an example of the reference word DB.

FIG. 12 is a flowchart showing the operation of the document search apparatus according to the second embodiment.

FIG. 13 is a diagram showing an example of calculated keyword appearance rates.

FIG. 14 is a diagram showing an example of the evaluation value DB.

[Explanation of symbols]

10: document search apparatus; 20: input means; 21: reference document input means; 22: comparison document input means (for the documents referred to when determining the evaluation values); 23: search target document input means; 30: word extraction means; 31: keyword extraction means; 32: reference word extraction means; 33: search word extraction means; 40: database; 41: keyword database; 42: all-word database; 43: evaluation value database; 44: search word database; 51: reference-document keyword appearance rate calculation means; 52: all-document keyword appearance rate calculation means; 53: evaluation value calculation means; 54: evaluation value aggregation means; 55: document evaluation value calculation means; 56: search document extraction means.

Claims


1. A document search apparatus for searching a plurality of search target documents for documents having contents close to a reference document serving as a reference, comprising: evaluation value storage means for storing a plurality of keywords included in the reference document together with evaluation values each indicating the degree to which the corresponding keyword is uniquely included in the reference document; search word extraction means for extracting all words included in the plurality of search target documents as search words; search word storage means for storing each search word extracted by the search word extraction means in association with a search target document code that identifies the search target document from which it was extracted; evaluation value aggregation means for collating the search words stored in the search word storage means with the keywords stored in the evaluation value storage means, assigning the evaluation value of a keyword to a search word when the search word matches that keyword, and aggregating the evaluation values assigned to the search words on the basis of the search target document codes to calculate a total value for each search target document; document evaluation value calculation means for calculating a document evaluation value of each search target document by dividing the total value of the evaluation values for that search target document, calculated by the evaluation value aggregation means, by the number of the keywords included in that search target document; and search document extraction means for comparing the document evaluation values calculated by the document evaluation value calculation means with a preset reference value and extracting the search target documents whose document evaluation values are larger than the reference value.

2. The document search apparatus according to claim 1, wherein the search word extraction means does not redundantly extract the same search word from one search target document.

3. The document search apparatus according to claim 1, wherein the evaluation value storage means does not store the same keyword in duplicate.

4. The document search apparatus according to claim 1, wherein the keywords are extracted using hiragana, punctuation marks, special symbols, and spaces included in the reference document as delimiters.

5. The document search apparatus according to any one of the preceding claims, wherein the search words are extracted using hiragana, punctuation marks, special symbols, and spaces included in the search target documents as delimiters.

6. The document search apparatus according to any one of claims 1 to 5, further comprising: keyword extraction means for extracting all words as keywords from at least one reference document; keyword storage means for storing the keywords extracted from each reference document by the keyword extraction means; reference word extraction means for extracting all words as reference words from a plurality of comparison documents that are referred to in determining the evaluation values of the keywords; all-word storage means for storing the keywords extracted from each reference document by the keyword extraction means and the reference words extracted from each comparison document by the reference word extraction means; reference-document keyword appearance rate calculation means for calculating, for each keyword, a keyword appearance rate in the reference document by dividing the number of occurrences of that keyword in the keyword storage means by the number of reference documents; all-document keyword appearance rate calculation means for calculating, for each keyword, a keyword appearance rate in all documents by dividing the sum of the number of occurrences of that keyword and the number of reference words identical to that keyword in the all-word storage means by the sum of the number of reference documents and the number of comparison documents; and evaluation value calculation means for calculating the evaluation value of each keyword by dividing the keyword appearance rate in the reference document by the keyword appearance rate in all documents, wherein the evaluation value calculated by the evaluation value calculation means is stored in the evaluation value storage means.

7. The document search apparatus according to any one of claims 1 to 5, further comprising: keyword extraction means for extracting all words as keywords from at least one reference document; keyword storage means for storing the keywords extracted from each reference document by the keyword extraction means; reference word extraction means for extracting all words as reference words from a plurality of comparison documents that are referred to in determining the evaluation values of the keywords; reference word storage means for storing the reference words extracted from each comparison document by the reference word extraction means; reference-document keyword appearance rate calculation means for calculating, for each keyword, a keyword appearance rate in the reference document by dividing the number of occurrences of that keyword in the keyword storage means by the number of reference documents; comparison-document keyword appearance rate calculation means for calculating, for each keyword, a keyword appearance rate in the comparison documents by dividing the number of reference words in the reference word storage means that are identical to that keyword by the number of comparison documents; and evaluation value calculation means for calculating the evaluation value of each keyword by dividing the keyword appearance rate in the reference document by the sum of the keyword appearance rate in the comparison documents and a predetermined constant, wherein the evaluation value calculated by the evaluation value calculation means is stored in the evaluation value storage means.

8. The document search apparatus according to claim 6, wherein the keyword extraction means does not redundantly extract the same keyword from one reference document.

9. The document search apparatus according to claim 6, wherein the reference word extraction means does not redundantly extract the same reference word from any one of the comparison documents referred to in determining the evaluation values.

10. A document search method for searching a plurality of search target documents for documents having contents close to a reference document serving as a reference, comprising: an evaluation value storing step of storing a plurality of keywords included in the reference document together with evaluation values each indicating the degree to which the corresponding keyword is uniquely included in the reference document; a search word extracting step of extracting all words included in the plurality of search target documents as search words; a search word storing step of storing each search word extracted in the search word extracting step in association with a search target document code that identifies the search target document from which it was extracted; an evaluation value aggregating step of collating the search words stored in the search word storing step with the keywords stored in the evaluation value storing step, assigning the evaluation value of a keyword to a search word when the search word matches that keyword, and aggregating the evaluation values assigned to the search words on the basis of the search target document codes to calculate a total value for each search target document; a document evaluation value calculating step of calculating a document evaluation value of each search target document by dividing the total value of the evaluation values for that search target document, calculated in the evaluation value aggregating step, by the number of the keywords included in that search target document; and a search document extracting step of comparing the document evaluation values calculated in the document evaluation value calculating step with a preset reference value and extracting the search target documents whose document evaluation values are larger than the reference value.

11. The document search method according to claim 10, wherein the search word extracting step does not redundantly extract the same search word from one search target document.

12. The document search method according to claim 10, wherein the evaluation value storing step does not store the same keyword redundantly.

13. The document search method according to claim 10, wherein the keywords are extracted using hiragana, punctuation marks, special symbols, and spaces included in the reference document as delimiters.

14. The document search method according to claim 10, wherein the search words are extracted using hiragana, punctuation marks, special symbols, and spaces included in the search target documents as delimiters.

15. The document search method according to claim 10, further comprising: a keyword extracting step of extracting all words as keywords from at least one reference document; a keyword storing step of storing, in keyword storage means, the keywords extracted from each reference document in the keyword extracting step; a reference word extracting step of extracting all words as reference words from a plurality of comparison documents that are referred to in determining the evaluation values of the keywords; an all-word storing step of storing, in all-word storage means, the keywords extracted from each reference document in the keyword extracting step and the reference words extracted from each comparison document in the reference word extracting step; a reference-document keyword appearance rate calculating step of calculating, for each keyword, a keyword appearance rate in the reference document by dividing the number of occurrences of that keyword in the keyword storage means by the number of reference documents; an all-document keyword appearance rate calculating step of calculating, for each keyword, a keyword appearance rate in all documents by dividing the sum of the number of occurrences of that keyword and the number of reference words identical to that keyword in the all-word storage means by the sum of the number of reference documents and the number of comparison documents; and an evaluation value calculating step of calculating the evaluation value of each keyword by dividing the keyword appearance rate in the reference document by the keyword appearance rate in all documents, wherein the evaluation value calculated in the evaluation value calculating step is stored in the evaluation value storing step.

16. The document search method according to any one of claims 10 to 14, further comprising: a keyword extracting step of extracting all words as keywords from at least one reference document; a keyword storing step of storing, in keyword storage means, the keywords extracted from each reference document in the keyword extracting step; a reference word extracting step of extracting all words as reference words from a plurality of comparison documents that are referred to in determining the evaluation values of the keywords; a reference word storing step of storing, in reference word storage means, the reference words extracted from each comparison document in the reference word extracting step; a reference-document keyword appearance rate calculating step of calculating, for each keyword, a keyword appearance rate in the reference document by dividing the number of occurrences of that keyword in the keyword storage means by the number of reference documents; a comparison-document keyword appearance rate calculating step of calculating, for each keyword, a keyword appearance rate in the comparison documents by dividing the number of reference words in the reference word storage means that are identical to that keyword by the number of comparison documents; and an evaluation value calculating step of calculating the evaluation value of each keyword by dividing the keyword appearance rate in the reference document by the sum of the keyword appearance rate in the comparison documents and a predetermined constant, wherein the evaluation value calculated in the evaluation value calculating step is stored in the evaluation value storing step.

17. The document search method according to claim 15, wherein the keyword extracting step does not redundantly extract the same keyword from one reference document.

18. The document search method according to claim 15, wherein the reference word extracting step does not redundantly extract the same reference word from any one of the comparison documents referred to in determining the evaluation values.