JP2008071198A - Document retrieval device, document retrieval method, document retrieval program and storage medium

Info

Publication number: JP2008071198A
Application number: JP2006250049A
Authority: JP
Inventors: Hiroo Hayano; 浩生早野; Tetsuya Ikeda; 哲也池田; Takuya Hiraoka; 卓也平岡; Shiro Horibe; 史郎堀部
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2006-09-14
Filing date: 2006-09-14
Publication date: 2008-03-27
Anticipated expiration: 2026-09-14
Also published as: JP4933869B2

Abstract

PROBLEM TO BE SOLVED: To provide a document retrieval device that outputs appropriate retrieval results for a retrieval request without unnecessarily limiting the selection of seed documents or expansion words.

SOLUTION: The document retrieval device 10, which retrieves documents from a document database 15 based on an input retrieval condition, comprises: a seed document acquisition means 12 for acquiring a seed document based on an input seed document acquisition character string; a relevant document acquisition means for acquiring, as relevant documents, other documents used by users of the seed document; a word extraction means 13 for extracting words relevant to the retrieval condition from the seed document and the relevant documents; and a retrieval means 14 for retrieving documents based on the retrieval condition and the words extracted by the word extraction means 13. The word extraction means 13 determines the relevancy of each word to a predetermined keyword based on the distance between the word and the keyword, and extracts a predetermined number of words in descending order of relevancy.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a document search device, a document search method, a document search program, and a recording medium, and in particular to those that search a predetermined set of documents for documents matching an input search condition.

In the field of document search, whether the search results match the search request of the user (searcher) is one of the important evaluation criteria. Document search devices have been proposed that compute, for each document, the degree to which it matches the search request (hereinafter, the "degree of match") based on the search terms specified in the request, and output the search results in descending order of the degree of match (for example, Patent Document 1).

There is also a technique for obtaining high-quality search results in which not only the search terms the user specified in the search request but also related terms are added as search terms (hereinafter, "related-word expansion"). Various proposals have been made for how to select the search terms added by related-word expansion (hereinafter, "expansion words").

For example, a technique called relevance feedback is known. It first presents the user with the results of a search using the search terms the user specified (the primary search) and has the user classify the presented documents into relevant documents (documents the user wants) and non-relevant documents. It then outputs, as the final result, the results of a search (the secondary search) using expansion words selected from the words contained in the relevant documents. Hereinafter, a document used for selecting expansion words is called a "seed document".

To reduce the burden that relevance feedback places on the user, there is also a technique called pseudo-relevance feedback, which obtains expansion words by using the top-ranked documents of the primary search results as seed documents.

However, in conventional relevance feedback and pseudo-relevance feedback as described above, the seed documents are selected from the set of documents being searched (the results of the primary search). The selection of expansion words is therefore constrained by the primary search results, which can degrade the quality of the final search results.

Several techniques have been proposed to compensate for this drawback. For example, in Patent Document 2, the result of the degree-of-match calculation of the primary search is fed back into the degree-of-match calculation of the secondary search, which reduces the adverse effect on the quality of the final search results even when the quality of the primary search results is poor.

In Patent Document 3, the seed documents obtained from the primary search are divided into multiple groups based on bibliographic items such as author and date, and expansion words are selected from diverse viewpoints, which improves the quality of the final search results.

Meanwhile, a technique has also been proposed in which related words are registered in advance for each word and related-word expansion is performed based on that correspondence. For example, Patent Document 4 proposes registering related words in the form of a co-occurrence word database.

Japanese Patent Laid-Open No. 11-224264
JP 2003-242170 A
JP 2004-192374 A
JP 2003-022275 A

However, the document search devices of Patent Documents 2 and 3 are still strongly affected by the primary search when selecting seed documents. The document search device of Patent Document 4 requires the correspondences for expansion words to be registered in advance, so those correspondences must be maintained, which makes the device difficult to apply in fields where new terms are added continually.

The present invention has been made in view of the above, and its object is to provide a document search device, a document search method, a document search program, and a recording medium that can output appropriate search results for a search request without placing unnecessary restrictions on the selection of seed documents or expansion words.

To achieve the above object, the document search device according to the first invention is a document search device that searches a predetermined document database for documents based on an input search condition, comprising: seed document acquisition means for acquiring seed documents based on an input seed document acquisition character string; related document acquisition means for acquiring, as related documents, other documents used by users of the seed documents acquired by the seed document acquisition means; word extraction means for extracting words related to the search condition from the seed documents and the related documents; and search means for searching for documents based on the search condition and the words extracted by the word extraction means.

The second invention is the document search device according to the first invention, wherein the related documents include, in addition to other documents used by users of the seed documents, other documents borrowed by borrowers of the seed documents, other documents purchased by purchasers of the seed documents, or other documents viewed by viewers of the seed documents.

The document search device according to the third invention is a document search device that searches a predetermined document database for documents based on an input search condition, comprising: seed document acquisition means for acquiring seed documents based on an input seed document acquisition character string; word extraction means for extracting words related to the search condition from the seed documents acquired by the seed document acquisition means; and search means for searching for documents based on the search condition and the words extracted by the word extraction means, wherein the word extraction means determines the relevance of each word to a predetermined keyword based on the distance between the word and the keyword, and extracts a predetermined number of words in descending order of relevance.

The fourth invention is the document search device according to the third invention, wherein the word extraction means determines the relevance of each word to the keyword based on, in addition to the distance between the word and the predetermined keyword, the frequency of occurrence of the word or the number of seed documents containing the word, and extracts a predetermined number of words in descending order of relevance.

The fifth invention is the document search device according to the third or fourth invention, further comprising rate-of-decrease setting means for setting the rate at which relevance decreases as the distance between a word and the predetermined keyword increases.

The sixth invention is the document search device according to the fifth invention, wherein the rate of decrease changes from sentence to sentence.
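The distance-based word extraction of the third and fifth inventions can be illustrated with a minimal Python sketch. The text does not give a formula, so everything specific here is an assumption made for illustration: the function name `extract_by_distance`, measuring distance in token positions, the exponential form `(1 - rate) ** distance`, and a pre-tokenized input standing in for the result of morphological analysis.

```python
def extract_by_distance(tokens, keyword, top_n=3, rate=0.2):
    """Rank words of a tokenized seed document by proximity to `keyword`.

    Assumed scoring (not from the source): a word's relevance is
    (1 - rate) ** d, where d is the distance in tokens to the nearest
    occurrence of the keyword. `rate` plays the role of a settable
    rate of decrease.
    """
    keyword_positions = [i for i, t in enumerate(tokens) if t == keyword]
    if not keyword_positions:
        return []

    relevance = {}
    for i, token in enumerate(tokens):
        if token == keyword:
            continue
        distance = min(abs(i - k) for k in keyword_positions)
        score = (1.0 - rate) ** distance
        # Keep the best (closest) score seen for each distinct word.
        relevance[token] = max(score, relevance.get(token, 0.0))

    # Stable sort: ties keep the order in which words first appeared.
    ranked = sorted(relevance.items(), key=lambda item: -item[1])
    return [word for word, _ in ranked[:top_n]]


tokens = ["global", "warming", "raises", "sea", "level",
          "and", "warming", "harms", "coral"]
print(extract_by_distance(tokens, "warming", top_n=3))
```

A larger `rate` makes relevance fall off faster with distance. A per-sentence rate, as in the sixth invention, would need sentence boundaries in the input, which this sketch does not model.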

The document search method according to the seventh invention is a document search method for searching a predetermined document database for documents based on an input search condition, comprising: a seed document acquisition step of acquiring seed documents based on an input seed document acquisition character string; a related document acquisition step of acquiring, as related documents, other documents used by users of the seed documents acquired in the seed document acquisition step; a word extraction step of extracting words related to the search condition from the seed documents and the related documents; and a search step of searching for documents based on the search condition and the words extracted in the word extraction step.

The document search method according to the eighth invention is a document search method for searching a predetermined document database for documents based on an input search condition, comprising: a seed document acquisition step of acquiring seed documents based on an input seed document acquisition character string; a word extraction step of extracting words related to the search condition from the seed documents acquired in the seed document acquisition step; and a search step of searching for documents based on the search condition and the words extracted in the word extraction step, wherein the word extraction step determines the relevance of each word to a predetermined keyword based on the distance between the word and the keyword, and extracts a predetermined number of words in descending order of relevance.

The document search program according to the ninth invention is a document search program that causes a computer to search a predetermined document database for documents based on an input search condition, comprising: a seed document acquisition step of acquiring seed documents based on an input seed document acquisition character string; a related document acquisition step of acquiring, as related documents, other documents used by users of the seed documents acquired in the seed document acquisition step; a word extraction step of extracting words related to the search condition from the seed documents and the related documents; and a search step of searching for documents based on the search condition and the words extracted in the word extraction step.

The document search program according to the tenth invention is a document search program that causes a computer to search a predetermined document database for documents based on an input search condition, comprising: a seed document acquisition step of acquiring seed documents based on an input seed document acquisition character string; a word extraction step of extracting words related to the search condition from the seed documents acquired in the seed document acquisition step; and a search step of searching for documents based on the search condition and the words extracted in the word extraction step, wherein the word extraction step determines the relevance of each word to a predetermined keyword based on the distance between the word and the keyword, and extracts a predetermined number of words in descending order of relevance.

The recording medium according to the eleventh invention is a computer-readable recording medium on which the document search program according to the ninth or tenth invention is recorded.

According to the present invention, it is possible to provide a document search device, a document search method, a document search program, and a recording medium that can output appropriate search results for a search request without placing unnecessary restrictions on the selection of seed documents or expansion words.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a diagram showing an example of the functional configuration of a document search device according to an embodiment of the present invention. In FIG. 1, the document search device 10 comprises a search request input unit 11, a seed document acquisition unit 12, an expansion word extraction unit 13, a document search unit 14, and a document database 15.

The search request input unit 11 displays a search request input screen and obtains the search condition by having the user enter search terms, a search character string, a search expression, or the like for finding the desired documents.

The search request input unit 11 also has the user enter a word, character string, sentence, or the like that represents the content of the desired documents, and from it obtains the seed document acquisition character string (a word, compound word, sentence, or the like) used to acquire seed documents.

FIG. 2 shows a display example of the search request input screen. The search request input screen 110 comprises a search condition input area 111, a seed document acquisition character string input area 112, a seed number input area 113, and a search button 114.

The search condition input area 111 is a text box for entering the search condition. The seed document acquisition character string input area 112 is a text box for entering the seed document acquisition character string or a sentence containing it.

Natural-language text, such as colloquial speech, may be entered in the seed document acquisition character string input area 112. In that case, the search request input unit 11 extracts the seed document acquisition character string from the entered text by morphological analysis or the like.

The seed document acquisition character string input area 112 may also be filled in automatically with the character string of highest relevance extracted from the search results (a set of documents) for the search condition entered in the search condition input area 111 (for example, the character string that occurs most frequently in the retrieved documents), or with a character string the user selects freely from those search results.

The seed number input area 113 is a text box for entering the maximum number of seed documents to be acquired with the seed document acquisition character string. For example, if "10" is entered as the maximum, then even if 100 documents are retrieved based on the seed document acquisition character string, only 10 of those 100 are used as seed documents, chosen based on, for example, the frequency of occurrence of the seed document acquisition character string.
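The cap on the number of seed documents can be sketched as follows. This is only an illustration under stated assumptions: the function name `pick_seed_documents`, the dictionary of document texts, and ranking purely by substring count are hypothetical choices. The text says only that the selection is based on "the frequency of occurrence of the seed document acquisition character string, etc."

```python
def pick_seed_documents(documents, seed_string, max_seeds):
    """Return up to `max_seeds` document IDs, most frequent matches first.

    `documents` maps a document ID to its text (an assumed data layout).
    Documents that do not contain `seed_string` are never selected.
    """
    matches = []
    for doc_id, text in documents.items():
        count = text.count(seed_string)  # non-overlapping occurrences
        if count > 0:
            matches.append((doc_id, count))

    # Highest count first; the stable sort keeps input order for ties.
    matches.sort(key=lambda match: -match[1])
    return [doc_id for doc_id, _ in matches[:max_seeds]]


documents = {
    "a": "warming warming",
    "b": "warming",
    "c": "unrelated text",
    "d": "warming warming warming",
}
print(pick_seed_documents(documents, "warming", max_seeds=2))
```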

The search button 114 starts the document search. When the search button 114 is pressed, seed documents are extracted based on the seed document acquisition character string, expansion words are extracted from the seed documents, and documents are searched for based on the search condition and the expansion words.

The seed document acquisition unit 12 acquires seed documents based on the seed document acquisition character string obtained by the search request input unit 11.

The seed document acquisition unit 12 also performs a primary search based on the seed document acquisition character string obtained by the search request input unit 11, and acquires, as related documents, other documents used by the users of the documents found by that primary search.

This is because adding the related documents to the seed documents increases the number of seed documents and makes the expansion words extracted from them more appropriate, and because users of a seed document are likely to have used documents similar in content to it.

The expansion word extraction unit 13 selects a predetermined number of expansion words from the words that make up the seed documents. It may instead select them from the words that make up both the seed documents and the related documents.

For example, the expansion word extraction unit 13 extracts all words contained in the seed documents by morphological analysis, calculates the frequency of occurrence of each word in the seed documents, and extracts a predetermined number of words (for example, five) as expansion words in descending order of frequency.
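The frequency-based extraction just described can be sketched in Python. The sketch assumes whitespace splitting as a stand-in for morphological analysis (Japanese text would need a real morphological analyzer), and the function name `extract_expansion_words` is hypothetical.

```python
from collections import Counter


def extract_expansion_words(seed_documents, count=5):
    """Return the `count` most frequent words across the seed documents.

    Whitespace splitting is an assumed substitute for the morphological
    analysis mentioned in the text.
    """
    frequencies = Counter()
    for text in seed_documents:
        frequencies.update(text.lower().split())
    # most_common() sorts by count, highest first.
    return [word for word, _ in frequencies.most_common(count)]


seeds = ["carbon emissions climate", "climate carbon climate"]
print(extract_expansion_words(seeds, count=2))
```

In practice a stop-word filter would probably be needed so that function words do not dominate the ranking. The text does not mention one, so none is applied here.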

The document search unit 14 searches the set of documents stored in the document database 15 (hereinafter, the "searched documents") for matching documents, based on the search condition and the expansion words extracted by the expansion word extraction unit 13, and presents a list of the search results to the user. For example, it may search for documents that contain both the search condition and the expansion words, or for documents that contain either the search condition or the expansion words.

The document search unit 14 may also search for documents that contain all of the expansion words, or for documents that contain at least a predetermined number (for example, three) of the expansion words.
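The matching policies described above (search condition AND or OR expansion words, and all versus at least a given number of expansion words) can be expressed as a single predicate. This is a sketch under assumptions: simple substring matching, and the hypothetical name `document_matches` with its two parameters `min_expansion_hits` and `require_condition`.

```python
def document_matches(text, condition, expansion_words,
                     min_expansion_hits=1, require_condition=True):
    """Decide whether one document matches the expanded search request.

    require_condition=True  -> condition AND enough expansion words
    require_condition=False -> condition OR enough expansion words
    Setting min_expansion_hits=len(expansion_words) requires all of them.
    """
    hits = sum(1 for word in expansion_words if word in text)
    enough_hits = hits >= min_expansion_hits
    has_condition = condition in text
    if require_condition:
        return has_condition and enough_hits
    return has_condition or enough_hits


text = "policies for environmental protection cut carbon emissions"
words = ["carbon", "emissions", "glacier"]
print(document_matches(text, "environmental protection", words,
                       min_expansion_hits=2))
```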

The document database 15 is a database in which the searched documents are stored.

The document search device 10 may be a single computer, or may consist of multiple computers in, for example, a client-server arrangement. In the latter case, for example, the search request input unit 11 may be implemented on the client, and the seed document acquisition unit 12, the expansion word extraction unit 13, the document search unit 14, and the document database 15 may be implemented on the server.

FIG. 3 is a diagram showing an example of the hardware configuration of the document search device 10 according to the embodiment of the present invention. The document search device 10 in FIG. 3 comprises a drive device 100, an auxiliary storage device 102, a memory device 103, an arithmetic processing unit 104, a display device 105, and an input device 106.

The drive device 100 reads programs and other data recorded on the recording medium 101.

The recording medium 101 is a portable recording medium for recording various data, for example a CD-ROM or DVD-ROM.

The auxiliary storage device 102 is a non-volatile storage medium, for example a hard disk, that stores the programs for executing the various processes in the document search device 10. When a recording medium 101 on which a program is recorded is set in the drive device 100, the document search device 10 reads the program from the recording medium 101 and installs it in the auxiliary storage device 102.

The memory device 103 is a volatile storage medium, for example RAM (Random Access Memory), into which the programs for executing the various processes in the document search device 10 are loaded. When an instruction to start a program is given, the document search device 10 reads the program from the auxiliary storage device 102 and loads it into the memory device 103.

The arithmetic processing unit 104 executes, in sequence, the program loaded into the memory device 103.

The display device 105 displays the GUI (Graphical User Interface) and other output of the program. The input device 106 consists of a keyboard, a mouse, and the like, and accepts various operation instructions.

Next, the processing procedure in the document search device 10 is described with reference to FIG. 4. FIG. 4 is a flowchart for explaining the document search processing performed by the document search device 10 in the first embodiment.

First, the search request input unit 11 displays the search request input screen 110 on the display device 105 and has the user enter a search request (step S101).

When the search condition, the seed document acquisition character string or a sentence containing it, the maximum number of seed documents, and so on have been entered and the search button 114 is clicked, the seed document acquisition unit 12 splits the text entered in the seed document acquisition character string input area 112 into words by morphological analysis (step S102).

Next, the seed document acquisition unit 12 calculates, for each word, its frequency of occurrence in the searched documents (step S103).

Next, the seed document acquisition unit 12 selects the word with the highest frequency of occurrence (step S104) and, based on the selected word, the search condition entered in the search condition input area 111, and the maximum number of seed documents, generates a statement representing a search request to the document database 15 (step S105). The seed document acquisition unit 12 may instead select multiple words in descending order of frequency.
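Steps S102 to S105 can be traced with a small Python sketch. It is an illustration only: whitespace splitting stands in for morphological analysis, the searched documents are a plain list of strings, and the names `choose_seed_word` and `build_search_request` are hypothetical. The sketch also returns the request as a dictionary of parameters rather than generating a query statement.

```python
def choose_seed_word(seed_text, searched_documents):
    """Pick the word of `seed_text` that is most frequent in the corpus."""
    words = seed_text.split()              # S102 (assumed tokenizer)
    if not words:
        raise ValueError("seed text contains no words")

    def frequency(word):                   # S103: occurrences in the corpus
        return sum(text.count(word) for text in searched_documents)

    # S104: max() returns the first word when several share the top count.
    return max(words, key=frequency)


def build_search_request(condition, seed_text, max_seeds, searched_documents):
    """S105: gather the parameters a query generator would need."""
    return {
        "condition": condition,
        "seed_word": choose_seed_word(seed_text, searched_documents),
        "max_seeds": max_seeds,
    }


corpus = ["warming melts glaciers", "warming and warming again",
          "glaciers retreat"]
print(build_search_request("environmental protection",
                           "how does warming affect glaciers", 10, corpus))
```

Keeping the request as structured parameters until the last moment lets a real implementation bind them safely into whatever query syntax the database accepts.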

The statement representing the search request is written in standard SQL (Structured Query Language) syntax or an extension of it, for example the following extended syntax using subqueries.

select タイトル from ドキュメント where 本文 contains '環境保護' expand from (select タイトル from ドキュメント where 文書ＩＤ in (select 文書ＩＤ from 文書ＩＤ履歴 where 利用者ＩＤ in (select 第一利用者ＩＤ from 利用者ＩＤ履歴 where 文書ＩＤ in (select 文書ＩＤ from ドキュメント where 本文 contains '温暖化' limit 10))))

With the identifiers and literals translated into English, the statement reads:

select title from document where body contains 'environmental protection' expand from (select title from document where document ID in (select document ID from document ID history where user ID in (select first user ID from user ID history where document ID in (select document ID from document where body contains 'global warming' limit 10))))

For explanation, the statement is divided into the following parts.

select タイトル from ドキュメント where 本文 contains '環境保護' ... (1)
expand from ... (2)
(select タイトル from ドキュメント where 文書ＩＤ in ... (3)
(select 文書ＩＤ from 文書ＩＤ履歴 where 利用者ＩＤ in ... (4)
(select 第一利用者ＩＤ from 利用者ＩＤ履歴 where 文書ＩＤ in ... (5)
(select 文書ＩＤ from ドキュメント where 本文 contains '温暖化' limit 10)))) ... (6)

In English:

select title from document where body contains 'environmental protection' ... (1)
expand from ... (2)
(select title from document where document ID in ... (3)
(select document ID from document ID history where user ID in ... (4)
(select first user ID from user ID history where document ID in ... (5)
(select document ID from document where body contains 'global warming' limit 10)))) ... (6)

Part (1) is a search instruction against the document table defined in the document database 15. More precisely, it means: "From the document table, extract the titles of documents whose body contains the term 'environmental protection'."

「ドキュメントテーブル」は、文書ＩＤで特定される文書に関する各種データを体系的に構成したテーブルであり、例えば、図４（Ａ）に示すように、文書ＩＤ、タイトル、著者、出版社、翻訳者等のフィールドを有する。 The “document table” is a table that systematically forms various data related to the document specified by the document ID. For example, as shown in FIG. 4A, the document ID, title, author, publisher, translator Etc. fields.

また、「文書ＩＤ」とは、文書データベース１５に格納された文書を特定するための識別子であり、例えば、数字、記号、文字列等で表現され、ドキュメントテーブル、利用者ＩＤ履歴テーブルおよび文書ＩＤ履歴テーブルに共通する項目として用いられる。 The “document ID” is an identifier for specifying a document stored in the document database 15 and is expressed by, for example, a number, a symbol, a character string, etc., and includes a document table, a user ID history table, and a document ID. Used as a common item in the history table.

「利用者ＩＤ履歴テーブル」は、利用者の履歴を文書毎に記録したテーブルであり、例えば、文書データベース１５に格納され、図４（Ｂ）に示すように、文書ＩＤ、第一利用者ＩＤ、第二利用者ＩＤ等のフィールドを有する。 The “user ID history table” is a table in which the user history is recorded for each document. For example, the user ID history table is stored in the document database 15 and, as shown in FIG. 4B, the document ID and the first user ID. And a field such as a second user ID.

The user ID history table records, in chronological order, the users who have used each document. It is used for, for example, managing lending histories in a library, sales histories in a bookstore, or browsing histories on a website.

The "user ID" is an identifier that specifies a user. It is expressed as, for example, a number, a symbol, or a character string, and is used as a field in the user ID history table and the document ID history table. A "user" is a person who has used a document. Examples include a borrower of a document (book) when the document search device 10 is installed in a library, a purchaser of a document (book) when it is installed in a bookstore, and a viewer of a document (content) when it is installed on a website.

The first user ID is the identifier of the most recent user of the corresponding document, and the second user ID is the identifier of the user who used the document before the user indicated by the first user ID.

The "document ID history table" records, for each user, the history of documents that the user has used. It is stored in, for example, the document database 15 and, as shown in FIG. 5(C), has fields such as user ID, first document ID, and second document ID.

The document ID history table records, in chronological order, the documents used by each user. Like the user ID history table, it is used for, for example, managing lending histories in a library, sales histories in a bookstore, or browsing histories on a website.

The first document ID is the identifier of the document most recently used by the corresponding user, and the second document ID is the identifier of the document that the user used before the document indicated by the first document ID.

The outermost select statement (3) in the subquery that follows "expand from" (2) is a search command for acquiring more seed documents. More specifically, it means "extract, from the document table, the titles of records whose document ID matches a value in the search result of (4)."

Note that "expand from X" means "extract a predetermined number of extended words from the document group denoted by X."

The second outermost select statement (4) means "extract, from the document ID history table, the document IDs of records whose user ID matches a value in the search result of (5)."

The third outermost select statement (5) means "extract, from the user ID history table, the first user ID of records whose document ID matches a value in the search result of (6)."

The innermost select statement (6) means "retrieve, from the document table, the document IDs of the top 10 records whose body contains the word 'warming'." The ranking that determines the top 10 is based on, for example, the frequency with which "warming" appears in each document.

The word "warming" is a word extracted from the seed document acquisition character string, and "limit 10" indicates the maximum number of seed documents to acquire. "Environmental protection" is the search term entered as the search condition.

In other words, the above SQL syntax means the following: "In (5), retrieve the first user ID (that is, the most recent user ID) of the users of the seed documents retrieved in (6). In (4), retrieve the document IDs of documents, other than the seed documents, used by the users having the first user IDs retrieved in (5). In (3), extract the documents indicated by the document IDs retrieved in (4) as related documents. In (2), extract a predetermined number of extended words from the related documents extracted in (3). Finally, extract the titles of documents whose body contains either an extended word extracted in (2) or the search term 'environmental protection'."

For example, the document search device 10 obtains the document ID value 2 of a record whose body contains the word "warming" from the document table of FIG. 5(A), and then refers to the user ID history table to obtain the user ID history corresponding to document ID 2 (in the case of FIG. 5(B), it obtains first user ID = 3).

Next, the document search device 10 refers to the document ID history table to obtain the document ID history corresponding to user ID 3 (in the case of FIG. 5(C), it obtains first document ID = 2, second document ID = 4, and third document ID = 5).

The document search device 10 then extracts the documents whose document ID is 2, 4, or 5 as related documents, extracts extended words from the seed documents and these related documents, and searches for documents whose body contains an extracted extended word or the search term "environmental protection".
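The lookup chain of statements (6), (5), and (4) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation described in the specification: plain dictionaries stand in for the three tables, all table contents other than the values quoted above (document ID 2, first user ID 3, document IDs 2, 4, and 5) are made up, the function name `related_documents` is hypothetical, and ranking by keyword frequency is omitted.

```python
# Hypothetical in-memory stand-ins for the three tables.
documents = {
    1: "recycling of household waste",
    2: "global warming and its causes",
    4: "renewable energy policy",
    5: "forest conservation",
}
# document ID -> user IDs, most recent first (first user ID, second user ID, ...)
user_id_history = {2: [3, 1]}
# user ID -> document IDs, most recent first (first document ID, second document ID, ...)
document_id_history = {3: [2, 4, 5], 1: [1]}


def related_documents(keyword, limit=10):
    # (6) seed documents: up to `limit` documents whose body contains the keyword
    seeds = [doc_id for doc_id, body in documents.items() if keyword in body][:limit]
    # (5) first (most recent) user ID of each seed document
    first_users = [user_id_history[d][0] for d in seeds if user_id_history.get(d)]
    # (4) every document ID in the history of those users, without duplicates
    related = []
    for user in first_users:
        for doc_id in document_id_history.get(user, []):
            if doc_id not in related:
                related.append(doc_id)
    return seeds, related


seeds, related = related_documents("warming")
```

With the sample data, `seeds` holds document ID 2 and `related` holds document IDs 2, 4, and 5, matching the walkthrough above.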

Note that the document search device 10 may treat only the document indicated by the first document ID as a related document, or may treat every document in the history as a related document.

As a result, more documents become targets of extended word extraction than when only the seed documents retrieved in (6) are used, and the extracted extended words become more appropriate.

The above command statement may be entered explicitly by the user (searcher). However, for the convenience of users unfamiliar with SQL, it is preferable that the system generate the statement automatically by providing a GUI such as the search request input screen 110.

Returning to FIG. 4, the seed document acquisition unit 12 then actually acquires seed documents from the document database 15 based on the generated command statement (step S106). That is, the seed document acquisition unit 12 executes statement (6) above against the document database 15 and acquires the top 10 documents containing the keyword "warming" as seed documents.

Next, the seed document acquisition unit 12 acquires the first user ID of each seed document based on statement (5) (step S107).

The seed document acquisition unit 12 then acquires, as related documents, the documents other than the seed documents that were used by the users indicated by the first user IDs, based on statements (3) and (4) (step S108).

That is, by executing statements (3) to (5) above against the document database 15, the seed document acquisition unit 12 acquires, as related documents, the other documents used by the users of the top 10 seed documents containing the keyword "warming".

As described above, the document search device 10 uses the user IDs of the users (borrowers, purchasers, viewers, and so on) of the seed documents extracted with the seed document acquisition character string to extract the document IDs of other documents that those users have used (borrowed, purchased, viewed, and so on), and acquires the documents indicated by those document IDs as related documents.

Next, the extended word extraction unit 13 selects and extracts extended words from the seed documents and related documents acquired by the seed document acquisition unit 12.

That is, the extended word extraction unit 13 splits the seed documents and related documents into words (step S109) and calculates the document frequency of each word (step S110). Here, the "document frequency" of a word is the number of seed documents or related documents that contain the word, expressed, for example, as a ratio to the total number of seed documents. If the seed documents and related documents total 50 and a word appears in 25 of them, its document frequency is 0.5 (50%).

The extended word extraction unit 13 then selects a predetermined number of words in descending order of document frequency and extracts the selected words as extended words (step S111). The appearance frequency (the number of times a word appears in the seed documents) may be used instead of the document frequency.

The seed documents and related documents may be split into words at whitespace boundaries or by known morphological analysis. Alternatively, they may simply be split into units of a fixed number of characters.

The extended word extraction unit 13 may also implement a mechanism in which words unsuitable as extended words are registered in advance and are never extracted as extended words.

The number of words extracted as extended words may be a fixed value, or the user (searcher) may specify it through a GUI or the like provided by the search request input unit 11.
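Steps S109 to S111, together with the exclusion list and the configurable word count, can be sketched as below. This is only an illustrative sketch: it assumes the documents have already been split into words, the function name `extract_extended_words` and its parameters are hypothetical, and breaking ties alphabetically is an arbitrary choice that the specification does not address.

```python
from collections import Counter


def extract_extended_words(docs, top_n, stop_words=frozenset()):
    """Return the top_n (word, document frequency) pairs.

    docs: list of word lists, one per seed or related document.
    stop_words: words registered in advance as unsuitable (assumed input).
    """
    df = Counter()
    for words in docs:
        df.update(set(words))  # count each word at most once per document
    total = len(docs)
    scored = {w: c / total for w, c in df.items() if w not in stop_words}
    # Highest document frequency first; ties broken alphabetically (assumption).
    ranked = sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


docs = [
    ["warming", "co2"],
    ["warming", "ice"],
    ["co2", "warming"],
    ["forest"],
]
top_two = extract_extended_words(docs, top_n=2)
```

Here "warming" appears in 3 of 4 documents (0.75) and "co2" in 2 of 4 (0.5), so those two words are selected.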

Next, the document search unit 14 searches the set of documents in the document database 15 for documents that contain the search condition (search term) entered on the search request input screen 110 and all or some of the extended words extracted by the extended word extraction unit 13 (step S112), and presents the search results to the user. This processing may use, for example, the method described in JP-A-2003-281181.

Alternatively, the document search unit 14 may search for documents that contain the search term or all or some of the extended words.
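The "search term or extended words" variant can be illustrated with a short Python sketch. The substring matching, the sample documents, and the function name `search` are assumptions made for illustration; an actual system would query the document database 15.

```python
def search(documents, search_term, extended_words):
    """Return IDs of documents whose body contains the search term or any extended word."""
    terms = [search_term, *extended_words]
    return [doc_id for doc_id, body in documents.items()
            if any(term in body for term in terms)]


sample_documents = {
    1: "environmental protection law",
    2: "co2 emissions and warming",
    3: "cooking at home",
}
hits = search(sample_documents, "environmental protection", ["co2"])
```

Document 1 matches through the search term and document 2 through the extended word "co2", while document 3 matches neither.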

As described above, the document search device 10 in the first embodiment selects extended words based on a character string specified by the user (searcher), namely the seed document acquisition character string, so it can output high-quality search results that are closer to the user's intent.

In addition, because the document search device 10 in the first embodiment accepts the seed document acquisition character string together with the search condition, it can provide high-quality search results easily with a single input operation by the user (searcher).

Furthermore, the document search device 10 in the first embodiment adds to the seed documents those documents that share users with the documents retrieved from the seed document acquisition character string specified by the user (searcher). This enlarges the set from which extended words are extracted (the population of seed documents), so the device can provide search results that meet the user's expectations based on extended words carefully selected from a larger number of documents.

Next, a second embodiment is described. The second embodiment is characterized in that the extended word extraction unit 13 determines the relevance of each word to a keyword based on the distance between the word and the keyword, and extracts words with high relevance as extended words.

In the second embodiment, the functional configuration and hardware configuration of the document search device 10 are the same as those shown in FIG. 1 and FIG. 2, respectively.

FIG. 6 is a flowchart explaining the document search processing performed by the document search device 10 in the second embodiment. Steps S201 to S206 are the same as the corresponding processing described with the flowchart of FIG. 4.

Once the seed documents have been acquired through steps S201 to S206, the extended word extraction unit 13 splits the seed documents into words by morphological analysis or the like (step S207) and obtains the distance between each word and the keyword (step S208). When a word occurs more than once in the same document, for example, the shortest distance to the keyword is used.

The keyword may be the search term "environmental protection" entered in the search condition input area 111, or a word selected from the seed document acquisition character string (for example, "earth" or "warming").

Here, the "distance" is the interval between a word and the keyword, expressed as a number of characters, words, sentences, or the like. The smaller the distance, the higher the relevance between the word and the keyword is taken to be.
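As a rough illustration of step S208, the shortest distance between a word and the keyword within one document can be computed as below. The sketch assumes that distance is measured in words as the difference between word positions, so that the keyword itself has distance 0; the function name `shortest_distance` is hypothetical.

```python
def shortest_distance(words, word, keyword):
    """Smallest positional gap, in words, between any occurrence of
    `word` and any occurrence of `keyword`; None if either is absent."""
    word_positions = [i for i, w in enumerate(words) if w == word]
    keyword_positions = [i for i, w in enumerate(words) if w == keyword]
    if not word_positions or not keyword_positions:
        return None
    return min(abs(w - k) for w in word_positions for k in keyword_positions)


words = ["ice", "melts", "as", "warming", "raises", "sea", "level", "and", "ice", "shrinks"]
distance = shortest_distance(words, "ice", "warming")
```

The word "ice" occurs at positions 0 and 8 and "warming" at position 3, so the shortest distance is 3.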

The extended word extraction unit 13 calculates the document frequency or appearance frequency (the number of times each word appears in the seed documents) for each word, as it does when extracting a predetermined number of words in descending order of frequency. In addition, it derives a weighting coefficient from the distance between each word and the keyword, and multiplies the document frequency or appearance frequency by the weighting coefficient to determine the relevance between each word and the keyword (relevance = weighting coefficient × document frequency or appearance frequency) (step S209).

The extended word extraction unit 13 expresses the weighting coefficient as a value from 0 to 1. The coefficient is 1 for the keyword itself (distance 0), approaches 0 as the distance increases, and is 0 at or beyond a predetermined distance. The method of calculating the weighting coefficient is described later.

The extended word extraction unit 13 also expresses the document frequency or appearance frequency as a value from 0 to 1 (for the document frequency, for example, the number of documents containing the word divided by the number of seed documents).

Alternatively, the extended word extraction unit 13 may calculate the relevance by adding the weighting coefficient to the document frequency or appearance frequency and dividing the sum by 2.

Furthermore, the extended word extraction unit 13 may ignore the appearance frequency and document frequency of each word and use the weighting coefficient itself as the relevance (relevance = weighting coefficient × 1). This makes it possible to calculate an appropriate relevance even when the appearance frequency or document frequency is not meaningful because of the nature of the searched documents or of the word, for example when the searched documents are concentrated in a specific field or when the word is in general use in every field.
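The three ways of combining the weighting coefficient and the frequency can be written as one small Python function. This is a sketch for illustration only; the function name `relevance` and the `mode` labels are hypothetical, and both inputs are assumed to be already normalized to the range 0 to 1.

```python
def relevance(weight, frequency, mode="product"):
    """Combine a distance-based weighting coefficient and a frequency (both 0..1)."""
    if mode == "product":      # relevance = weighting coefficient x frequency
        return weight * frequency
    if mode == "average":      # relevance = (frequency + weighting coefficient) / 2
        return (weight + frequency) / 2
    if mode == "weight_only":  # relevance = weighting coefficient x 1
        return weight
    raise ValueError(f"unknown mode: {mode}")
```

For example, a weighting coefficient of 0.5 and a frequency of 0.5 give a relevance of 0.25 under the product rule, and a coefficient of 0.5 with a frequency of 1.0 gives 0.75 under the averaging rule.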

In this way, the document search device 10 derives the final relevance from a relevance based on the distance between the word and the keyword (a value from 0 to 1) and a relevance based on the document frequency or appearance frequency (a value from 0 to 1), and extracts extended words accordingly, so it can output high-quality search results that are closer to the intent of the user (searcher).

When the extended word extraction unit 13 determines the relevance of each word to the keyword and extracts extended words, the document search device 10 expresses the command statement for the search request in, for example, the following extended syntax using a subquery.

select タイトル from ドキュメント where 本文 contains '環境保護' expand from (select タイトル from ドキュメント where 本文 contains '温暖化' limit 10) distance factor 0.2
select title from document where text contains 'environmental protection' expand from (select title from document where text contains 'warming' limit 10) distance factor 0.2
For the purpose of explanation, the above command statement is divided into parts below.

select タイトル from ドキュメント where 本文 contains '環境保護' ・・・（７）
expand from・・・（８）
(select タイトル from ドキュメント where 本文 contains '温暖化' limit 10) ・・・（９）
distance factor 0.2・・・（１０）
select title from document where text contains 'environmental protection' ... (7)
expand from ... (8)
(select title from document where text contains 'warming' limit 10) ... (9)
distance factor 0.2 ... (10)
Part (7) is a search command against the document table defined in the document database 15. More specifically, it means "extract, from the document table, the titles of records whose body contains the term 'environmental protection'."

The select statement (9) in the subquery that follows "expand from" (8) is a search command for acquiring more seed documents. More specifically, it means "extract, from the document table, the titles of the top 10 records whose body contains the word 'warming'."

Part (10) is a command for calculating the relevance based on the distance (number of characters or number of words) between each word in the seed documents retrieved in (9) and the keyword (for example, the character string "warming"). The value "0.2" is the decrease rate of the weighting coefficient, which decreases as the distance between each word and the keyword increases.

The weighting coefficient is calculated using, for example, the formula "weighting coefficient = 1 ÷ ((distance − 1) ^ decrease rate)" or the formula "weighting coefficient = 1 − decrease rate × distance" (in the latter case, the minimum value of the weighting coefficient is 0). With either formula, the larger the decrease rate, the more steeply the weighting coefficient falls.
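Both formulas can be sketched in Python as follows. The function names are hypothetical. One point is an assumption that the text does not settle: the first formula is undefined for distances of 1 or less (division by zero, or a negative base), so the sketch simply returns 1.0 in that range.

```python
def linear_weight(distance, decrease_rate):
    # weighting coefficient = 1 - decrease rate x distance, with a minimum of 0
    return max(0.0, 1.0 - decrease_rate * distance)


def power_weight(distance, decrease_rate):
    # weighting coefficient = 1 / ((distance - 1) ^ decrease rate)
    # Assumption: the formula is undefined for distance <= 1, so 1.0 is returned there.
    if distance <= 1:
        return 1.0
    return 1.0 / ((distance - 1) ** decrease_rate)
```

With a decrease rate of 0.2, `linear_weight` reaches 0 at a distance of 5 and stays there. At the same distance, `power_weight` gives a smaller coefficient for a decrease rate of 0.8 than for 0.2, which reflects the steeper fall described above.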

The decrease rate may be a fixed value, may be entered directly on the search request input screen 110, or may be selected with radio buttons offering three levels: "High", "Middle", and "Low". When the "High", "Middle", or "Low" radio button is selected, the decrease rate is, for example, 0.8, 0.5, or 0.2, respectively.

By providing a simple way to specify the decrease rate, the document search device 10 lets the user adjust how strongly the relevance of each word to the keyword influences the extraction of extended words, which makes it more convenient for the user (searcher) to adjust the tendency (quality) of the search results.

The document search device 10 may also vary the decrease rate sentence by sentence, lowering it by a predetermined ratio (for example, 10%) for each sentence as the sentence containing the word gets farther from the sentence containing the keyword (for example, the decrease rate falls sentence by sentence as 0.2%, 0.18%, 0.162%, and so on). The predetermined ratio may be set as a second argument of "distance factor". In that case, with a decrease rate of 0.2% and a predetermined ratio of 10%, part (10) is written, for example, as "distance factor 0.2, 10".
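The sentence-by-sentence schedule in the example (0.2, 0.18, 0.162, and so on) corresponds to multiplying the decrease rate by (1 − ratio) once per sentence. A minimal sketch follows, assuming that interpretation; the function name and parameter names are hypothetical, and the values are treated as plain numbers as in "distance factor 0.2, 10".

```python
def decrease_rate_for_sentence(base_rate, step_percent, sentence_gap):
    """Decrease rate applied to a word whose sentence is `sentence_gap`
    sentences away from the sentence containing the keyword."""
    return base_rate * (1 - step_percent / 100) ** sentence_gap


# Rates for the keyword's own sentence and the next two sentences,
# rounded to avoid floating-point noise in the display.
rates = [round(decrease_rate_for_sentence(0.2, 10, gap), 3) for gap in range(3)]
```

The resulting list reproduces the sequence 0.2, 0.18, 0.162 given in the text.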

As another example, the document search device 10 sets the weighting coefficient to 1 when the word is in the same sentence as the keyword, brings the weighting coefficient closer to 0 at a predetermined rate as the sentence containing the word gets farther from the sentence containing the keyword, and sets the weighting coefficient to 0 when the two sentences are separated by a predetermined number of sentences or more.

In this way, the document search device 10 prevents a word that is far from the keyword but in the same sentence from receiving a lower relevance than a word that is close to the keyword but in a different sentence. By setting appropriate relevance values and extracting appropriate extended words, it can output high-quality search results that are closer to the intent of the user (searcher).

Returning to FIG. 6, the extended word extraction unit 13 extracts words as extended words in descending order of relevance (weighting coefficient × document frequency) (step S210). The document search device 10 then searches the set of documents in the document database 15 for documents that contain the search condition (search term) entered in the search condition input area 111 and all or some of the extended words extracted by the extended word extraction unit 13 (step S211), and presents the search results to the user.

As described above, the document search device 10 in the second embodiment determines the relevance of each word to the keyword based on the distance between the keyword and the words in the seed documents, and extracts extended words accordingly, so it can output high-quality search results that are closer to the intent of the user (searcher).

The document search device 10 in the second embodiment may also extract other documents used by the users of the seed documents as related documents, determine the relevance of each word to the keyword based on the distance between the keyword and the words in the seed documents or related documents, and extract extended words accordingly.

Although embodiments of the present invention have been described in detail above, the present invention is not limited to these specific embodiments, and various modifications and changes can be made within the scope of the gist of the invention described in the claims.

For example, in the embodiments described above, the extended word extraction unit 13 extracts a predetermined number of words as extended words in descending order of relevance, but it may instead extract every word whose relevance is at or above a predetermined value.

In the embodiments described above, the seed document acquisition unit 12 acquires all or some of the other documents used by the users of the seed documents as related documents, but it may acquire only the documents that those users used within a predetermined period. This is because documents whose dates of use are far apart in time are considered less relevant to the seed documents.

In the second embodiment, the extended word extraction unit 13 calculates the relevance based on the shortest distance (minimum value) between each word and the keyword, but it may instead use the average distance (average value), the longest distance (maximum value), or an intermediate distance (intermediate value).

A diagram showing an example of the functional configuration of the document search device in an embodiment of the present invention.
A diagram showing a display example of the search request input screen.
A diagram showing an example of the hardware configuration of the document search device in an embodiment of the present invention.
A flowchart explaining the document search processing performed by the document search device in the first embodiment.
A diagram showing the various tables used by the document search device in the first embodiment.
A flowchart explaining the document search processing performed by the document search device in the second embodiment.

Explanation of symbols

10 Document search device
11 Search request input unit
12 Seed document acquisition unit
13 Extended word extraction unit
14 Document search unit
15 Document database
100 Drive device
101 Recording medium
102 Auxiliary storage device
103 Memory device
104 Arithmetic processing device
105 Display device
106 Input device
110 Search request input screen
111 Search condition input area
112 Seed document acquisition character string input area
113 Seed number input area
114 Search button

Claims

A document search device that searches a predetermined document database for documents based on an input search condition, comprising:
seed document acquisition means for acquiring a seed document based on an input seed document acquisition character string;
related document acquisition means for acquiring, as related documents, other documents used by a user of the seed document acquired by the seed document acquisition means;
word extraction means for extracting words related to the search condition from the seed document and the related documents; and
search means for searching for documents based on the search condition and the words extracted by the word extraction means.

The document search apparatus according to claim 1, wherein the related documents include, in addition to other documents used by a user of the seed document, other documents borrowed by a borrower of the seed document, other documents purchased by a purchaser of the seed document, or other documents viewed by a viewer of the seed document.
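
The related document acquisition recited above can be illustrated with a minimal sketch. It assumes usage (use, borrowing, purchase, or viewing) is recorded as simple (user, document) pairs; the claims do not specify a storage format, and the names `related_documents` and `usage_log` are hypothetical.

```python
from collections import defaultdict

def related_documents(seed_ids, usage_log):
    """Return the other documents used by anyone who used a seed document.

    usage_log is a hypothetical iterable of (user_id, doc_id) pairs; the
    source does not say how usage records are actually stored.
    """
    seeds = set(seed_ids)
    docs_by_user = defaultdict(set)
    for user_id, doc_id in usage_log:
        docs_by_user[user_id].add(doc_id)

    related = set()
    for docs in docs_by_user.values():
        if docs & seeds:             # this user used at least one seed document
            related |= docs - seeds  # keep only the non-seed documents
    return related

# Illustrative data: u1 and u3 both used the seed document d1.
log = [("u1", "d1"), ("u1", "d2"), ("u2", "d3"), ("u3", "d1"), ("u3", "d4")]
print(sorted(related_documents(["d1"], log)))  # ['d2', 'd4']
```

Document d3 is excluded because its only user never used the seed document.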

A document search apparatus for searching for a document in a predetermined document database based on an input search condition, the apparatus comprising:
seed document acquisition means for acquiring a seed document based on an input seed document acquisition character string;
word extraction means for extracting words related to the search condition from the seed document acquired by the seed document acquisition means; and
search means for searching for a document based on the search condition and the words extracted by the word extraction means,
wherein the word extraction means determines a degree of association of each word with a predetermined keyword based on the distance between that word and the predetermined keyword, and extracts a predetermined number of words in descending order of the degree of association.
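
As an illustration of distance-based word extraction, the following sketch scores each word by the reciprocal of its token distance to the nearest occurrence of the keyword and returns the top-scoring words. The reciprocal formula, the token-count distance, and the name `extract_expansion_words` are assumptions made for the example; the claim only requires that the degree of association depend on distance.

```python
def extract_expansion_words(tokens, keyword, top_n):
    """Rank words by closeness to the keyword and return the top_n.

    The degree of association is assumed to be 1 / distance, where distance
    is the number of tokens to the nearest keyword occurrence.
    """
    keyword_positions = [i for i, t in enumerate(tokens) if t == keyword]
    if not keyword_positions:
        return []

    association = {}
    for i, word in enumerate(tokens):
        if word == keyword:
            continue
        distance = min(abs(i - k) for k in keyword_positions)  # always >= 1
        score = 1.0 / distance
        association[word] = max(association.get(word, 0.0), score)

    # Descending order of association; ties are broken alphabetically.
    ranked = sorted(association, key=lambda w: (-association[w], w))
    return ranked[:top_n]

tokens = ["laser", "printer", "toner", "cartridge", "replacement", "guide"]
print(extract_expansion_words(tokens, "toner", 2))  # ['cartridge', 'printer']
```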

The document search apparatus according to claim 3, wherein the word extraction means determines the degree of association of each word with the predetermined keyword based on, in addition to the distance between that word and the predetermined keyword, the frequency of appearance of the word or the number of seed documents that include the word, and extracts a predetermined number of words in descending order of the degree of association.
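
One way to combine distance with appearance frequency and seed document count is sketched below. It is only an assumed combination: proximity (1 / distance) is summed over every occurrence, so frequent words score higher, and the sum is multiplied by the number of seed documents containing the word. The claim leaves the actual formula open, and `rank_words` is a hypothetical name.

```python
from collections import defaultdict

def rank_words(seed_docs, keyword, top_n):
    """Rank words across several seed documents (each a list of tokens)."""
    proximity = defaultdict(float)  # summed per occurrence, so frequency counts
    doc_count = defaultdict(int)    # number of seed documents containing the word

    for tokens in seed_docs:
        positions = [i for i, t in enumerate(tokens) if t == keyword]
        if not positions:
            continue  # no keyword, so no distance can be measured
        seen = set()
        for i, word in enumerate(tokens):
            if word == keyword:
                continue
            proximity[word] += 1.0 / min(abs(i - k) for k in positions)
            seen.add(word)
        for word in seen:
            doc_count[word] += 1

    scores = {w: proximity[w] * doc_count[w] for w in proximity}
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]

docs = [
    ["toner", "cartridge", "yield"],
    ["replace", "toner", "cartridge"],
    ["toner", "drum"],
]
print(rank_words(docs, "toner", 2))  # ['cartridge', 'drum']
```

Here "cartridge" ranks first because it is adjacent to the keyword in two seed documents, while "drum" and "replace" tie and are ordered alphabetically.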

The document search apparatus according to claim 3, further comprising decrease rate setting means for setting the rate at which the degree of association decreases as the distance between a word and the predetermined keyword increases.

The document search apparatus according to claim 5, wherein the rate of decrease differs from sentence to sentence.
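
The following sketch shows one possible reading of a configurable decrease rate that depends on sentences: the degree of association falls linearly with token distance, using one rate when the word and the keyword share a sentence and a steeper rate otherwise. The linear decay, the two-rate scheme, the default values, and the name `score_words` are all assumptions; the claims state only that a decrease rate is set and that it differs by sentence.

```python
def score_words(sentences, keyword, rate_within=0.125, rate_across=0.25):
    """Score words with a settable, sentence-dependent decrease rate.

    sentences is a list of token lists. Both rates are hypothetical defaults.
    """
    # Flatten the text while remembering each token's sentence index.
    flat = [(word, s) for s, sent in enumerate(sentences) for word in sent]
    keyword_hits = [(i, s) for i, (w, s) in enumerate(flat) if w == keyword]

    scores = {}
    for i, (word, s) in enumerate(flat):
        if word == keyword:
            continue
        best = 0.0  # association never drops below zero
        for k, keyword_sentence in keyword_hits:
            rate = rate_within if s == keyword_sentence else rate_across
            best = max(best, 1.0 - rate * abs(i - k))
        scores[word] = max(scores.get(word, 0.0), best)
    return scores

sentences = [["toner", "runs", "low"], ["replace", "cartridge"]]
print(score_words(sentences, "toner"))
# {'runs': 0.875, 'low': 0.75, 'replace': 0.25, 'cartridge': 0.0}
```

Raising `rate_across` makes words outside the keyword's sentence drop out of the ranking sooner, which is the kind of tuning a decrease rate setting would allow.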

A document search method for searching for a document in a predetermined document database based on an input search condition, the method comprising:
a seed document acquisition step of acquiring a seed document based on an input seed document acquisition character string;
a related document acquisition step of acquiring, as related documents, other documents used by a user of the seed document acquired in the seed document acquisition step;
a word extraction step of extracting words related to the search condition from the seed document and the related documents; and
a search step of searching for a document based on the search condition and the words extracted in the word extraction step.

A document search method for searching for a document in a predetermined document database based on an input search condition, the method comprising:
a seed document acquisition step of acquiring a seed document based on an input seed document acquisition character string;
a word extraction step of extracting words related to the search condition from the seed document acquired in the seed document acquisition step; and
a search step of searching for a document based on the search condition and the words extracted in the word extraction step,
wherein the word extraction step determines a degree of association of each word with a predetermined keyword based on the distance between that word and the predetermined keyword, and extracts a predetermined number of words in descending order of the degree of association.

A document search program for causing a computer to search for a document in a predetermined document database based on an input search condition, the program causing the computer to execute:
a seed document acquisition step of acquiring a seed document based on an input seed document acquisition character string;
a related document acquisition step of acquiring, as related documents, other documents used by a user of the seed document acquired in the seed document acquisition step;
a word extraction step of extracting words related to the search condition from the seed document and the related documents; and
a search step of searching for a document based on the search condition and the words extracted in the word extraction step.

A document search program for causing a computer to search for a document in a predetermined document database based on an input search condition, the program causing the computer to execute:
a seed document acquisition step of acquiring a seed document based on an input seed document acquisition character string;
a word extraction step of extracting words related to the search condition from the seed document acquired in the seed document acquisition step; and
a search step of searching for a document based on the search condition and the words extracted in the word extraction step,
wherein the word extraction step determines a degree of association of each word with a predetermined keyword based on the distance between that word and the predetermined keyword, and extracts a predetermined number of words in descending order of the degree of association.

A computer-readable recording medium in which the document search program according to claim 9 or 10 is recorded.