JP2006251935A

JP2006251935A - Document retrieval device, document retrieval method and document retrieval program

Info

Publication number: JP2006251935A
Application number: JP2005064680A
Authority: JP
Inventors: Atsuyuki Goto; 淳之後藤
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2005-03-08
Filing date: 2005-03-08
Publication date: 2006-09-21
Anticipated expiration: 2025-03-08
Also published as: JP4754849B2

Abstract

PROBLEM TO BE SOLVED: To improve the precision of document retrieval in a document retrieval device, a document retrieval method and a document retrieval program.

SOLUTION: A document retrieval device 100 is provided with a retrieval word extracting part 201 for extracting a retrieval word from a retrieval sentence designated by a retriever who is the operator, a related document retrieving part 202 for retrieving related documents from a document group, a conforming document designating part 203 for designating conforming documents from the related documents according to the retriever's operation on an input/output part 230, a related word extracting part 204 for extracting related words based on the conforming documents, a non-conforming document extracting part 205 for extracting non-conforming documents, that is, documents which are not conforming documents, a learning part 206 for generating classification parameters by using the conforming documents and the non-conforming documents, a pre-filtering part 207 for verifying the validity of the classification parameters, and a classifying part 208 for classifying the conforming documents from the related documents by using the classification parameters whose validity has been verified by the pre-filtering part 207.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

この発明は、文書検索装置、文書検索方法、および文書検索プログラムに関する。 The present invention relates to a document search device, a document search method, and a document search program.

文書検索における課題は、いかに効率よく目的とする文書を探し当てるかにある。この課題を解決するために、従来の文書検索では、キーワードを論理演算子と組み合わせて文書検索を行い、ここで得られた検索結果に対し新たなキーワードと論理演算子とを組み合わせて検索結果の絞込みを行っていた。特に、検索者が検索結果の一部から適合文書を選択して学習データとして与えれば、全文検索文書サーバが管理する全文書を適合文書と不適合文書に分類するための分類用パラメータを生成でき、検索者に適合文書のみを提示（フィルタリング）することが可能であった。 The challenge in document retrieval is how to find the target documents efficiently. To address this, conventional document retrieval performs a search using keywords combined with logical operators, and then narrows down the results by applying new keywords combined with logical operators to them. In particular, if the searcher selects conforming documents from part of the search results and supplies them as learning data, classification parameters for classifying all documents managed by a full-text search document server into conforming and nonconforming documents can be generated, making it possible to present (filter) only the conforming documents to the searcher.

しかしながら、従来技術では、全文検索文書サーバが管理する全文書からフィルタリングによって取り出した適合文書に、学習データとして指定した適合文書が含まれる保障がないという問題がある。具体的には、次のような場合が考えられる。 However, the conventional technique has a problem that there is no guarantee that the conforming document specified as the learning data is included in the conforming document extracted by filtering from all the documents managed by the full-text search document server. Specifically, the following cases can be considered.

第一に、分類用パラメータに学習データである適合文書の単語が十分に反映されない場合、その適合文書は、フィルタリング処理で不適合文書として扱われるおそれがある。 First, when the words of the conforming document, which is the learning data, are not sufficiently reflected in the classification parameters, the conforming document may be handled as a nonconforming document in the filtering process.

第二に、適合文書に含まれない単語で不適合文書に含まれる単語が分類用パラメータになり、単純検索すると適合文書がヒットしまうおそれがある。たとえば、『リーン』が分類用パラメータに選ばれ、単純検索すると『リン』を含む適合文書がヒットするような場合である。これは、検索モジュールが単語『リーン』を正規化し、『リン』と同一視するための副作用が生じる。このとき、適合文書は『リーン』が含まれているとみなされ、分類用パラメータ『リーン』に対応する重みが減じられ、その結果、不適合文書に分類されることになる。 Second, a word that is contained in nonconforming documents but not in the conforming documents may become a classification parameter, and a simple search for that word may nevertheless hit a conforming document. For example, 『リーン』 ("lean") is selected as a classification parameter, and a simple search for it hits a conforming document containing 『リン』 ("rin"). This is a side effect of the search module normalizing the word 『リーン』 and treating it as identical to 『リン』. In this case, the conforming document is regarded as containing 『リーン』, the weight corresponding to the classification parameter 『リーン』 is deducted, and as a result the document is classified as nonconforming.
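The normalization side effect described above can be reproduced with a short sketch. The normalizer below is hypothetical: the source does not say how the search module normalizes text, so dropping the prolonged sound mark 「ー」 after NFKC folding is an assumption chosen only to make 『リーン』 and 『リン』 collide. The two sample documents are likewise made up for illustration.

```python
import unicodedata

PROLONGED_SOUND_MARK = "\u30fc"  # the katakana long-vowel mark "ー"


def normalize(text: str) -> str:
    """Hypothetical normalizer: NFKC folding, then drop the long-vowel mark."""
    return unicodedata.normalize("NFKC", text).replace(PROLONGED_SOUND_MARK, "")


def simple_search(term: str, document: str) -> bool:
    """Substring match on normalized text, standing in for the search module."""
    return normalize(term) in normalize(document)


# Made-up sample documents (not from the source).
conforming_doc = "リン酸塩の分析方法"        # contains リン but not リーン
nonconforming_doc = "リーン生産方式の導入"    # contains リーン

# The parameter リーン was learned from the nonconforming document ...
hits_nonconforming = simple_search("リーン", nonconforming_doc)
# ... but after normalization it also hits the conforming document.
hits_conforming = simple_search("リーン", conforming_doc)
# An exact, unnormalized match would not have hit it.
exact_hit = "リーン" in conforming_doc
```

Under these assumptions both searches report a hit, while the unnormalized comparison does not, which is the situation in which the negative weight of 『リーン』 is wrongly charged to the conforming document.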

第三に、部分文字列の一部が分類用パラメータになる場合、文書検索に不具合が生じるおそれがある。たとえば、分類用パラメータとして、『京都』が選択され、『京都』で単純検索すると『東京都』を含む文書もヒットする。すなわち、不適合文書が単語『京都』、適合文書が単語『東京都』を含む場合がこれに該当する。 Third, when a substring of a longer word becomes a classification parameter, document retrieval may malfunction. For example, when 『京都』 (Kyoto) is selected as a classification parameter, a simple search for 『京都』 also hits documents containing 『東京都』 (Tokyo Metropolis), because 『京都』 is a substring of 『東京都』. This corresponds to the case where a nonconforming document contains the word 『京都』 and a conforming document contains the word 『東京都』.
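The substring problem can be seen directly in code. The following is a minimal sketch, assuming that "simple search" means plain substring matching; the sample documents and the hand-made word segmentation are illustrative assumptions, not taken from the source.

```python
def simple_search(term: str, document: str) -> bool:
    """Plain substring match, assumed here to model the 'simple search'."""
    return term in document


def token_search(term: str, tokens: list[str]) -> bool:
    """Match against whole words only (tokens are segmented by hand below)."""
    return term in tokens


# Made-up sample documents (not from the source).
conforming_doc = "東京都の条例"
nonconforming_doc = "京都の寺院"

# Hypothetical hand segmentation of the conforming document.
conforming_tokens = ["東京都", "の", "条例"]

substring_hit = simple_search("京都", conforming_doc)   # 京都 sits inside 東京都
whole_word_hit = token_search("京都", conforming_tokens)
```

Here `substring_hit` is true even though the conforming document never mentions Kyoto, while the whole-word comparison is false. The contrast is only meant to show why a parameter learned from the nonconforming document can penalize the conforming one.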

以上のような不具合は、検索者が文書検索装置に対し学習データとして適合文書を指定したのに、フィルタリング結果にそれらの適合文書が現れないのは検索者にとり大きな不満になる。 Such problems are a major source of dissatisfaction for the searcher: the searcher has designated conforming documents to the document search apparatus as learning data, yet those conforming documents do not appear in the filtering result.

この発明は、上述した従来技術による問題点を解消するため、検索者が学習データとして指定した適合文書が必ず文書検索した結果に含まれるようにすることで文書検索用の分類用パラメータが補正され、良好な文書検索結果が得られる文書検索装置、文書検索方法、および文書検索プログラムを提供することを目的とする。 To solve the above-described problems of the prior art, an object of the present invention is to provide a document search apparatus, a document search method, and a document search program in which the classification parameters for document retrieval are corrected so that the conforming documents designated by the searcher as learning data are always included in the retrieval result, thereby yielding good document retrieval results.

上述した課題を解決し、目的を達成するため、この発明の請求項１にかかる文書検索装置は、操作命令を受け付ける操作手段と、前記操作手段に対する検索者の入力操作に応じて検索用の語句を検索語として抽出する検索語抽出手段と、電子化された複数の文書を文書群として蓄積する蓄積手段と、前記蓄積手段に蓄積されている文書群から、前記検索語抽出手段により抽出された前記検索語を含む文書を関連文書として検索する第１の関連文書検索手段と、表示動作を行う表示手段と、前記第１の関連文書検索手段による検索結果を前記表示手段に表示させる第１の表示制御手段と、前記操作手段に対する検索者の入力操作に応じて、前記第１の関連文書検索手段により検索された複数の前記関連文書から検索者が求める適合文書を指定する適合文書指定手段と、前記適合文書指定手段により指定された前記適合文書に基づいて検索用の語句を関連語として抽出する関連語抽出手段と、前記蓄積手段に蓄積されている文書群から、前記関連語抽出手段により抽出された前記関連語を含む文書を関連文書として検索する第２の関連文書検索手段と、前記第２の関連文書検索手段により検索された複数の前記関連文書から、前記適合文書指定手段により指定された前記適合文書に基づいて検索者が求めない不適合文書を抽出する不適合文書抽出手段と、前記適合文書指定手段により指定された前記適合文書および前記不適合文書抽出手段により抽出された前記不適合文書に基づいて分類用パラメータを生成する学習手段と、前記学習手段により生成された前記分類用パラメータの妥当性を検証するプレフィルタリング手段と、前記プレフィルタリング手段で妥当性が検証された分類用パラメータを検索語として、前記蓄積手段に蓄積されている文書群に対して再検索を行う第３の関連文書検索手段と、前記第３の関連文書検索手段による再検索結果に対して、前記プレフィルタリング手段で妥当性が検証された分類用パラメータに基づいて前記適合文書を分類する分類手段と、前記分類手段による分類結果を前記表示手段に表示させる第２の表示制御手段と、を備えていることを特徴とする。 In order to solve the above-described problems and achieve the object, a document search apparatus according to claim 1 of the present invention comprises: operation means for accepting operation commands; search word extraction means for extracting a search word or phrase as a search word in response to a searcher's input operation on the operation means; storage means for storing a plurality of digitized documents as a document group; first related document search means for searching the document group stored in the storage means for documents containing the search word extracted by the search word extraction means, as related documents; display means for performing a display operation; first display control means for causing the display means to display the search result of the first related document search means; conforming document designation means for designating, in response to a searcher's input operation on the operation means, conforming documents sought by the searcher from among the plurality of related documents retrieved by the first related document search means; related word extraction means for extracting search words or phrases as related words based on the conforming documents designated by the conforming document designation means; second related document search means for searching the document group stored in the storage means for documents containing the related words extracted by the related word extraction means, as related documents; nonconforming document extraction means for extracting, from the plurality of related documents retrieved by the second related document search means, nonconforming documents not sought by the searcher, based on the conforming documents designated by the conforming document designation means; learning means for generating classification parameters based on the conforming documents designated by the conforming document designation means and the nonconforming documents extracted by the nonconforming document extraction means; pre-filtering means for verifying the validity of the classification parameters generated by the learning means; third related document search means for re-searching the document group stored in the storage means using, as search words, the classification parameters whose validity has been verified by the pre-filtering means; classification means for classifying the conforming documents in the re-search result of the third related document search means based on the classification parameters whose validity has been verified by the pre-filtering means; and second display control means for causing the display means to display the classification result of the classification means.

この請求項１に記載の発明によれば、検索者が指定した適合文書に基づく関連文書の検索に際し、蓄積手段などで管理されている全文書からフィルタリングによって取り出した適合文書に、必ず学習データとして指定した適合文書が含まれるため、文書検索の精度を向上させることができる。 According to the invention of claim 1, when related documents are searched for based on the conforming documents designated by the searcher, the conforming documents extracted by filtering from all documents managed by the storage means or the like always include the conforming documents designated as learning data, so that the accuracy of document retrieval can be improved.

また、請求項２にかかる文書検索装置は、請求項１に記載の発明において、前記プレフィルタリング手段が、前記学習手段が生成した分類用パラメータによって、前記適合文書と前記不適合文書が正確に適合文書と不適合文書とに分類されるように前記分類用パラメータを補正することを特徴とする。 The document search apparatus according to claim 2 is characterized in that, in the invention of claim 1, the pre-filtering means corrects the classification parameters generated by the learning means so that, with those parameters, the conforming documents and the nonconforming documents are correctly classified as conforming and nonconforming documents, respectively.

この請求項２に記載の発明によれば、文書検索に用いる分類用のパラメータの精度を向上させることができる。 According to the second aspect of the present invention, it is possible to improve the accuracy of classification parameters used for document retrieval.

また、請求項３にかかる文書検索装置は、請求項２に記載の発明において、前記プレフィルタリング手段が、妥当でない分類用パラメータを検出した際には、当該分類用パラメータを削除することを特徴とする。 The document search apparatus according to claim 3 is characterized in that, in the invention of claim 2, when the pre-filtering means detects an invalid classification parameter, it deletes that classification parameter.

この請求項３に記載の発明によれば、妥当でない分類用パラメータが用いられるような不具合を回避することができる。 According to the third aspect of the present invention, it is possible to avoid a problem that an invalid classification parameter is used.
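The correction and deletion behavior of claims 2 and 3 can be illustrated with a small sketch. Everything concrete here is an assumption: this part of the text does not define the form of the classification parameters, the scoring rule, or the deletion criterion. The sketch treats the parameters as a term-to-weight map, scores a document by summing the weights of terms found by substring search, and deletes negative-weight terms that cause a designated conforming document to fall at or below a hypothetical threshold.

```python
def score(params: dict[str, float], document: str) -> float:
    """Sum the weights of all parameter terms found in the document."""
    return sum(weight for term, weight in params.items() if term in document)


def prefilter(params: dict[str, float], conforming_docs: list[str],
              threshold: float = 0.0) -> dict[str, float]:
    """Delete parameters that make a designated conforming document fail.

    Hypothetical rule: while a conforming document scores at or below the
    threshold, remove the most negative term that hits it.
    """
    corrected = dict(params)
    for doc in conforming_docs:
        while score(corrected, doc) <= threshold:
            harmful = [t for t, w in corrected.items() if w < 0 and t in doc]
            if not harmful:
                break  # nothing left to delete; this sketch gives up here
            del corrected[min(harmful, key=lambda t: corrected[t])]
    return corrected


# Made-up parameters and learning data (not from the source).
learned = {"東京都": 0.4, "京都": -0.9, "寺院": -0.5}
designated = ["東京都の条例"]

corrected = prefilter(learned, designated)
```

With these made-up values, 『京都』 is removed because it hits 『東京都の条例』 as a substring and pushes the score below zero, while 『寺院』 is kept because it never touches a designated conforming document.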

また、請求項４にかかる文書検索方法は、操作を受け付ける操作手段に対する検索者の入力操作に応じて検索用の語句を検索語として抽出する検索語抽出工程と、電子化された複数の文書を文書群として蓄積する蓄積手段に蓄積されている文書群から、前記検索語抽出工程により抽出された前記検索語を含む文書を関連文書として検索する第１の関連文書検索工程と、前記第１の関連文書検索工程による検索結果を表示させる第１の検索結果表示工程と、前記操作手段に対する検索者の入力操作に応じて、前記第１の関連文書検索工程により検索された複数の前記関連文書から検索者が求める適合文書を指定する適合文書指定工程と、前記適合文書指定工程により指定された前記適合文書に基づいて検索用の語句を関連語として抽出する関連語抽出工程と、前記蓄積手段に蓄積されている文書群から、前記関連語抽出工程により抽出された前記関連語を含む文書を関連文書として検索する第２の関連文書検索工程と、前記第２の関連文書検索工程により検索された複数の前記関連文書から、前記適合文書指定工程により指定された前記適合文書に基づいて検索者が求めない不適合文書を抽出する不適合文書抽出工程と、前記適合文書指定工程により指定された前記適合文書および前記不適合文書抽出工程により抽出された前記不適合文書に基づいて分類用パラメータを生成する分類用パラメータ生成工程と、前記分類用パラメータ生成工程により生成された前記分類用パラメータの妥当性を検証するプレフィルタリング工程と、前記プレフィルタリング工程で妥当性が検証された分類用パラメータを検索語として、前記蓄積手段に蓄積されている文書群に対して再検索を行う第３の関連文書検索工程と、前記第３の関連文書検索工程による再検索結果に対して、前記プレフィルタリング工程で妥当性が検証された分類用パラメータに基づいて前記適合文書を分類する適合文書分類工程と、前記適合文書分類工程による分類結果を表示する第２の表示工程と、を含むことを特徴とする。 A document search method according to claim 4 comprises: a search word extraction step of extracting a search word or phrase as a search word in response to a searcher's input operation on operation means that accepts operations; a first related document search step of searching a document group stored in storage means, which stores a plurality of digitized documents as the document group, for documents containing the search word extracted in the search word extraction step, as related documents; a first search result display step of displaying the search result of the first related document search step; a conforming document designation step of designating, in response to a searcher's input operation on the operation means, conforming documents sought by the searcher from among the plurality of related documents retrieved in the first related document search step; a related word extraction step of extracting search words or phrases as related words based on the conforming documents designated in the conforming document designation step; a second related document search step of searching the document group stored in the storage means for documents containing the related words extracted in the related word extraction step, as related documents; a nonconforming document extraction step of extracting, from the plurality of related documents retrieved in the second related document search step, nonconforming documents not sought by the searcher, based on the conforming documents designated in the conforming document designation step; a classification parameter generation step of generating classification parameters based on the conforming documents designated in the conforming document designation step and the nonconforming documents extracted in the nonconforming document extraction step; a pre-filtering step of verifying the validity of the classification parameters generated in the classification parameter generation step; a third related document search step of re-searching the document group stored in the storage means using, as search words, the classification parameters whose validity has been verified in the pre-filtering step; a conforming document classification step of classifying the conforming documents in the re-search result of the third related document search step based on the classification parameters whose validity has been verified in the pre-filtering step; and a second display step of displaying the classification result of the conforming document classification step.

この請求項４に記載の発明によれば、検索者が指定した適合文書に基づく関連文書の検索に際し、蓄積手段などで管理されている全文書からフィルタリングによって取り出した適合文書に、必ず学習データとして指定した適合文書が含まれるため、文書検索の精度を向上させることができる。 According to the invention of claim 4, when related documents are searched for based on the conforming documents designated by the searcher, the conforming documents extracted by filtering from all documents managed by the storage means or the like always include the conforming documents designated as learning data, so that the accuracy of document retrieval can be improved.

また、請求項５にかかる文書検索方法は、請求項４に記載の発明において、前記プレフィルタリング工程が、前記分類用パラメータ生成工程により生成された分類用パラメータによって、前記適合文書と前記不適合文書が正確に適合文書と不適合文書とに分類されるように前記分類用パラメータを補正することを特徴とする。 The document search method according to claim 5 is characterized in that, in the invention of claim 4, the pre-filtering step corrects the classification parameters generated in the classification parameter generation step so that, with those parameters, the conforming documents and the nonconforming documents are correctly classified as conforming and nonconforming documents, respectively.

この請求項５に記載の発明によれば、文書検索に用いる分類用のパラメータの精度を向上させることができる。 According to the fifth aspect of the present invention, it is possible to improve the accuracy of classification parameters used for document retrieval.

また、請求項６にかかる文書検索方法は、請求項５に記載の発明において、前記プレフィルタリング工程が、妥当でない分類用パラメータを検出した際には、当該分類用パラメータを削除することを特徴とする。 The document search method according to claim 6 is characterized in that, in the invention of claim 5, when the pre-filtering step detects an invalid classification parameter, it deletes that classification parameter.

この請求項６に記載の発明によれば、妥当でない分類用パラメータが用いられるような不具合を回避することができる。 According to the sixth aspect of the present invention, it is possible to avoid a problem that an invalid classification parameter is used.

また、請求項７にかかる文書検索プログラムは、請求項４〜６のいずれか一つに記載の文書検索方法をコンピュータに実行させることを特徴とする。 A document search program according to a seventh aspect causes a computer to execute the document search method according to any one of the fourth to sixth aspects.

この請求項７に記載の発明によれば、請求項４〜６のいずれか一つに記載の文書検索方法をコンピュータに実行させることができる。 According to the seventh aspect of the present invention, the computer can execute the document search method according to any one of the fourth to sixth aspects.

以上説明したように、請求項１に記載の発明によれば、操作命令を受け付ける操作手段と、前記操作手段に対する検索者の入力操作に応じて検索用の語句を検索語として抽出する検索語抽出手段と、電子化された複数の文書を文書群として蓄積する蓄積手段と、前記蓄積手段に蓄積されている文書群から、前記検索語抽出手段により抽出された前記検索語を含む文書を関連文書として検索する第１の関連文書検索手段と、表示動作を行う表示手段と、前記第１の関連文書検索手段による検索結果を前記表示手段に表示させる第１の表示制御手段と、前記操作手段に対する検索者の入力操作に応じて、前記第１の関連文書検索手段により検索された複数の前記関連文書から検索者が求める適合文書を指定する適合文書指定手段と、前記適合文書指定手段により指定された前記適合文書に基づいて検索用の語句を関連語として抽出する関連語抽出手段と、前記蓄積手段に蓄積されている文書群から、前記関連語抽出手段により抽出された前記関連語を含む文書を関連文書として検索する第２の関連文書検索手段と、前記第２の関連文書検索手段により検索された複数の前記関連文書から、前記適合文書指定手段により指定された前記適合文書に基づいて検索者が求めない不適合文書を抽出する不適合文書抽出手段と、前記適合文書指定手段により指定された前記適合文書および前記不適合文書抽出手段により抽出された前記不適合文書に基づいて分類用パラメータを生成する学習手段と、前記学習手段により生成された前記分類用パラメータの妥当性を検証するプレフィルタリング手段と、前記プレフィルタリング手段で妥当性が検証された分類用パラメータを検索語として、前記蓄積手段に蓄積されている文書群に対して再検索を行う第３の関連文書検索手段と、前記第３の関連文書検索手段による再検索結果に対して、前記プレフィルタリング手段で妥当性が検証された分類用パラメータに基づいて前記適合文書を分類する分類手段と、前記分類手段による分類結果を前記表示手段に表示させる第２の表示制御手段と、を備えているので、検索者が指定した適合文書に基づく関連文書の検索に際し、蓄積手段などで管理されている全文書からフィルタリングによって取り出した適合文書に、必ず学習データとして指定した適合文書が含まれるため、文書検索の精度を向上させることができるという効果を奏する。 As described above, the invention of claim 1 comprises: operation means for accepting operation commands; search word extraction means for extracting a search word or phrase as a search word in response to a searcher's input operation on the operation means; storage means for storing a plurality of digitized documents as a document group; first related document search means for searching the document group stored in the storage means for documents containing the search word extracted by the search word extraction means, as related documents; display means for performing a display operation; first display control means for causing the display means to display the search result of the first related document search means; conforming document designation means for designating, in response to a searcher's input operation on the operation means, conforming documents sought by the searcher from among the plurality of related documents retrieved by the first related document search means; related word extraction means for extracting search words or phrases as related words based on the conforming documents designated by the conforming document designation means; second related document search means for searching the document group stored in the storage means for documents containing the related words extracted by the related word extraction means, as related documents; nonconforming document extraction means for extracting, from the plurality of related documents retrieved by the second related document search means, nonconforming documents not sought by the searcher, based on the conforming documents designated by the conforming document designation means; learning means for generating classification parameters based on the conforming documents designated by the conforming document designation means and the nonconforming documents extracted by the nonconforming document extraction means; pre-filtering means for verifying the validity of the classification parameters generated by the learning means; third related document search means for re-searching the document group stored in the storage means using, as search words, the classification parameters whose validity has been verified by the pre-filtering means; classification means for classifying the conforming documents in the re-search result of the third related document search means based on the classification parameters whose validity has been verified by the pre-filtering means; and second display control means for causing the display means to display the classification result of the classification means. Consequently, when related documents are searched for based on the conforming documents designated by the searcher, the conforming documents extracted by filtering from all documents managed by the storage means or the like always include the conforming documents designated as learning data, which has the effect of improving the accuracy of document retrieval.

また、請求項２に記載の発明によれば、請求項１に記載の発明において、前記プレフィルタリング手段が、前記学習手段が生成した分類用パラメータによって、前記適合文書と前記不適合文書が正確に適合文書と不適合文書とに分類されるように前記分類用パラメータを補正するので、文書検索に用いる分類用のパラメータの精度を向上させることができるという効果を奏する。 According to the invention of claim 2, in the invention of claim 1, the pre-filtering means corrects the classification parameters generated by the learning means so that, with those parameters, the conforming documents and the nonconforming documents are correctly classified as conforming and nonconforming documents, respectively. This has the effect of improving the accuracy of the classification parameters used for document retrieval.

また、請求項３に記載の発明によれば、請求項２に記載の発明において、前記プレフィルタリング手段が、妥当でない分類用パラメータを検出した際には、当該分類用パラメータを削除するので、妥当でない分類用パラメータが用いられるような不具合を回避することができるという効果を奏する。 According to the invention of claim 3, in the invention of claim 2, when the pre-filtering means detects an invalid classification parameter, it deletes that classification parameter. This has the effect of avoiding the problem of an invalid classification parameter being used.

また、請求項４に記載の発明によれば、操作を受け付ける操作手段に対する検索者の入力操作に応じて検索用の語句を検索語として抽出する検索語抽出工程と、電子化された複数の文書を文書群として蓄積する蓄積手段に蓄積されている文書群から、前記検索語抽出工程により抽出された前記検索語を含む文書を関連文書として検索する第１の関連文書検索工程と、前記第１の関連文書検索工程による検索結果を表示させる第１の検索結果表示工程と、前記操作手段に対する検索者の入力操作に応じて、前記第１の関連文書検索工程により検索された複数の前記関連文書から検索者が求める適合文書を指定する適合文書指定工程と、前記適合文書指定工程により指定された前記適合文書に基づいて検索用の語句を関連語として抽出する関連語抽出工程と、前記蓄積手段に蓄積されている文書群から、前記関連語抽出工程により抽出された前記関連語を含む文書を関連文書として検索する第２の関連文書検索工程と、前記第２の関連文書検索工程により検索された複数の前記関連文書から、前記適合文書指定工程により指定された前記適合文書に基づいて検索者が求めない不適合文書を抽出する不適合文書抽出工程と、前記適合文書指定工程により指定された前記適合文書および前記不適合文書抽出工程により抽出された前記不適合文書に基づいて分類用パラメータを生成する分類用パラメータ生成工程と、前記分類用パラメータ生成工程により生成された前記分類用パラメータの妥当性を検証するプレフィルタリング工程と、前記プレフィルタリング工程で妥当性が検証された分類用パラメータを検索語として、前記蓄積手段に蓄積されている文書群に対して再検索を行う第３の関連文書検索工程と、前記第３の関連文書検索工程による再検索結果に対して、前記プレフィルタリング工程で妥当性が検証された分類用パラメータに基づいて前記適合文書を分類する適合文書分類工程と、前記適合文書分類工程による分類結果を表示する第２の表示工程と、を含むので、検索者が指定した適合文書に基づく関連文書の検索に際し、蓄積手段などで管理されている全文書からフィルタリングによって取り出した適合文書に、必ず学習データとして指定した適合文書が含まれるため、文書検索の精度を向上させることができるという効果を奏する。 The invention of claim 4 comprises: a search word extraction step of extracting a search word or phrase as a search word in response to a searcher's input operation on operation means that accepts operations; a first related document search step of searching a document group stored in storage means, which stores a plurality of digitized documents as the document group, for documents containing the search word extracted in the search word extraction step, as related documents; a first search result display step of displaying the search result of the first related document search step; a conforming document designation step of designating, in response to a searcher's input operation on the operation means, conforming documents sought by the searcher from among the plurality of related documents retrieved in the first related document search step; a related word extraction step of extracting search words or phrases as related words based on the conforming documents designated in the conforming document designation step; a second related document search step of searching the document group stored in the storage means for documents containing the related words extracted in the related word extraction step, as related documents; a nonconforming document extraction step of extracting, from the plurality of related documents retrieved in the second related document search step, nonconforming documents not sought by the searcher, based on the conforming documents designated in the conforming document designation step; a classification parameter generation step of generating classification parameters based on the conforming documents designated in the conforming document designation step and the nonconforming documents extracted in the nonconforming document extraction step; a pre-filtering step of verifying the validity of the classification parameters generated in the classification parameter generation step; a third related document search step of re-searching the document group stored in the storage means using, as search words, the classification parameters whose validity has been verified in the pre-filtering step; a conforming document classification step of classifying the conforming documents in the re-search result of the third related document search step based on the classification parameters whose validity has been verified in the pre-filtering step; and a second display step of displaying the classification result of the conforming document classification step. Consequently, when related documents are searched for based on the conforming documents designated by the searcher, the conforming documents extracted by filtering from all documents managed by the storage means or the like always include the conforming documents designated as learning data, which has the effect of improving the accuracy of document retrieval.

また、請求項５に記載の発明によれば、請求項４に記載の発明において、前記プレフィルタリング工程が、前記分類用パラメータ生成工程により生成された分類用パラメータによって、前記適合文書と前記不適合文書が正確に適合文書と不適合文書とに分類されるように前記分類用パラメータを補正するので、文書検索に用いる分類用のパラメータの精度を向上させることができるという効果を奏する。 According to the invention of claim 5, in the invention of claim 4, the pre-filtering step corrects the classification parameters generated in the classification parameter generation step so that, with those parameters, the conforming documents and the nonconforming documents are correctly classified as conforming and nonconforming documents, respectively. This has the effect of improving the accuracy of the classification parameters used for document retrieval.

また、請求項６に記載の発明によれば、請求項５に記載の発明において、前記プレフィルタリング工程が、妥当でない分類用パラメータを検出した際には、当該分類用パラメータを削除するので、妥当でない分類用パラメータが用いられるような不具合を回避することができるという効果を奏する。 According to the invention of claim 6, in the invention of claim 5, when the pre-filtering step detects an invalid classification parameter, it deletes that classification parameter. This has the effect of avoiding the problem of an invalid classification parameter being used.

また、請求項７に記載の発明によれば、請求項４〜６のいずれか一つに記載の文書検索方法をコンピュータに実行させることによって、請求項４〜６のいずれか一つに記載の文書検索方法をコンピュータで実現することが可能なプログラムが得られるという効果を奏する。 In addition, according to the invention described in claim 7, by causing a computer to execute the document search method described in any one of claims 4 to 6, the document search method described in any one of claims 4 to 6 is provided. There is an effect that a program capable of realizing the document search method by a computer can be obtained.

以下、添付図面を参照して、この発明にかかる文書検索装置、文書検索方法、および文書検索プログラムの好適な実施の形態を詳細に説明する。 Exemplary embodiments of a document search device, a document search method, and a document search program according to the present invention will be explained below in detail with reference to the accompanying drawings.

（文書検索装置のハードウエア構成） (Hardware configuration of the document search apparatus)
まず、この発明の実施の形態にかかる文書検索装置のハードウエア構成について説明する。図１は、この発明の実施の形態にかかる文書検索装置のハードウエア構成を示す図である。この文書検索装置１００は、各種演算を行って装置全体を制御するＣＰＵ１０１と、各種のＲＯＭやＲＡＭからなるメモリ１０２とを備えており、それらはバス１０３で接続されている。 First, the hardware configuration of the document search apparatus according to the embodiment of the present invention will be described. FIG. 1 is a diagram showing the hardware configuration of the document search apparatus according to the embodiment of the present invention. The document search apparatus 100 includes a CPU 101 that performs various calculations and controls the entire apparatus, and a memory 102 that includes various ROMs and RAMs; these are connected by a bus 103.

バス１０３には、所定のインターフェースを介して、ハードディスクなどの磁気記憶装置１０４と、キーボードやマウスなどの入力装置１０５と、表示動作を行うＬＣＤやＣＲＴなどの表示装置１０６と、光ディスクなどの記憶媒体１０７を読み取る記憶媒体読取装置１０８とが接続されている。また、バス１０３には、ネットワーク１１０と通信を行う通信制御装置１０９が接続されている。なお、記憶媒体１０７としては、ＣＤやＤＶＤなどの光ディスク、光磁気ディスク、フレキシブルディスクなどの各種メディアが用いられる。また、記憶媒体読取装置１０８は、記憶媒体１０７の種類に応じて光ディスク装置、光磁気ディスク装置、フレキシブルディスク装置などが用いられる。 Connected to the bus 103 via predetermined interfaces are a magnetic storage device 104 such as a hard disk, an input device 105 such as a keyboard and a mouse, a display device 106 such as an LCD or a CRT that performs display operations, and a storage medium reading device 108 that reads a storage medium 107 such as an optical disk. A communication control device 109 that communicates with a network 110 is also connected to the bus 103. As the storage medium 107, various media such as optical disks (CD, DVD, and the like), magneto-optical disks, and flexible disks are used. As the storage medium reading device 108, an optical disk device, a magneto-optical disk device, a flexible disk device, or the like is used according to the type of the storage medium 107.

磁気記憶装置１０４には、この発明のプログラムを文書検索プログラム１２０が記憶されている。この文書検索プログラム１２０は、記憶媒体１０７から記憶媒体読取装置１０８により読み取るか、あるいは、インターネットなどのネットワーク１１０からダウンロードするかなどして、磁気記憶装置１０４にインストールされたものである。このインストールにより文書検索装置１００は動作可能な状態となる。なお、この文書検索プログラム１２０は、所定のＯＳ上で動作するものであってもよい。また、特定のアプリケーションソフトの一部をなすものであってもよい。 The magnetic storage device 104 stores a document search program 120, which is the program of the present invention. The document search program 120 is installed in the magnetic storage device 104 by, for example, being read from the storage medium 107 by the storage medium reading device 108 or being downloaded from the network 110 such as the Internet. This installation makes the document search apparatus 100 operable. The document search program 120 may run on a predetermined OS, or it may form part of specific application software.

また、この文書検索装置１００がサーバ装置としてネットワーク１１０を介して端末装置に接続されているような場合には、検索者は文書検索装置１００を端末装置により操作することができる。端末装置としては、たとえば、パーソナルコンピュータ、携帯情報端末（ＰＤＡ）、携帯電話などの情報処理装置が用いられる。また、ネットワーク１１０としては、無線、有線及び放送波のいずれを用いたものでもよく、たとえば、ＬＡＮ、ＷＡＮ、インターネット、アナログ電話網、デジタル電話網、ＰＨＳ（パーソナルハンディホンシステム）網、携帯電話網、衛星通信網などを利用することができる。 When the document search apparatus 100 is connected as a server apparatus to a terminal device via the network 110, the searcher can operate the document search apparatus 100 from the terminal device. As the terminal device, an information processing device such as a personal computer, a personal digital assistant (PDA), or a mobile phone is used, for example. The network 110 may use any of wireless, wired, and broadcast-wave communication; for example, a LAN, a WAN, the Internet, an analog telephone network, a digital telephone network, a PHS (Personal Handyphone System) network, a mobile phone network, or a satellite communication network can be used.

（文書検索の機能的構成） (Functional configuration of document search)
次に、この発明の実施の形態にかかる文書検索装置の機能的構成について説明する。図２は、この発明の実施の形態にかかる文書検索装置の機能的構成を示すブロック図である。 Next, the functional configuration of the document search apparatus according to the embodiment of the present invention will be described. FIG. 2 is a block diagram showing the functional configuration of the document search apparatus according to the embodiment of the present invention.

As shown in FIG. 2, the document search device 100 includes a database (DB) 210, which is a storage unit that stores a plurality of digitized documents as a document group; a document search unit 220 for extracting conforming documents from the document group; and an input/output unit 230.

The database 210 is implemented by the magnetic storage device 104, and the input/output unit 230 is implemented by the input device 105 and the display device 106. The input/output unit 230 functions as both an operation unit and a display unit. Although the database 210 is implemented by the magnetic storage device 104 here, it is not limited to this; for example, it may be connected to the document search device 100 via the network 110.

The document search unit 220 comprises: a search word extraction unit 201 that extracts search words (words and phrases for searching) from a search sentence specified by the searcher, who is the operator; a related document search unit 202 that searches the document group for related documents; a conforming document designating unit 203 that designates conforming documents among the related documents in accordance with the searcher's operation on the input/output unit 230; a related word extraction unit 204 that extracts related words (words and phrases for searching) on the basis of the conforming documents; a non-conforming document extraction unit 205 that extracts non-conforming documents, that is, documents that are not conforming documents; a learning unit 206 that generates classification parameters using the learning data (the non-conforming documents together with the conforming documents); a pre-filtering unit 207 that verifies the validity of the classification parameters; and a classification unit 208 that classifies conforming documents out of the related documents using the classification parameters whose validity has been verified by the pre-filtering unit 207.

In the document search device 100 configured as above, the searcher first operates the input/output unit 230 to specify a search sentence as a search request. The search word extraction unit 201 extracts search words from that search sentence and inputs them to the related document search unit 202. The related document search unit 202 performs a ranked search of the document group in the database 210, treating documents that contain the search words as related documents, and inputs the search result to the input/output unit 230. The input/output unit 230 displays the search result.

The searcher examines the search result and operates the input/output unit 230 to select the documents he or she wants (that is, the documents that conform) as conforming documents. The conforming document designating unit 203 then designates a plurality of conforming documents from the search result in accordance with that selection. The related word extraction unit 204 extracts related words from the conforming documents designated by the searcher and inputs them to the related document search unit 202. The related document search unit 202 performs a ranked search of the document group in the database 210, treating documents that contain the related words as related documents, and inputs the search result to the input/output unit 230, which displays it. As a result, the conforming documents designated by the searcher appear near the top of the search result. This designation of conforming documents and search for related documents is repeated several times until a sufficient number of conforming documents has been obtained.

The searcher then operates the input/output unit 230 to issue a filtering request. The non-conforming document extraction unit 205 takes the conforming documents as input data and automatically extracts from the search result the non-conforming documents, that is, documents the searcher does not want, according to the non-conforming document extraction method described later. The extracted non-conforming documents are passed to the learning unit 206 together with the conforming documents and serve as the learning data for generating classification parameters. The learning unit 206 generates the classification parameters from this learning data and passes them to the pre-filtering unit 207.

To verify the validity of the classification parameters, the pre-filtering unit 207 actually applies them to classify the conforming documents designated by the searcher and the non-conforming documents extracted by the non-conforming document extraction method. It then corrects the classification parameters so that conforming and non-conforming documents are reliably separated. If an invalid classification parameter is detected, that parameter is deleted. When verification is finished, the classification parameters are passed to the related document search unit 202.

The related document search unit 202 performs a re-search using the verified classification parameters as search words and inputs the re-search result to the classification unit 208. The classification unit 208 receives the re-search result from the related document search unit 202, filters it using the verified classification parameters, extracts only the documents judged to be related, and inputs them to the input/output unit 230 as conforming documents. The input/output unit 230 displays these conforming documents as the search result.

Here, the method of extracting non-conforming documents is described. This method extracts non-conforming documents from a given document group (document set) on the basis of the conforming documents: a similarity measure between documents is defined, and similarities are computed in a vector space. A document is treated as non-conforming when the similarity sim between the conforming documents (whose center vector is denoted C) and an unlabeled document (whose document vector is denoted D) is at most a threshold α, that is, sim(C, D) ≤ α. The unlabeled documents are the n documents taken from the top of the related-document search result after excluding the conforming documents designated by the user. As many non-conforming documents are extracted as there are conforming documents.

Non-conforming documents are extracted by the following procedure. First, the center vector C is computed from the set R of conforming documents. The top n documents of the related-document search result are selected as S. One not-yet-selected document is taken from S, its document vector D is computed, the similarity sim(C, D) to the center vector C is calculated, and the result is inserted into a priority queue Q. The priority queue Q is partially ordered by the value of sim(C, D), and its size is kept equal to the number of conforming documents. When the maximum value of the elements in Q becomes α or less, the documents in Q are taken as the non-conforming documents N, and the extraction is complete. If the maximum value in Q has not fallen to α or less even after the similarity has been computed for all n documents in S, m more documents are selected from the related-document search result, expanding S from n to n + m documents, and the same steps are repeated for the document vectors D of the not-yet-selected documents in S.
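As an illustration only, the extraction procedure above could be sketched in Python as follows. The sketch makes several assumptions that the text does not state: documents are sparse term-weight dictionaries, sim is cosine similarity, C is the term-wise mean of the conforming vectors, and the stop condition is tested only once Q is full. The function and variable names are hypothetical.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def center_vector(vectors):
    """Center vector C: term-wise mean of the conforming document vectors in R."""
    c = {}
    for vec in vectors:
        for t, w in vec.items():
            c[t] = c.get(t, 0.0) + w / len(vectors)
    return c


def extract_non_conforming(conforming, ranked, alpha, n, m):
    """Return the ids held in Q once its largest similarity is <= alpha.

    conforming: vectors of the conforming documents (set R)
    ranked: (doc_id, vector) pairs in ranking order, conforming documents excluded
    """
    size = len(conforming)          # Q is kept at the number of conforming documents
    if size == 0:
        return []
    c = center_vector(conforming)
    q = []                          # heap of (-similarity, doc_id): root = most similar
    pos, limit = 0, min(n, len(ranked))
    while True:
        while pos < limit:          # scan the not-yet-selected documents in S
            doc_id, vec = ranked[pos]
            pos += 1
            item = (-cosine(c, vec), doc_id)
            if len(q) < size:
                heapq.heappush(q, item)
            else:
                heapq.heappushpop(q, item)   # evict the most similar entry
            if len(q) == size and -q[0][0] <= alpha:
                return sorted(doc_id for _, doc_id in q)
        if limit >= len(ranked) or m <= 0:
            break                   # nothing left to add to S
        limit = min(limit + m, len(ranked))  # expand S from n to n + m
    return []                       # threshold never reached (assumed fallback)
```

Storing negated similarities turns Python's min-heap into a max-heap, so the root of Q is always the most similar document and the test against α costs one comparison per insertion.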

Each of these functions of the document search unit 220 is realized by processing that the CPU 101 executes on the basis of the document search program 120.

(Document search process)
Next, the procedure of the document search process performed by the document search device is described. FIG. 3 is a flowchart showing this procedure. The process is carried out by the CPU 101 executing the document search program 120.

As shown in FIG. 3, the CPU 101 first displays a search screen, for example the one shown in FIG. 4, on the input/output unit 230 (step S301). The searcher then enters a search word and presses the search execution button 401 (step S302). The CPU 101 searches for related documents on the basis of the search word (step S303) and displays the result on the input/output unit 230 (step S304). Of the related documents found, which may number in the thousands or tens of thousands, a predetermined number are displayed in descending order of relevance. At this point the search screen looks, for example, like the one shown in FIG. 5.

To obtain a better search result, the searcher checks the contents of the documents in the search result, operates the input/output unit 230 to mark the documents he or she wants (the conforming documents) with ○ (see FIG. 5), and then presses the search execution button 501 (step S305).

The CPU 101 then designates the related documents marked with ○ as conforming documents, extracts related words from those conforming documents, and searches for related documents on the basis of the related words (step S306). The result of this search is displayed on the input/output unit 230 (step S307). The search screen then looks, for example, like the one shown in FIG. 6: the documents designated as conforming on the screen shown in FIG. 5 move to the top of the search result, and documents related to the conforming documents also appear near the top. To further improve the relevance of the related-document search result, the CPU 101 designates conforming documents in accordance with the searcher's operation on the input/output unit 230 and, when the search execution button 601 is then pressed, starts the relevance feedback search again (step S308). It then determines whether the number of conforming documents needed for filtering has been obtained (step S309).

Normally, two or three rounds of relevance feedback search yield the number of conforming documents needed for filtering. The more conforming documents there are, the more accurate the filtering; in practice, about seven conforming documents give satisfactory filtering results. Note that if the DB 210 contains only three documents that the searcher is looking for, no number of relevance feedback searches will yield a large number of conforming documents for filtering.

If the number of conforming documents needed for filtering has not been obtained in step S309 (step S309: No), the process returns to step S308 and continues. If it has been obtained (step S309: Yes), the input/output unit 230 displays a search result in which the conforming documents appear at the top (see the screen shown in FIG. 7) (step S310). When the filtering button 701 is pressed in this state (step S311), non-conforming documents, that is, documents the searcher does not want, are extracted from the search result with the conforming documents as input data, according to the non-conforming document extraction method described above (step S312). Classification parameters are generated using the extracted non-conforming documents and the conforming documents as learning data (step S313). The validity of the generated classification parameters is then verified, that is, pre-filtering is executed (step S314). In this step, the classification parameters are corrected so that conforming and non-conforming documents are reliably separated, and any classification parameter found to be invalid is deleted. Next, a re-search is executed using the verified classification parameters as search words (step S315). The related documents in that search result are filtered (step S316), and the re-search result is displayed (step S317). The search screen then looks like the one shown in FIG. 8.

At this point, the search result on the screen shown in FIG. 7 usually contains more non-conforming documents than conforming documents, whereas the re-search result on the screen shown in FIG. 8 contains no documents unrelated to the conforming documents. The conforming documents designated on the screens of FIGS. 5, 6, and 7 always appear in the filtering result list on the screen shown in FIG. 8.

Through this processing, when the search execution button 401 is pressed on the screen shown in FIG. 4, data flows a → b → c → d → e as shown in FIG. 2, and the screen shown in FIG. 5 appears. When the searcher marks search results with ○ on the screens shown in FIGS. 5 and 6 to obtain better results, conforming documents are designated and a relevance feedback search is performed; the data then flows f → g → h → c → d → e as shown in FIG. 2. Once enough conforming documents have been obtained, the screen shown in FIG. 6 changes to the screen shown in FIG. 7. When the filtering button 701 is pressed on the screen shown in FIG. 7, data flows i → j → k → l → c → d → m → n, and the screen shown in FIG. 8 appears.

Next, the pre-filtering process of step S314 is described, taking as an example the case where filtering is performed by linear classification.

The classifier f(x) for filtering is expressed, using the classification parameters w = {w1, w2, ..., wn} and a document vector x = {x1, x2, ..., xn}, in the form

$$ f(x) = \sum_{i=1}^{n} w_i x_i + \beta \tag{1} $$

where β is a threshold. For the document vector x of a document to be filtered (a document to be judged as conforming or non-conforming), x is a conforming document when

$$ f(x) > 0 \tag{2} $$

and a non-conforming document when

$$ f(x) \le 0 \tag{3} $$
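A minimal sketch of the decision rule in expressions (1) to (3) follows. Representing w and x as sparse dictionaries keyed by word is an assumption made for illustration, and the function name `classify` is hypothetical.

```python
def classify(weights, beta, x):
    """Evaluate f(x) = sum_i w_i * x_i + beta and apply expressions (2) and (3).

    weights: {word: weight} classification parameters (assumed sparse layout)
    x: {word: component} document vector
    Returns (f(x), True if conforming else False).
    """
    f = sum(w * x.get(word, 0.0) for word, w in weights.items()) + beta
    return f, f > 0   # f(x) <= 0 is treated as non-conforming
```

For example, with weights `{"search": 1.5, "sports": -2.0}` and β = −0.5, a document containing only "search" gives f(x) = 1.0 and is judged conforming, while a document containing both words gives f(x) = −1.0 and is judged non-conforming.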

Each classification parameter is represented as a pair consisting of a word and its weight. Hereinafter, when the classification parameter wi denotes the word itself, it is written simply as wi for convenience, and the weight of the word is written as value(wi).

In expression (1), each wi is a classification parameter and is determined by learning.

Next, the method of generating the classification parameters is described.

(A1) Learning data di (d1, d2, d3, ..., dn) are prepared.

(A2) Words are extracted from each di by morphological analysis or the like.

(A3) To obtain the words that characterize each di, the tf×idf value of each word is calculated, for example, and the top n words are stored in a set Q. Here tf is the term frequency, the frequency with which a word appears in a document, and idf is the inverse document frequency, expressed as log(N/df), where N is the number of documents and df is the number of documents in which the word appears.

(A4) Words are taken from the set Q, for example in descending order of tf×idf value. A word that is contained more in the conforming document set than in the non-conforming document set becomes a positive classification parameter; in the opposite case it becomes a negative classification parameter.

(A5) The weight of each classification parameter is determined by a learning algorithm (for example, a linear SVM, the Fisher discriminant, or the Bayesian Binary Independence Model).
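Steps (A2) to (A4) can be sketched as follows, purely as an illustration. The sketch assumes that the documents are already tokenized into word lists (morphological analysis is outside its scope), that "contained more" means "contained in more documents", and that words occurring equally often in both sets are skipped; none of these details are specified in the text. Step (A5), the weight learning, is not shown. All function names are hypothetical.

```python
import math
from collections import Counter


def candidate_words(docs, top_n):
    """Steps (A2)-(A3): put each document's top_n words by tf x idf into the set Q.

    docs: list of documents, each a list of words.
    Returns {word: best tf x idf value seen for that word}.
    """
    n_docs = len(docs)
    df = Counter(word for doc in docs for word in set(doc))  # document frequency
    q = {}
    for doc in docs:
        tf = Counter(doc)                                    # term frequency
        scored = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
        for w in sorted(scored, key=lambda w: (-scored[w], w))[:top_n]:
            q[w] = max(q.get(w, 0.0), scored[w])
    return q


def assign_signs(q, conforming, non_conforming):
    """Step (A4): +1 for words found in more conforming documents, -1 for the opposite.

    Ties are skipped, which is an assumption of this sketch.
    """
    signs = {}
    for w in sorted(q, key=lambda w: (-q[w], w)):            # descending tf x idf
        pos = sum(w in doc for doc in conforming)
        neg = sum(w in doc for doc in non_conforming)
        if pos != neg:
            signs[w] = 1 if pos > neg else -1
    return signs
```

The signs produced here only fix which parameters are positive and which are negative; their magnitudes would come from whichever learning algorithm is chosen in step (A5).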

Next, pre-filtering is described on the basis of the above method of generating the classification parameters.

Let the classification parameters with positive weights be written w(+)1, w(+)2, ..., w(+)i and those with negative weights w(−)1, w(−)2, ..., w(−)i. It is assumed that, before the pre-filtering process, the positive-weight parameters w(+)1, w(+)2, ..., w(+)i have been sorted in descending order and the negative-weight parameters w(−)1, w(−)2, ..., w(−)i in ascending order. The following processing is then performed for each learning datum di and each classification parameter wj.

First, the score of the document is initialized.

$$ \mathrm{score}(d_i) \leftarrow 0 \tag{4} $$

If di contains the classification parameter wj, the weight of wj is added to the score of the document.

$$ \mathrm{score}(d_i) \leftarrow \mathrm{score}(d_i) + \mathrm{value}(w_j) \tag{5} $$

Next, the sign of

$$ \mathrm{score}(d_i) + \beta \tag{6} $$

is determined. If the value of expression (6) is negative although di is a conforming document, or positive although di is a non-conforming document, some of the classification parameters w1, w2, ..., wn are inappropriate.
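The check described by expressions (4) to (6) might look like the following sketch. The data layout (a set of words plus a label per learning document) and the function names are assumptions, and a value of exactly zero is not flagged here because the paragraph above mentions only the negative and positive cases.

```python
def score(doc_words, params):
    """Expressions (4)-(5): start at 0 and add value(w_j) for each w_j contained in d_i."""
    s = 0.0
    for word, value in params.items():
        if word in doc_words:
            s += value
    return s


def misclassified(docs, params, beta):
    """Return ids of learning documents whose sign of expression (6) contradicts their label.

    docs: {doc_id: (set_of_words, is_conforming)}
    params: {word: weight}
    """
    bad = []
    for doc_id, (words, is_conforming) in docs.items():
        v = score(words, params) + beta      # expression (6)
        if (is_conforming and v < 0) or (not is_conforming and v > 0):
            bad.append(doc_id)
    return bad
```

A non-empty result indicates that at least one classification parameter needs to be corrected or deleted.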

The classification parameters are corrected separately for the positive and the negative parameters. To simplify the case analysis, the positive classification parameters are evaluated first and the negative ones afterwards. For a conforming document x given as learning data, f(x) ≤ 0 can occur after all positive classification parameters have been evaluated and during the evaluation of the negative classification parameters. For a non-conforming document x given as learning data, f(x) > 0 can occur during the evaluation of the positive classification parameters and after all negative classification parameters have been evaluated. The negative classification parameters are corrected using the conforming documents, and the positive classification parameters are corrected using the non-conforming documents.

If, during pre-filtering with a positive classification parameter w(+)i, a non-conforming document x yields f(x) > 0, the classification parameter w(+)i is corrected by the following operations.

(B1) The classification parameter w(+)i is deleted from w.

(B2) A classification parameter is added from Q, and the weight of the added classification parameter and the threshold β are recalculated. Expression (6) is computed and its sign is determined.

(B3) If the value is negative, the correction of the classification parameters ends; if it is 0 or more, the process returns to (B1).

Next, if, during pre-filtering with a negative classification parameter w(−)i, a conforming document x yields f(x) ≤ 0, the classification parameter w(−)i is corrected by the following operations.

(C1) The classification parameter w(−)i is deleted from w.

(C2) A classification parameter is added from Q, and the weight of the added classification parameter and the threshold β are recalculated. Expression (6) is computed and its sign is determined.

(C3) If the value is positive, the correction of the classification parameters ends; if it is 0 or less, the process returns to (C1).
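The two correction loops (B1) to (B3) and (C1) to (C3) differ only in the sign they aim for, so they can be sketched with one function. This is a rough illustration under stated assumptions: `relearn` is a hypothetical callback standing in for the unspecified recalculation of the weight and the threshold β, the order in which replacement words are drawn from Q is left to the caller, and dropping the last candidate when none succeeds is a guess based on the statement that invalid parameters are deleted.

```python
def correct_parameter(params, beta, bad_word, candidates, doc_words, conforming, relearn):
    """Replace an offending parameter until expression (6) has the required sign.

    params: {word: weight}; bad_word: the parameter w(+)i or w(-)i to replace
    candidates: replacement words drawn from Q (ordering is an assumption)
    doc_words: set of words of the learning document x being checked
    conforming: True for the (C) loop, False for the (B) loop
    relearn: hypothetical callback (params, word) -> (weight, beta)
    """
    params = dict(params)                    # work on a copy
    current = bad_word
    for word in candidates:
        del params[current]                  # (B1)/(C1): delete the parameter from w
        weight, beta = relearn(params, word) # (B2)/(C2): recalculate weight and beta
        params[word] = weight
        current = word
        value = sum(v for w, v in params.items() if w in doc_words) + beta
        ok = value > 0 if conforming else value < 0   # (B3)/(C3): check the sign
        if ok:
            return params, beta
    params.pop(current, None)                # assumed fallback: delete the invalid parameter
    return params, beta
```

Passing `conforming=True` gives the (C) loop for negative parameters, and `conforming=False` gives the (B) loop for positive parameters.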

The pre-filtering processing procedure is now described. FIG. 9 is a flowchart showing this procedure.

In the flowchart shown in FIG. 9, the conforming documents designated as learning data are first placed in a set R (step S901). The database 210 is searched using the i-th negative classification parameter w(−)i as the search word (step S902). Documents are taken one at a time from the document group obtained by the search and checked against the documents in the set R (step S903). Each document dj in R that matches the search result is given a score, score(dj), computed with the classification parameters (step S904).

It is then determined whether score(dj) is 0 or less (step S905). If score(dj) is 0 or less (step S905: Yes), w(−)i is deleted from the classification parameters w, and a classification parameter from the set Q is added in its place. The weight of the added classification parameter and the threshold β are then recalculated (step S906). If score(dj) is not 0 or less (step S905: No), j is incremented to j + 1 (step S907), and the process moves to step S904.

After step S906, it is determined whether processing has finished for all j (step S908). If it has not (step S908: No), the process moves to step S907. If it has (step S908: Yes), it is then determined whether processing has finished for all i (step S909). If processing has finished for all i (step S909: Yes), the series of processing ends. If it has not (step S909: No), i is incremented to i + 1 (step S910), and the process moves to step S902.

The flowchart in FIG. 9 shows pre-filtering with the negative parameters; pre-filtering with the positive parameters is omitted because it follows readily by symmetry.

As described above, according to the document search device and document search method of the present invention, when related documents are searched for on the basis of conforming documents designated by the searcher, the conforming documents extracted by filtering from all documents managed by the storage means always include the conforming documents designated as learning data, so the accuracy of document search can be improved. In addition, the user can run filtering without designating non-conforming documents, which improves the user's operating efficiency.

The document search method described in this embodiment can be realized by executing a program prepared in advance on a computer such as a personal computer or a workstation. The program is recorded on a computer-readable recording medium such as a hard disk, a flexible disk, a CD-ROM, an MO, or a DVD, and is executed by being read from the recording medium by the computer. The program may also be provided as a transmission medium that can be distributed via a network such as the Internet.

As described above, the document search device, document search method, and document search program according to the present invention are useful for searching for related documents on the basis of conforming documents designated by a searcher, and are particularly suitable where highly accurate document search is required.

FIG. 1 is a diagram showing the hardware configuration of the document search device according to the embodiment of the present invention.
FIG. 2 is a block diagram showing the functional configuration of the document search device according to the embodiment of the present invention.
FIG. 3 is a flowchart showing the procedure of the document search process.
FIG. 4 is a diagram showing a display example of the search screen.
FIG. 5 is a diagram showing a display example of the search screen.
FIG. 6 is a diagram showing a display example of the search screen.
FIG. 7 is a diagram showing a display example of the search screen.
FIG. 8 is a diagram showing a display example of the search screen.
FIG. 9 is a flowchart showing the pre-filtering processing procedure.

Explanation of symbols

１００文書検索装置
１０１ＣＰＵ
１０２メモリ
１０３バス
１０４磁気記憶装置
１０５入力装置
１０６表示装置
１０７記憶媒体
１０８記憶媒体読取装置
１０９通信制御装置
１１０ネットワーク
１２０文書検索プログラム
２０１検索語抽出部
２０２関連文書検索部
２０３適合文書指定部
２０４関連語抽出部
２０５不適合文書抽出部
２０６学習部
２０７プレフィルタリング部
２０８分類部
２１０データベース（ＤＢ）
２２０文書検索部
２３０入出力部

100 Document search device
101 CPU
102 Memory
103 Bus
104 Magnetic storage device
105 Input device
106 Display device
107 Storage medium
108 Storage medium reading device
109 Communication control device
110 Network
120 Document search program
201 Search word extraction unit
202 Related document search unit
203 Conforming document designation unit
204 Related word extraction unit
205 Nonconforming document extraction unit
206 Learning unit
207 Pre-filtering unit
208 Classification unit
210 Database (DB)
220 Document search unit
230 Input/output unit

Claims

1. A document search device comprising:
operation means for receiving an operation command;
search word extraction means for extracting a search phrase as a search word in response to a searcher's input operation on the operation means;
storage means for storing a plurality of digitized documents as a document group;
first related document search means for searching the document group stored in the storage means for documents that include the search word extracted by the search word extraction means, as related documents;
display means for performing a display operation;
first display control means for causing the display means to display a search result of the first related document search means;
conforming document designating means for designating, in accordance with the searcher's input operation on the operation means, a conforming document that the searcher seeks from among the plurality of related documents retrieved by the first related document search means;
related word extraction means for extracting a search phrase as a related word based on the conforming document designated by the conforming document designating means;
second related document search means for searching the document group stored in the storage means for documents that include the related word extracted by the related word extraction means, as related documents;
nonconforming document extracting means for extracting, from among the plurality of related documents retrieved by the second related document search means, a nonconforming document that the searcher does not seek, based on the conforming document designated by the conforming document designating means;
learning means for generating a classification parameter based on the conforming document designated by the conforming document designating means and the nonconforming document extracted by the nonconforming document extracting means;
pre-filtering means for verifying the validity of the classification parameter generated by the learning means;
third related document search means for performing a re-search of the document group stored in the storage means, using as a search word the classification parameter verified by the pre-filtering means;
classification means for classifying conforming documents from the re-search result of the third related document search means, based on the classification parameter whose validity has been verified by the pre-filtering means; and
second display control means for causing the display means to display a classification result of the classification means.

2. The document search device according to claim 1, wherein the pre-filtering means corrects the classification parameter so that the conforming document and the nonconforming document are each correctly classified as conforming or nonconforming by the classification parameter generated by the learning means.

3. The document search device according to claim 2, wherein, when the pre-filtering means detects an invalid classification parameter, the pre-filtering means deletes that classification parameter.

4. A document search method comprising:
a search word extraction step of extracting a search phrase as a search word in accordance with a searcher's input operation on operation means for accepting an operation;
a first related document search step of searching a document group, stored in storage means that stores a plurality of digitized documents as the document group, for documents that include the search word extracted in the search word extraction step, as related documents;
a first search result display step of displaying a search result of the first related document search step;
a conforming document designating step of designating, in response to the searcher's input operation on the operation means, a conforming document that the searcher seeks from among the plurality of related documents retrieved in the first related document search step;
a related word extraction step of extracting a search phrase as a related word based on the conforming document designated in the conforming document designating step;
a second related document search step of searching the document group stored in the storage means for documents that include the related word extracted in the related word extraction step, as related documents;
a nonconforming document extracting step of extracting, from among the plurality of related documents retrieved in the second related document search step, a nonconforming document that the searcher does not seek, based on the conforming document designated in the conforming document designating step;
a classification parameter generating step of generating a classification parameter based on the conforming document designated in the conforming document designating step and the nonconforming document extracted in the nonconforming document extracting step;
a pre-filtering step of verifying the validity of the classification parameter generated in the classification parameter generating step;
a third related document search step of performing a re-search of the document group stored in the storage means, using as a search word the classification parameter verified in the pre-filtering step;
a conforming document classification step of classifying conforming documents from the re-search result of the third related document search step, based on the classification parameter whose validity has been verified in the pre-filtering step; and
a second display step of displaying a classification result of the conforming document classification step.

5. The document search method according to claim 4, wherein the pre-filtering step corrects the classification parameter so that the conforming document and the nonconforming document are each correctly classified as conforming or nonconforming by the classification parameter generated in the classification parameter generating step.

6. The document search method according to claim 5, wherein, when an invalid classification parameter is detected in the pre-filtering step, that classification parameter is deleted.

7. A document search program that causes a computer to execute the document search method according to any one of claims 4 to 6.
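
For illustration only, the sequence of steps recited in claims 4 to 6 could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation disclosed in this publication. The claims do not say how classification parameters are represented, how nonconforming documents are chosen, or what makes a parameter invalid, so the sketch assumes per-word weights, a shared-word threshold, and zero-weight pruning respectively. The correction loop is a simple perceptron-style update swapped in for the unspecified correction method. All function names, the `max_shared` and `max_rounds` parameters, and the sample documents are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real word extraction)."""
    return text.lower().split()


def search(docs, words):
    """Return ids of documents that contain at least one of the given words."""
    words = set(words)
    return [doc_id for doc_id, text in docs.items() if words & set(tokenize(text))]


def extract_related_words(docs, conforming_ids, top_n=3):
    """Assumption: related words are the most frequent words of the conforming documents."""
    counts = Counter()
    for doc_id in conforming_ids:
        counts.update(set(tokenize(docs[doc_id])))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    return sorted(counts, key=lambda w: (-counts[w], w))[:top_n]


def extract_nonconforming(docs, hits, conforming_ids, max_shared=1):
    """Assumption: a hit sharing at most `max_shared` words with every conforming document is nonconforming."""
    result = []
    for doc_id in hits:
        if doc_id in conforming_ids:
            continue
        words = set(tokenize(docs[doc_id]))
        shared = max((len(words & set(tokenize(docs[c]))) for c in conforming_ids), default=0)
        if shared <= max_shared:
            result.append(doc_id)
    return result


def score(weights, text):
    """Sum the weights of the distinct words that occur in the text."""
    return sum(weights.get(word, 0.0) for word in set(tokenize(text)))


def learn(docs, conforming_ids, nonconforming_ids):
    """Assumption: a parameter is a word weight, +1 per conforming and -1 per nonconforming document."""
    weights = Counter()
    for doc_id in conforming_ids:
        for word in set(tokenize(docs[doc_id])):
            weights[word] += 1.0
    for doc_id in nonconforming_ids:
        for word in set(tokenize(docs[doc_id])):
            weights[word] -= 1.0
    return dict(weights)


def prefilter(weights, docs, conforming_ids, nonconforming_ids, max_rounds=20):
    """Verify the parameters on the training documents, correct them, and delete invalid ones."""
    weights = dict(weights)
    for _ in range(max_rounds):
        errors = 0
        for doc_id in list(conforming_ids) + list(nonconforming_ids):
            target = 1.0 if doc_id in conforming_ids else -1.0
            if target * score(weights, docs[doc_id]) <= 0:  # misclassified training document
                errors += 1
                for word in set(tokenize(docs[doc_id])):
                    weights[word] = weights.get(word, 0.0) + target
        if errors == 0:
            break
    # Assumption: a zero weight carries no information and is treated as invalid.
    return {word: w for word, w in weights.items() if w != 0.0}


def retrieve(docs, query, choose_conforming):
    """Run the claimed steps in order; `choose_conforming` stands in for the searcher."""
    first_hits = search(docs, tokenize(query))                # first related document search
    conforming = choose_conforming(first_hits)                # conforming document designation
    related_words = extract_related_words(docs, conforming)   # related word extraction
    second_hits = search(docs, related_words)                 # second related document search
    nonconforming = extract_nonconforming(docs, second_hits, conforming)
    weights = prefilter(learn(docs, conforming, nonconforming), docs, conforming, nonconforming)
    third_hits = search(docs, weights.keys())                 # re-search using the parameters
    return [d for d in third_hits if score(weights, docs[d]) > 0]  # classification


docs = {
    "d1": "python search engine tutorial",
    "d2": "python snake care guide",
    "d3": "search engine ranking algorithm",
    "d4": "snake habitat guide",
    "d5": "engine ranking tutorial",
}
result = retrieve(docs, "python", lambda hits: ["d1"])
print(result)  # ['d1', 'd3', 'd5']
```

In this toy run the searcher marks only d1 as conforming. The ambiguous word "python" ends up with zero weight and is deleted during pre-filtering, and the re-search then surfaces d3 and d5, which the original query did not match.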