JP2005108158A

JP2005108158A - Document retrieval apparatus and method, and program for same

Info

Publication number: JP2005108158A
Application number: JP2003344448A
Authority: JP
Inventors: Koji Yamada; 孝司山田
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 2003-10-02
Filing date: 2003-10-02
Publication date: 2005-04-21

Abstract

PROBLEM TO BE SOLVED: To provide a document retrieval apparatus that can retrieve documents that cannot be structured and that uses a new retrieval method different from conventional ones.

SOLUTION: Each of a plurality of representative retrieval target documents, extracted at a specified rate from a plurality of retrieval target documents, is divided into paragraphs, and the document structure of each representative document is classified. Retrieval condition data are also created from a sample document. Then the kind of document structure of an arbitrary retrieval target document among the plurality is detected; a paragraph for retrieval judgment in that document is specified on the basis of the position and kind of the paragraph stating the retrieval contents, both held in the retrieval condition data; full-text retrieval of the paragraph stating the retrieval contents is performed using a keyword held in the retrieval condition data; and whether to output the arbitrary document as a retrieval result is judged from the result of the full-text retrieval.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a document search apparatus, a document search method, and a program therefor, for searching a plurality of search target documents for documents similar to a sample document.

Various document search methods have been developed and proposed. One technique for reducing the failure in which search results the user does not want are output (hereinafter, search noise) applies to an apparatus that searches a large number of documents for the document a user wants. When the search targets are documents that can be structured by document elements such as chapters, sections, and paragraphs, the apparatus first accepts search conditions from the user, designates document elements within the documents based on those conditions, and outputs documents containing the designated elements as search results (see, for example, Patent Document 1).
Patent Document 1: JP-A-6-309368

However, the above technique cannot obtain search results when the search target documents cannot be structured, so it is limited to documents that can be structured. In addition, with the recent spread of computerization, large numbers of documents have been digitized and accumulated, and there is demand for a document search apparatus that can accurately find the document a user wants among them.
An object of the present invention is therefore to provide a document search apparatus that can search documents that cannot be structured and that uses a new search method not found in the prior art.

The present invention has been made to solve the above problem. It is a document search apparatus that searches a plurality of search target documents for search target documents similar to a sample document, comprising: first paragraph division means for dividing into paragraphs each of a plurality of representative search target documents extracted from the plurality of search target documents; document structure classification means for classifying the document structure of each representative search target document based on its paragraph division; document structure storage means for storing a plurality of pieces of the classified document structure information, by type of document structure; second paragraph division means for dividing the sample document into paragraphs; sample document structure detection means for detecting the document structure type of the sample document based on the result of its paragraph division and the document structure information stored in the document structure storage means; third paragraph division means for dividing into paragraphs an arbitrary search target document among the plurality of search target documents; search target document structure detection means for detecting the document structure type of the arbitrary search target document based on the result of the paragraph division by the third paragraph division means and the document structure information stored in the document structure storage means; and search target document determination means for treating the arbitrary search target document as a search target document similar to the sample document when its document structure type is the same as that of the sample document.

According to the present invention, the first paragraph division means divides into paragraphs each of the plurality of representative search target documents extracted from the plurality of search target documents, and the document structure classification means classifies the document structure of each representative search target document based on its paragraph division. The second paragraph division means divides the sample document into paragraphs, and the sample document structure detection means detects the document structure type of the sample document based on the result of that division and the document structure information stored in the document structure storage means. The third paragraph division means then divides an arbitrary search target document into paragraphs, the search target document structure detection means detects its document structure type based on the result of that division and the stored document structure information, and the search target document determination means treats the arbitrary search target document as similar to the sample document when its document structure type is the same as that of the sample document. Because the document structure of a sample document or a search target document is grasped by dividing it into paragraphs and determining paragraph types, the document structure can be identified for any kind of document. The invention also provides a document search apparatus based on a new technique: it outputs as a search result a search target document whose document structure matches that of the sample document and which contains a paragraph at the same paragraph position as the search content description paragraph in the sample document, that paragraph including a keyword found in the search content description paragraph. Moreover, preparing a plurality of sample documents yields a plurality of search condition data, so an accurate document search can be performed using multiple patterns of combinations of document structure, search content description paragraph, and keyword.

The present invention has been made to solve the above problem. It is a document search apparatus that searches a plurality of search target documents for search target documents similar to a sample document, comprising: first paragraph division means for dividing into paragraphs each of a plurality of representative search target documents extracted at a predetermined ratio from the plurality of search target documents; document structure classification means for classifying the document structure of each representative search target document based on its paragraph division; document structure storage means for storing a plurality of pieces of the classified document structure information, by type of document structure; second paragraph division means for dividing the sample document into paragraphs; sample document structure detection means for detecting the document structure type of the sample document based on the result of its paragraph division and the document structure information stored in the document structure storage means; search condition data creation means for creating search condition data that holds the paragraph position of a search content description paragraph designated by the user in the sample document, the keywords contained in the search content description paragraph, the paragraph type indicating the content of the search content description paragraph, and the document structure type of the sample document; search condition data storage means for storing the search condition data; third paragraph division means for dividing into paragraphs an arbitrary search target document among the plurality of search target documents; search target document structure detection means for detecting the document structure type of the arbitrary search target document based on the result of the paragraph division by the third paragraph division means and the document structure information stored in the document structure storage means; search condition data reading means for reading, when search condition data holding the document structure type of the arbitrary search target document is recorded in the search condition data storage means, that search condition data from the search condition data storage means; paragraph specifying means for specifying a search determination paragraph in the arbitrary search target document based on the position of the search content description paragraph and the paragraph type held in the read search condition data; and search means for performing a full-text search of the search content description paragraph using the keywords held in the search condition data read by the search condition data reading means, and determining, based on the result of the full-text search, whether to output the arbitrary search target document as a search result.

According to the present invention, each of a plurality of representative search target documents extracted at a predetermined ratio from the plurality of search target documents is divided into paragraphs, the document structure of each representative search target document is classified based on this division, and the document structure information is recorded in the document structure storage means for each type of document structure. Here, the ratio at which representative search target documents are extracted from the plurality of search target documents is high enough that, probabilistically, at least one document with content similar to that of any given one of the plurality of search target documents can be identified. Consequently, the document structure information recorded in the document structure storage means represents the document structure of any one of the plurality of search target documents. The sample document is then divided into paragraphs, and its document structure type is detected. Search condition data is also created that holds the paragraph position of the search content description paragraph designated by the user in the sample document, the keywords contained in that paragraph, the paragraph type indicating the content of that paragraph, and the document structure type of the sample document. During the search, an arbitrary search target document among the plurality of search target documents is divided into paragraphs, and its document structure type is detected. When search condition data holding that document structure type is recorded in the search condition data storage means, the search condition data is read from the search condition data storage means, and a search determination paragraph in the arbitrary search target document is specified based on the position of the search content description paragraph and the paragraph type held in the read search condition data. That is, when the search target document matches the document structure of the sample document, and the paragraph in the search target document at the same paragraph position as the sample document's search content description paragraph has the same paragraph type as that paragraph, that paragraph of the search target document is specified as the search determination paragraph. A full-text search of the search content description paragraph is then performed using the keywords held in the search condition data, and the search target document is output as a search result based on the result of the full-text search. Because the document structure of a sample document or a search target document is grasped by dividing it into paragraphs and determining paragraph types, the document structure can be identified for any kind of document. The invention also provides a document search apparatus based on a new technique: it outputs as a search result a search target document whose document structure matches that of the sample document and which contains a paragraph at the same paragraph position as the search content description paragraph in the sample document, that paragraph including a keyword found in the search content description paragraph. Moreover, preparing a plurality of sample documents yields a plurality of search condition data, so an accurate document search can be performed using multiple patterns of combinations of document structure, search content description paragraph, and keyword.
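The search flow described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the data layout (documents as lists of `(paragraph_type, text)` pairs, a `structure_numbers` table, dictionary-based search conditions) and all names are assumptions made for illustration. The text only says the output decision is "based on" the full-text search result, so the rule used here (at least one keyword occurs in the search determination paragraph) is likewise an assumption.

```python
def search(documents, structure_numbers, conditions):
    """Return the ids of documents that satisfy any search condition.

    documents: dict of doc_id -> list of (paragraph_type, text), already paragraph-divided
    structure_numbers: dict of tuple(paragraph types) -> structure number (hypothetical layout)
    conditions: list of dicts with keys structure_number, position, paragraph_type, keywords
    """
    results = []
    for doc_id, paragraphs in documents.items():
        # Detect the document structure type from the ordered paragraph types.
        number = structure_numbers.get(tuple(kind for kind, _ in paragraphs))
        if number is None:
            continue  # unknown structure: no search condition can apply
        for cond in conditions:
            if cond["structure_number"] != number:
                continue
            position = cond["position"]
            if position >= len(paragraphs):
                continue
            # Specify the search determination paragraph by position and type.
            kind, text = paragraphs[position]
            if kind != cond["paragraph_type"]:
                continue
            # Assumed decision rule: any keyword hit in that paragraph.
            if any(keyword in text for keyword in cond["keywords"]):
                results.append(doc_id)
                break
    return results


documents = {
    "a": [("title", "Quarterly report"), ("body", "Printer sales rose")],
    "b": [("title", "Memo"), ("body", "Holiday schedule")],
    "c": [("body", "Printer sales")],
}
structure_numbers = {("title", "body"): 1}
conditions = [
    {"structure_number": 1, "position": 1, "paragraph_type": "body",
     "keywords": ["Printer", "sales"]},
]
hits = search(documents, structure_numbers, conditions)
```

Document "c" contains the keywords but has a different structure, so it is skipped before any full-text search is attempted; this is the filtering effect the structure check is meant to provide.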

In the present invention, the keyword is a word of a predetermined part of speech obtained by morphological analysis of the text in the search content description paragraph. Keywords are thus created automatically once the user designates a search content description paragraph, so the user does not have to enter keywords one by one when performing a full-text search for search target documents containing a paragraph similar in content to the search content description paragraph, which reduces the user's effort.
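As a rough illustration of the keyword selection step, the sketch below filters tokens by part of speech. The morphological analysis itself is outside the sketch: the input is assumed to be a list of `(word, part_of_speech)` pairs already produced by some analyzer, and the choice of nouns as the "predetermined part of speech" is an assumption, since this passage does not name one. The function name `extract_keywords` is hypothetical.

```python
def extract_keywords(tagged_tokens, wanted_pos=("noun",)):
    """Keep words whose part of speech is in wanted_pos, without duplicates.

    tagged_tokens: iterable of (word, part_of_speech) pairs, assumed to come
    from a morphological analyzer that is not part of this sketch.
    """
    keywords = []
    for word, pos in tagged_tokens:
        if pos in wanted_pos and word not in keywords:
            keywords.append(word)  # preserve first-occurrence order
    return keywords


tagged = [
    ("printer", "noun"), ("prints", "verb"), ("the", "article"),
    ("document", "noun"), ("printer", "noun"),
]
keywords = extract_keywords(tagged)
```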

The present invention also determines, based on the result of a full-text search of the representative search target documents using the keywords in the search content description paragraph, keyword-containing paragraphs, that is, paragraphs of the representative search target documents that contain a keyword. It then compares the combination of a representative search target document's document structure type and the paragraph position of the keyword-containing paragraph within that document against the combination of document structure type and search content description paragraph position held in the search condition data stored in the search condition data storage means. Based on this comparison, representative search target documents that contain a keyword-containing paragraph are determined as candidates for a new sample document, and a document the user designates from among the candidates becomes a new sample document. If the document search apparatus creates search condition data for new sample documents in addition to the sample documents the user designated in advance, a more detailed search can be performed.

The present invention also determines the line type of each line in a document, either the sample document or a representative search target document, based on the content of that line, and determines the paragraph start positions in the document based on the line types. It then treats the lines from one paragraph start line up to the line before the next paragraph start line as one paragraph, and determines the paragraph type based on the line types of the lines in the paragraph. In this way the type of each divided paragraph can be determined.
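The line-type approach to paragraph division can be sketched as follows. The concrete line types (`heading`, `list_item`, `body`, `blank`), the regular expressions that detect them, the rule for paragraph starts, and the resulting paragraph types are all assumptions chosen for illustration; this passage does not define them.

```python
import re
from dataclasses import dataclass


def line_type(line: str) -> str:
    """Classify one line by its content (hypothetical rules)."""
    text = line.strip()
    if not text:
        return "blank"
    if re.match(r"^\d+[.)]\s", text):
        return "heading"
    if re.match(r"^[-*]\s", text):
        return "list_item"
    return "body"


@dataclass
class Paragraph:
    position: int  # paragraph position within the document, from 0
    kind: str      # paragraph type derived from the line types it contains
    lines: list


def split_paragraphs(text: str) -> list:
    lines = text.splitlines()
    types = [line_type(line) for line in lines]
    # Assumed rule: a paragraph starts at a heading, at the first line,
    # or at any non-blank line that follows a blank line.
    starts = [
        i for i, t in enumerate(types)
        if t != "blank" and (t == "heading" or i == 0 or types[i - 1] == "blank")
    ]
    paragraphs = []
    for n, start in enumerate(starts):
        # A paragraph runs up to the line before the next paragraph start.
        end = starts[n + 1] if n + 1 < len(starts) else len(lines)
        members = [(l, t) for l, t in zip(lines[start:end], types[start:end]) if t != "blank"]
        kinds = {t for _, t in members}
        if "heading" in kinds:
            kind = "section"
        elif kinds == {"list_item"}:
            kind = "list"
        else:
            kind = "text"
        paragraphs.append(Paragraph(n, kind, [l for l, _ in members]))
    return paragraphs


sample = "1. Scope\nThis covers A.\n\n- item one\n- item two\n\nClosing remark."
result = split_paragraphs(sample)
```

Because the division relies only on the content of each line, it needs no markup in the document, which is how the approach can handle documents that are not explicitly structured.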

The present invention is also a document search method for a document search apparatus that searches a plurality of search target documents for search target documents similar to a sample document, comprising: a first paragraph division step of dividing into paragraphs each of a plurality of representative search target documents extracted at a predetermined ratio from the plurality of search target documents; a document structure classification step of classifying the document structure of each representative search target document based on its paragraph division; a document structure recording step of recording the classified document structure information in document structure storage means for each type of document structure; a second paragraph division step of dividing the sample document into paragraphs; a sample document structure detection step of detecting the document structure type of the sample document based on the result of its paragraph division and the document structure information stored in the document structure storage means; a search condition data creation step of creating search condition data that holds the paragraph position of a search content description paragraph designated by the user in the sample document, the keywords contained in the search content description paragraph, the paragraph type indicating the content of the search content description paragraph, and the document structure type of the sample document; a search condition data recording step of recording the search condition data in search condition data storage means; a third paragraph division step of dividing into paragraphs an arbitrary search target document among the plurality of search target documents; a search target document structure detection step of detecting the document structure type of the arbitrary search target document based on the result of the paragraph division in the third paragraph division step and the document structure information stored in the document structure storage means; a search condition data reading step of reading, when search condition data holding the document structure type of the arbitrary search target document is recorded in the search condition data storage means, that search condition data from the search condition data storage means; a paragraph specifying step of specifying a search determination paragraph in the arbitrary search target document based on the position of the search content description paragraph and the paragraph type held in the read search condition data; and a search step of performing a full-text search of the search content description paragraph using the keywords held in the search condition data read in the search condition data reading step, and determining, based on the result of the full-text search, whether to output the arbitrary search target document as a search result.

The present invention is also a program to be executed by a computer of a document search apparatus that searches a plurality of search target documents for search target documents similar to a sample document, the program causing the computer to execute: a first paragraph division process of dividing into paragraphs each of a plurality of representative search target documents extracted at a predetermined ratio from the plurality of search target documents; a document structure classification process of classifying the document structure of each representative search target document based on its paragraph division; a document structure recording process of recording the classified document structure information in document structure storage means for each type of document structure; a second paragraph division process of dividing the sample document into paragraphs; a sample document structure detection process of detecting the document structure type of the sample document based on the result of its paragraph division and the document structure information stored in the document structure storage means; a search condition data creation process of creating search condition data that holds the paragraph position of a search content description paragraph designated by the user in the sample document, the keywords contained in the search content description paragraph, the paragraph type indicating the content of the search content description paragraph, and the document structure type of the sample document; a search condition data recording process of recording the search condition data in search condition data storage means; a third paragraph division process of dividing into paragraphs an arbitrary search target document among the plurality of search target documents; a search target document structure detection process of detecting the document structure type of the arbitrary search target document based on the result of the paragraph division in the third paragraph division process and the document structure information stored in the document structure storage means; a search condition data reading process of reading, when search condition data holding the document structure type of the arbitrary search target document is recorded in the search condition data storage means, that search condition data from the search condition data storage means; a paragraph specifying process of specifying a search determination paragraph in the arbitrary search target document based on the position of the search content description paragraph and the paragraph type held in the read search condition data; and a search process of performing a full-text search of the search content description paragraph using the keywords held in the search condition data read in the search condition data reading process, and determining, based on the result of the full-text search, whether to output the arbitrary search target document as a search result.

A document search apparatus according to an embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of the document search apparatus according to the embodiment. In the figure, reference numeral 1 denotes a document search apparatus that searches a large number of search target documents for the document a user wants. Reference numeral 2 denotes a search target document database that stores a large number of search target documents, for example 100,000 or 1,000,000. In this embodiment, the document search apparatus 1 searches the search target documents recorded in the search target document database 2 for the document the user wants. The document search apparatus 1 and the search target document database 2 are connected via a communication network.

In the document search apparatus 1, reference numeral 101 denotes a search condition learning document storage unit that stores search condition learning documents (representative search target documents). A search condition learning document is a document extracted at a predetermined ratio from the search target documents recorded in the search target document database 2. The ratio of documents extracted from the search target document database 2 is high enough that, probabilistically, the search condition learning document storage unit 101 holds at least one document whose content is similar to that of any given search target document recorded in the search target document database 2.
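The extraction of search condition learning documents can be sketched as a simple sampling step. The text specifies only that documents are extracted "at a predetermined ratio"; uniform random sampling, the function name `extract_learning_documents`, the `seed` parameter, and the 5% ratio in the example are assumptions made for illustration.

```python
import random


def extract_learning_documents(doc_ids, ratio, seed=0):
    """Pick a fraction of the search target documents as learning documents.

    Uniform random sampling is an assumption; the source only states
    that a predetermined ratio of documents is extracted.
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    doc_ids = list(doc_ids)
    if not doc_ids:
        return []
    count = max(1, round(len(doc_ids) * ratio))  # keep at least one document
    return random.Random(seed).sample(doc_ids, count)


all_ids = [f"doc{i:04d}" for i in range(1000)]
learning_ids = extract_learning_documents(all_ids, ratio=0.05)
```

A fixed seed keeps the selection reproducible between runs, which is convenient when the stored document structures need to be regenerated.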

Reference numeral 102 denotes a sample document storage unit that stores sample documents prepared by the user. A sample document is a document containing part of the content of the documents the user wants to find, and the user designates the specific range of the sample document in which that content appears. The document search apparatus 1 then searches the search target document database 2, by the method described later, for search target documents containing content similar to the text in the range the user designated. In this embodiment, the range the user designates in a sample document is a paragraph of the sample document.

Reference numeral 103 denotes a paragraph analysis unit (first paragraph division means, second paragraph division means, third paragraph division means, line type determination means, paragraph head position determination means, and paragraph determination means), which analyzes the search condition learning documents, the sample documents, and the search target documents and divides them into paragraphs. Reference numeral 104 denotes a document structure classification unit (document structure classification means, sample document structure detection means, search target document structure detection means, and paragraph type determination means), which classifies the document structure of each search condition learning document based on the result of the paragraph division performed by the paragraph analysis unit 103. The document structure classification unit 104 records information on each distinct type of document structure in a document structure storage unit (document structure storage means) 105. A document structure is represented by the number of paragraphs in one document, the paragraph identifier of each paragraph, and the paragraph type indicating the kind of content of each paragraph. The document structure information holds the paragraph identifier and the paragraph type of each paragraph in one document.

Reference numeral 106 denotes a search condition data creation unit (search condition data creation means), which creates search condition data holding the position of the paragraph designated by the user in a sample document (hereinafter referred to as the search content description paragraph), the keywords contained in the search content description paragraph, the document structure number indicating the document structure of the sample document, the paragraph type of the search content description paragraph, and the identification information of the sample document. The search condition data is used when searching the large number of search target documents recorded in the search target document database 2 for the documents desired by the user. Reference numeral 107 denotes a search condition data storage unit (search condition data storage means) that stores the search condition data.

Reference numeral 108 denotes a keyword inclusion paragraph determination unit (keyword inclusion paragraph determination means), which performs a full-text search of the search condition learning documents using the keywords held in the search condition data (that is, the keywords in the search content description paragraph) and determines each paragraph of a search condition learning document that contains a keyword to be a keyword inclusion paragraph. Reference numeral 109 denotes a new sample document candidate determination unit (new sample document candidate determination means), which determines whether a search condition learning document holding a keyword inclusion paragraph is to be a candidate for a new sample document. Reference numeral 110 denotes a new sample document candidate storage unit that stores the data of the search condition learning documents that have become new sample document candidates. Reference numeral 111 denotes a new sample document designation unit that, based on the user's instruction, records the document designated by the user, from among the search condition learning documents held as new sample document candidates in the new sample document candidate storage unit 110, in the sample document storage unit 102 as a new sample document.

Reference numeral 112 denotes a search target document reading unit that reads, one at a time, the large number of search target documents stored in the search target document database 2. Reference numeral 113 denotes a search processing unit (search condition data reading means, paragraph specifying means, search means, and search target document determination means), which searches for the search target documents desired by the user based on the document structure of the search target document read by the search target document reading unit 112, the search condition data recorded in the search condition data storage unit 107, and the document structure information recorded in the document structure storage unit 105. Reference numeral 114 denotes a search result output unit that outputs the content of the search target documents obtained as the search result to a monitor or the like.

Using the processing units described above, the document search apparatus 1 searches for the documents desired by the user through four broad processing flows. The first is a processing flow that classifies the document structures of the search condition learning documents and records information on the different types of document structure in the document structure storage unit 105. The second is a processing flow that creates search condition data based on the sample documents and the document structure information. The third is a processing flow that determines candidates for new sample documents and makes only those candidates designated by the user into new sample documents. The fourth is a processing flow that searches the search target document database 2 for the search target documents desired by the user. Through these flows, the search result desired by the user is obtained.

FIG. 2 is a diagram showing examples of search condition learning documents. The figure shows three documents: search condition learning document a, search condition learning document b, and search condition learning document c. The search condition learning document storage unit 101 records enough search condition learning documents that, for any search target document recorded in the search target document database 2, it can be expected, probabilistically, to hold at least one document with similar content.

FIG. 3 is a diagram showing sample documents. Two sample documents, sample document A and sample document B, are used in this embodiment. For convenience of explanation, documents with the same content as search condition learning documents a and b, respectively, are used as the sample documents. Sample documents A and B are recorded in the sample document storage unit 102.
In this embodiment, the search target documents, the search condition learning documents, and the sample documents are assumed to be documents composed of lines, each of which is one of the following: a normal sentence line, a blank line, a word line with a symbol, or a word line without a symbol. Here, a normal sentence is a sentence that contains a predicate or punctuation marks. A word line with a symbol is a line that begins with a symbol such as a numeral or a letter of the alphabet and whose remaining text consists only of nominals. A word line without a symbol is a line that has no symbol such as a numeral or a letter of the alphabet at its beginning and whose text consists only of nominals.
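The four line types can be illustrated with a rough classifier. This is only a sketch: the description does not say how a predicate is detected (that would normally need morphological analysis), so Japanese sentence punctuation is used here as a stand-in, and the label strings and the function name `classify_line` are hypothetical.

```python
import re

# Hypothetical labels for the four line types named in the description.
NORMAL, BLANK, SYMBOL_WORD, PLAIN_WORD = "normal", "blank", "symbol_word", "plain_word"

# Assumption: Japanese sentence punctuation approximates "contains a predicate
# or punctuation marks"; a real system would use morphological analysis.
_PUNCTUATION = re.compile(r"[。、]")
# Assumption: a leading "symbol" is a half-width or full-width digit or letter.
_LEADING_SYMBOL = re.compile(r"^[0-9０-９A-Za-zＡ-Ｚａ-ｚ]")


def classify_line(line: str) -> str:
    """Return the line type of a single line of text."""
    text = line.strip()
    if not text:
        return BLANK
    if _PUNCTUATION.search(text):
        return NORMAL
    if _LEADING_SYMBOL.match(text):
        return SYMBOL_WORD
    return PLAIN_WORD


print(classify_line("1 発明の背景"))
print(classify_line("従来、辞書が使用されている。"))
```

The order of the checks matters: a line such as a numbered heading followed by a full sentence is treated as a normal sentence because punctuation is tested first, which is a design choice of this sketch rather than something the description specifies.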

FIG. 4 is a diagram showing document structure information. As shown in the figure, the document structure storage unit 105 stores, as document structure information, a document structure number in association with the numbers of the paragraphs that make up the document structure indicated by that number and the type of each paragraph. This document structure data is created by the document structure classification unit 104 and recorded in the document structure storage unit 105.

FIG. 5 is a diagram showing an example of search condition data. As shown in the figure, the search condition data storage unit 107 records a search condition data number, a sample document number, a document structure number, the position of the search content description paragraph, the paragraph type of the search content description paragraph, and keywords in association with one another. The search condition data is created by the search condition data creation unit 106 and recorded in the search condition data storage unit 107.
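As a rough illustration of the two tables (document structure information and search condition data), the records could be modeled as follows. The field names, the label strings, and the container types are assumptions made for this sketch; the values follow what the embodiment states for sample document A, except the paragraph position, which is assumed here.

```python
from dataclasses import dataclass

# Document structure table (cf. FIG. 4): structure number -> one
# (paragraph identifier, paragraph type) pair per paragraph.
# The label strings are illustrative stand-ins for the values in the figure.
document_structures = {
    1: [("number", "normal")] * 6 + [("none", "normal"), ("none", "plain_word")],
}


@dataclass(frozen=True)
class SearchCondition:
    """One row of search condition data (cf. FIG. 5); field names are hypothetical."""
    number: int
    sample_document_id: str
    structure_number: int
    paragraph_position: str  # "upper", "middle", or "lower"
    paragraph_type: str
    keywords: tuple


condition = SearchCondition(
    number=1,
    sample_document_id="A",
    structure_number=1,
    paragraph_position="upper",  # assumed: third paragraph of eight
    paragraph_type="normal",
    keywords=("従来", "辞書", "使用頻度"),  # conventional, dictionary, usage frequency
)

# The structure number in a condition refers to a row of the structure table.
print(document_structures[condition.structure_number][2])
```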

Next, the processing flow by which the document search apparatus 1 classifies the document structures of the search condition learning documents and records the document structure information in the document structure storage unit 105 will be described. FIG. 6 is a diagram showing the processing flow for classifying document structures and recording the document structure information.
First, the paragraph analysis unit 103 reads one of the search condition learning documents recorded in the search condition learning document storage unit 101 (step S101). Assume that the document read here is search condition learning document a. Next, the paragraph analysis unit 103 takes out the topmost line of search condition learning document a (step S102). The paragraph analysis unit 103 then analyzes which line type the line taken out in step S102 belongs to, namely a normal sentence line, a blank line, a word line with a symbol, or a word line without a symbol (step S103), and stores the result, for example, in memory. Next, the paragraph analysis unit 103 checks whether search condition learning document a has a line following the line just processed (step S104). If there is a next line, the processing of steps S102 and S103 is repeated for it, and this is done for all lines in search condition learning document a.

FIG. 7 is a diagram showing the data held in the memory of the paragraph analysis unit. In this figure, the line number indicates the number of a line in search condition learning document a, and each entry records a line number, the line type obtained by the analysis, and the number of times that line type repeats. For example, the fourth line of search condition learning document a is a normal sentence, and the normal sentence repeats twice (the fifth line is also a normal sentence). In this case, no line type is recorded for the fifth line; instead, the repetition count of the fourth line is changed from 1 to 2. In this way, the paragraph analysis unit 103 analyzes the line types of all lines of the search condition learning document.
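The table described for FIG. 7 is effectively a run-length encoding of the per-line types. A minimal sketch, assuming the line types are already available as a list of label strings (the labels and the function name `run_length_table` are hypothetical):

```python
from itertools import groupby


def run_length_table(line_types):
    """Collapse consecutive equal line types.

    Returns a list of (start_line_number, line_type, repeat_count) tuples,
    with line numbers starting at 1, mirroring the table described for FIG. 7.
    """
    table = []
    line_number = 1
    for line_type, group in groupby(line_types):
        count = len(list(group))
        table.append((line_number, line_type, count))
        line_number += count
    return table


# Lines 4 and 5 are both normal sentences, so only line 4 gets an entry,
# with a repetition count of 2.
types = ["symbol_word", "normal", "symbol_word", "normal", "normal"]
for entry in run_length_table(types):
    print(entry)
```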

Next, the paragraph analysis unit 103 takes out the line type and repetition count corresponding to line number 1 from the data recorded in memory (FIG. 7) (step S105). Based on the information for line number 1 taken out of the in-memory data shown in FIG. 7, it determines whether the line with line number 1 corresponds to a paragraph head position (step S106). A line is judged to correspond to a paragraph head position if the information for its line number matches any of the following: (1) a word line with a symbol whose repetition count is 1; (2) a word line without a symbol whose repetition count is 1; (3) a blank line whose next line is a normal sentence or a word line with a repetition count of 2 or more; or (4) a word line without a symbol whose repetition count is 2 or more. The paragraph analysis unit 103 then checks whether information for other line numbers is recorded in memory (step S107) and repeats steps S105 and S106, so that every line number recorded in memory is judged as to whether its line corresponds to a paragraph head position.
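The four head-position rules can be sketched as a single pass over the run-length table. The label strings and the function name are hypothetical, and the line layout used in the example is an assumed reconstruction of search condition learning document a that is consistent with the description, not a copy of FIG. 2. The first entry is treated as a head unconditionally, as the embodiment states for line number 1.

```python
def paragraph_head_lines(table):
    """table: list of (line_number, line_type, repeat_count) entries.

    Returns the line numbers judged to be paragraph head positions.
    """
    heads = []
    for i, (line_number, line_type, count) in enumerate(table):
        following = table[i + 1] if i + 1 < len(table) else None
        is_head = (
            i == 0  # line number 1 is always treated as a head
            or (line_type == "symbol_word" and count == 1)  # rule (1)
            or (line_type == "plain_word" and count == 1)   # rule (2)
            or (line_type == "blank" and following is not None and (  # rule (3)
                following[1] == "normal"
                or (following[1] in ("symbol_word", "plain_word")
                    and following[2] >= 2)))
            or (line_type == "plain_word" and count >= 2)   # rule (4)
        )
        if is_head:
            heads.append(line_number)
    return heads


# Assumed layout of document a (20 lines), already run-length encoded.
table_a = [
    (1, "symbol_word", 1), (2, "normal", 1), (3, "symbol_word", 1),
    (4, "normal", 2), (6, "symbol_word", 1), (7, "normal", 1),
    (8, "symbol_word", 1), (9, "normal", 2), (11, "symbol_word", 1),
    (12, "normal", 2), (14, "symbol_word", 1), (15, "normal", 1),
    (16, "plain_word", 1), (17, "normal", 1), (18, "plain_word", 3),
]
print(paragraph_head_lines(table_a))
```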

Here, the line numbers in search condition learning document a that correspond to (1) are 1, 3, 6, 8, 11, and 14. The line number that corresponds to (2) is 16. No line number in search condition learning document a corresponds to (3). The line number that corresponds to (4) is 18. In addition, the paragraph analysis unit 103 always judges the data for line number 1, taken out of the data shown in FIG. 7, to be data indicating a line that corresponds to a paragraph head position. In this way, the paragraph analysis unit 103 determines which lines of search condition learning document a correspond to paragraph head positions. It then determines the paragraphs of search condition learning document a by treating the span from one paragraph head line to the line just before the next paragraph head line as one paragraph (step S108). The paragraph analysis unit 103 thus sets lines 1 to 2 as the first paragraph, lines 3 to 5 as the second paragraph, lines 6 to 7 as the third paragraph, lines 8 to 10 as the fourth paragraph, lines 11 to 13 as the fifth paragraph, lines 14 to 15 as the sixth paragraph, lines 16 to 17 as the seventh paragraph, and lines 18 to 20 as the eighth paragraph. It outputs each paragraph number, the line numbers belonging to each paragraph, and the line type corresponding to each line number to the document structure classification unit 104 as the result of the paragraph analysis of search condition learning document a.
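Step S108 amounts to cutting the document at each head line. A minimal sketch, assuming the head line numbers and the total number of lines (20 for document a, as implied by the last paragraph ending at line 20) are given; the function name is hypothetical:

```python
def split_paragraphs(head_lines, last_line):
    """Return (first_line, last_line) pairs, 1-indexed and inclusive.

    Each paragraph runs from a head line to the line just before the next
    head line; the final paragraph runs to the end of the document.
    """
    bounds = []
    for i, start in enumerate(head_lines):
        end = head_lines[i + 1] - 1 if i + 1 < len(head_lines) else last_line
        bounds.append((start, end))
    return bounds


paragraphs = split_paragraphs([1, 3, 6, 8, 11, 14, 16, 18], last_line=20)
for number, (first, last) in enumerate(paragraphs, start=1):
    print(f"paragraph {number}: lines {first}-{last}")
```

With the head lines given for document a, this yields the eight paragraphs listed in the description.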

Next, the document structure classification unit 104 determines the paragraph type of each paragraph based on the paragraph numbers of search condition learning document a, the line numbers belonging to each paragraph, and the line types corresponding to those line numbers (step S109). To determine the paragraph type, the document structure classification unit 104 checks, in the order normal sentence, word line without a symbol, word line with a symbol, and blank line, whether any line in the paragraph has that line type, and takes the first line type that matches as the paragraph type. For example, the first paragraph of search condition learning document a consists of the first and second lines, and since the line type of the second line is a normal sentence, the paragraph type of the first paragraph is normal sentence. In this way, the document structure classification unit 104 determines the paragraph type of each paragraph of search condition learning document a.
The document structure classification unit 104 also generates a paragraph identifier based on the line at the paragraph head position. In this embodiment, the paragraph identifier is information indicating whether the first character of the paragraph head line is a numeral. For example, the paragraph identifier of the first paragraph of search condition learning document a is "number", and that of the seventh paragraph is "none". The paragraph identifier may instead be, for example, a paragraph number or a paragraph title (the first word of the line at the head position).
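The priority rule for paragraph types and the numeral-based paragraph identifier can be sketched as two small functions. The label strings and function names are hypothetical, and the example head-line texts are invented for illustration.

```python
# Order in which line types are checked when deciding the paragraph type.
PRIORITY = ("normal", "plain_word", "symbol_word", "blank")


def paragraph_type(line_types):
    """Return the first type in PRIORITY that occurs among the paragraph's lines."""
    for candidate in PRIORITY:
        if candidate in line_types:
            return candidate
    raise ValueError("paragraph contains no known line type")


def paragraph_identifier(head_line_text):
    """Return "number" if the head line starts with a numeral, else "none"."""
    first_character = head_line_text.strip()[:1]
    return "number" if first_character.isdigit() else "none"


# First paragraph of document a: a word line with a symbol, then a normal sentence.
print(paragraph_type(["symbol_word", "normal"]))
print(paragraph_identifier("1 Background"))  # invented head-line text
```

`str.isdigit` also accepts full-width numerals such as "１", which suits Japanese headings.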

The document structure classification unit 104 then associates the paragraph identifier of each paragraph of search condition learning document a with the paragraph type of that paragraph and records them in the document structure storage unit 105 as the information for document structure number 1 (step S110). The information recorded in step S110 is the information corresponding to document structure number 1 shown in FIG. 4. After step S110, the document structure classification unit 104 notifies the paragraph analysis unit 103 that recording of the document structure classification for search condition learning document a is complete. The paragraph analysis unit 103 then checks whether any other search condition learning document is recorded in the search condition learning document storage unit 101 (step S111), and if so, starts the processing from step S101 for that document. The processing of steps S101 to S110 is thus performed for all search condition learning documents recorded in the search condition learning document storage unit 101. The processing of steps S101 to S111 described above is performed before the user issues a search start instruction, so that data for a plurality of document structures is recorded in the document structure storage unit 105. In FIG. 4, the information corresponding to document structure number 2 indicates the document structure of search condition learning document b.

Next, the processing by which the document search apparatus 1 creates search condition data based on the sample documents and the document structure information will be described. FIG. 8 is a diagram showing the flow of the search condition data creation processing.
First, when the user inputs a search start instruction to the document search apparatus 1, the paragraph analysis unit 103 reads one of the sample documents recorded in the sample document storage unit 102 (step S201). Assume that the sample document read by the paragraph analysis unit 103 in step S201 is sample document A. Next, the paragraph analysis unit 103 determines the paragraphs of sample document A using the same method as in steps S102 to S108 described above (step S202). The document structure classification unit 104 then determines the paragraph type of each paragraph of sample document A using the same method as in step S109 described above (step S203). The document structure classification unit 104 notifies the search condition data creation unit 106 of the paragraph identifier and paragraph type of each paragraph of sample document A (step S204). Sample document A has eight paragraphs, and their paragraph types are as follows. First paragraph: normal sentence; second paragraph: normal sentence; third paragraph: normal sentence; fourth paragraph: normal sentence; fifth paragraph: normal sentence; sixth paragraph: normal sentence; seventh paragraph: normal sentence; eighth paragraph: word without a symbol.

Next, from the paragraph identifier and paragraph type of each paragraph of sample document A, the search condition data creation unit 106 determines that the document structure of sample document A is as follows: the first through sixth paragraphs each have the paragraph identifier "number" and the paragraph type "normal sentence"; the seventh paragraph has the paragraph identifier "none" and the paragraph type "normal sentence"; and the eighth paragraph has the paragraph identifier "none" and the paragraph type "word without a symbol". It then reads, from the document structure storage unit 105 (FIG. 4), the document structure number of the document structure information that matches the document structure of sample document A (step S205). Here, the document structure number of sample document A is "1". Next, the search condition data creation unit 106 accepts from the user the designation of the paragraph of sample document A in which the search content is described. Assume that the paragraph designated by the user is the third paragraph of sample document A. The paragraph designated by the user is set as the search content description paragraph of sample document A (step S206). Next, the search condition data creation unit 106 performs morphological analysis on the text of the search content description paragraph and determines the words belonging to a preset part of speech to be the search keywords (step S207). In this embodiment, the part of speech of the words used as keywords is the noun. The search condition data creation unit 106 takes the words "従来" (conventional), "辞書" (dictionary), and "使用頻度" (usage frequency) in the third paragraph of sample document A as the keywords.
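The lookup in step S205 can be sketched as an equality comparison between the structure detected for the sample document and each stored structure. The table layout, label strings, and function name are assumptions for this sketch, and the entry for structure number 2 is a made-up placeholder, since its contents are not given in this part of the description.

```python
def find_structure_number(structures, layout):
    """Return the number of the stored structure equal to `layout`, or None.

    structures: {structure_number: [(paragraph_identifier, paragraph_type), ...]}
    layout:     [(paragraph_identifier, paragraph_type), ...] for one document
    """
    for number, stored in structures.items():
        if stored == layout:
            return number
    return None


structures = {
    # Structure 1 as described for document a / sample document A.
    1: [("number", "normal")] * 6 + [("none", "normal"), ("none", "plain_word")],
    # Placeholder only: the real structure 2 is not spelled out in the text.
    2: [("none", "normal")] * 3,
}

sample_a = [("number", "normal")] * 6 + [("none", "normal"), ("none", "plain_word")]
print(find_structure_number(structures, sample_a))
```

Exact equality means the number of paragraphs, the identifiers, and the types must all agree, which matches how the description defines a document structure.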

Next, the search condition data creation unit 106 detects the paragraph position of the search content description paragraph within sample document A (step S208). To do so, the search condition data creation unit 106 first divides the number of paragraphs of sample document A by 3. If the number divides evenly, sample document A is split into an upper part, a middle part, and a lower part, each containing that number of paragraphs. If the quotient has a fractional part, it is rounded to the nearest integer at the first decimal place; that number of paragraphs from the top forms the upper part, the same number of following paragraphs forms the middle part, and the remaining paragraphs form the lower part. In this way, the search condition data creation unit 106 detects whether the search content description paragraph lies in the upper, middle, or lower part of sample document A.
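The position rule of step S208 can be sketched as follows. The function name and the "upper"/"middle"/"lower" strings are hypothetical. Integer arithmetic is used for the rounding because Python's built-in `round` rounds halves to the nearest even number, whereas the description calls for ordinary rounding.

```python
def paragraph_position(paragraph_index, paragraph_count):
    """Classify a 1-based paragraph index as "upper", "middle", or "lower".

    The part size is paragraph_count / 3 rounded half up; the upper and middle
    parts each get that many paragraphs and the lower part gets the rest.
    """
    # floor(n / 3 + 1/2) written in integers: (2n + 3) // 6
    part_size = (2 * paragraph_count + 3) // 6
    if paragraph_index <= part_size:
        return "upper"
    if paragraph_index <= 2 * part_size:
        return "middle"
    return "lower"


# Eight paragraphs: 8 / 3 = 2.67 rounds to 3, so paragraphs 1-3 are upper,
# 4-6 are middle, and 7-8 are lower.
for index in (3, 4, 7):
    print(index, paragraph_position(index, 8))
```

Under this rule the third paragraph of an eight-paragraph document such as sample document A falls in the upper part.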

The search condition data creation unit 106 then assigns an arbitrary symbol "A" as the identification information of sample document A. It combines the identification information of sample document A, the document structure number indicating the document structure of sample document A, the position of the search content description paragraph in sample document A, the paragraph type of the search content description paragraph, and the keywords obtained from the search content description paragraph into one item of search condition data, and records it in the search condition data storage unit 107 in association with the data number "1" (step S209). After step S209, the search condition data creation unit 106 notifies the paragraph analysis unit 103 that creation of the search condition data for sample document A is complete. The paragraph analysis unit 103 then checks whether any other sample document is recorded in the sample document storage unit 102 (step S210), and if so, starts the processing from step S201 onward for that document. The processing of steps S201 to S209 is thus performed for all sample documents recorded in the sample document storage unit 102.
The search condition data is created by the processing of steps S201 to S209 described above.

Next, the processing flow by which the document search apparatus 1 determines candidates for new sample documents and makes only the candidates designated by the user into new sample documents will be described. FIG. 9 is a diagram showing the processing flow for creating new sample documents.
When creation of the search condition data described above has been completed for all sample documents recorded in the sample document storage unit 102, the search condition data creation unit 106 instructs the keyword inclusion paragraph determination unit 108 to start the processing for determining new sample document candidates. The keyword inclusion paragraph determination unit 108 then reads an arbitrary search condition learning document from the search condition learning document storage unit 101 (step S301). Let the document read in step S301 be search condition learning document d. Next, the keyword inclusion paragraph determination unit 108 reads one item of search condition data recorded in the search condition data storage unit 107 (step S302). Assume that the search condition data read in step S302 is, of the search condition data shown in FIG. 5, the item with search condition data number "1", which was created using sample document "A". The keyword inclusion paragraph determination unit 108 then performs a full-text search of search condition learning document d, read in step S301, based on the keywords contained in the search condition data read in step S302 (step S303).

The full-text search method may take various forms: for example, it may check whether the search condition learning document contains a word that matches a keyword, or it may detect the frequency with which a keyword appears in the search condition learning document. The keyword inclusion paragraph determination unit 108 uses a known full-text search technique to search the search condition learning document with the keywords. A full-text search technique is disclosed, for example, in Japanese Patent Laid-Open No. 8-44771. As a result of the full-text search, the keyword inclusion paragraph determination unit 108 stores the lines of text in the search condition learning document d that contain a keyword. Next, using the same processing as in steps S102 to S109 described above, it determines each paragraph of the search condition learning document d and the paragraph type of each paragraph, thereby detecting the document structure of the search condition learning document d (step S304). The keyword inclusion paragraph determination unit 108 also determines, as a keyword inclusion paragraph, each paragraph of the search condition learning document d that contains a line found by the full-text search to include a keyword (step S305). The keyword inclusion paragraph determination unit 108 then notifies the new sample document candidate determination unit 109 of the search condition learning document d read in step S301, the keyword inclusion paragraph in that document, the document structure information of the search condition learning document d, and the document structure number held in the search condition data read in step S302.
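Steps S303 to S305 can be illustrated with a minimal sketch. It assumes that a document is a list of text lines, that paragraphs have already been determined as line ranges, and that the full-text search is a plain substring test; the names `Paragraph`, `keyword_lines`, and `keyword_paragraphs` and the paragraph type labels are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    """A paragraph as a half-open range of line indices [start, end)."""
    start: int
    end: int
    kind: str  # paragraph type label (hypothetical values)


def keyword_lines(lines, keywords):
    """Return indices of lines containing at least one keyword (cf. step S303)."""
    return [i for i, line in enumerate(lines) if any(k in line for k in keywords)]


def keyword_paragraphs(paragraphs, hit_lines):
    """Return indices of paragraphs that contain a hit line (cf. step S305)."""
    hits = set(hit_lines)
    return [
        idx for idx, p in enumerate(paragraphs)
        if any(i in hits for i in range(p.start, p.end))
    ]


# Hypothetical learning document d and its already-detected paragraphs.
lines = ["Report", "Printer jam at tray 2", "", "Cause", "Worn roller",
         "Action", "Replaced roller"]
paragraphs = [Paragraph(0, 2, "title"), Paragraph(3, 5, "body"), Paragraph(5, 7, "body")]

hit_lines = keyword_lines(lines, ["roller"])
hit_paragraphs = keyword_paragraphs(paragraphs, hit_lines)
```

Keeping the hit lines separate from the paragraph lookup mirrors the order in the text, where the matching lines are stored first and the paragraphs are determined afterwards.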

Next, the new sample document candidate determination unit 109 uses the document structure information of the search condition learning document d and the keyword inclusion paragraph in that document, both notified by the keyword inclusion paragraph determination unit 108, to detect the paragraph position (upper, middle, or lower) of the keyword inclusion paragraph (step S306). This detection is the same as the processing in step S208 described above. The new sample document candidate determination unit 109 then searches the search condition data storage unit 107 for search condition data that holds both the document structure number indicating the document structure of the search condition learning document d and the paragraph position of the keyword inclusion paragraph in that document (step S307). If no such search condition data is recorded in the search condition data storage unit 107, the new sample document candidate determination unit 109 records the search condition learning document d containing the keyword inclusion paragraph in the new sample document candidate storage unit 110 as a new sample document candidate (step S308). In this way, among the search condition learning documents recorded in the search condition learning document storage unit 101, those holding content similar to the content of the search content description paragraph of the sample document prepared by the user are nominated as new sample document candidates and recorded in the new sample document candidate storage unit 110.
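The check in steps S306 to S308 can be sketched as follows. The rule used in step S208 to decide the paragraph position is not restated in this passage, so the sketch assumes the simplest possibility, splitting the document into thirds by paragraph count. The function names and the dictionary keys `structure_no` and `position` are likewise hypothetical.

```python
def paragraph_position(index, total):
    """Map a 0-based paragraph index to 'upper', 'middle' or 'lower'.

    Assumption: the document is split into thirds by paragraph count.
    The actual rule of step S208 may differ.
    """
    if total <= 0 or not 0 <= index < total:
        raise ValueError("paragraph index out of range")
    return ("upper", "middle", "lower")[index * 3 // total]


def is_new_sample_candidate(structure_no, position, condition_data):
    """True when no stored search condition data holds this pair (cf. S307, S308)."""
    known = {(c["structure_no"], c["position"]) for c in condition_data}
    return (structure_no, position) not in known


# Hypothetical stored search condition data.
condition_data = [{"structure_no": 1, "position": "upper"}]

# A keyword inclusion paragraph found as paragraph 5 of 6 in a document of structure 2.
position = paragraph_position(5, 6)
candidate = is_new_sample_candidate(2, position, condition_data)
```

A learning document becomes a candidate only when its combination of document structure and paragraph position is not yet covered, which is how the candidate list widens the set of search conditions rather than duplicating existing ones.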

If, in step S307, search condition data holding the document structure number indicating the document structure of the search condition learning document d and the paragraph position of the keyword inclusion paragraph in that document is recorded in the search condition data storage unit 107, the new sample document candidate determination unit 109 notifies the keyword inclusion paragraph determination unit 108 to process another item of search condition data. The keyword inclusion paragraph determination unit 108 then checks whether the search condition data storage unit 107 contains search condition data other than that read in step S302 (step S309). If other search condition data is recorded, the processing from step S302 to step S306 is repeated for the search condition learning document d using that search condition data.

Next, when recording of the data of the search condition learning document that is a new sample document candidate in the search condition learning document storage unit 101 is finished in step S307, the keyword inclusion paragraph determination unit 108 is instructed to process the next search condition learning document. The keyword inclusion paragraph determination unit 108 then checks whether another search condition learning document is recorded in the search condition learning document storage unit 101 (step S310) and, if so, begins the processing from step S301 onward for that document. The processing of steps S301 to S308 is thus performed for all the documents recorded in the search condition learning document storage unit 101.

Next, the new sample document designating unit 111 displays the new sample document candidates recorded in the new sample document candidate storage unit 110 on, for example, a monitor, and accepts from the user a selection of the documents to be used as new sample documents. The new sample document designating unit 111 registers the new sample documents designated by the user in the sample document storage unit 102. The document search apparatus 1 then creates search condition data from each new sample document by the processing of steps S201 to S209. After search condition data has been created for the new sample documents by steps S201 to S209, the search condition data creation unit 106 does not instruct the keyword inclusion paragraph determination unit 108 to start determining new sample document candidates, and the process ends.

Next, the processing flow for retrieving the search target documents desired by the user from the search condition database will be described. FIG. 10 is a diagram showing the processing flow when searching the search target documents.

When the search condition data creation unit 106 has created search condition data for all the sample documents and new sample documents, it instructs the search target document reading unit 112 to perform the main search process. The search target document reading unit 112 then reads the data of one search target document from the search target document database 2 via the communication network (step S401) and transfers it to the paragraph analysis unit 103. Next, using the same processing as in steps S102 to S109 described above, the paragraph analysis unit 103 determines each paragraph of the search target document and the paragraph type of each paragraph, thereby detecting the document structure of the search target document (step S402). The paragraph analysis unit 103 then notifies the search processing unit 113 of the document structure information detected in step S402.

Next, the search processing unit 113 reads, from the document structure storage unit 105 (FIG. 4), the document structure number of the document structure information that matches the document structure information of the search target document (step S403). Here, the document structure number indicating the document structure of the search target document is assumed to be “1”. Next, the search processing unit 113 reads, from the search condition data storage unit 107, the search condition data holding the document structure number read in step S403 (step S404). In the same manner as step S208 described above, the search processing unit 113 also detects which paragraph position (upper, middle, or lower) each paragraph of the search target document belongs to. It then extracts each paragraph of the search target document whose paragraph position is the same as the paragraph position held in the search condition data read in step S404 (step S405). The search processing unit 113 further checks whether the paragraph type of each paragraph extracted in step S405 is the same as the paragraph type held in the search condition data read in step S404 (step S406). If the paragraph types are found to be the same in step S406, the search processing unit 113 determines that paragraph of the search target document to be a search determination paragraph (step S407).
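The selection in steps S404 to S407 amounts to two filters, one on the document structure number and one on the paragraph position and type. The following is a minimal sketch under the assumption that search condition data is a list of dictionaries and that each paragraph of the search target document has already been reduced to a (position, type) pair; all names, keys, and type labels are hypothetical.

```python
def conditions_for_structure(structure_no, condition_data):
    """Return search condition records holding the given structure number (cf. S404)."""
    return [c for c in condition_data if c["structure_no"] == structure_no]


def search_determination_paragraphs(paragraphs, condition):
    """Return indices of paragraphs whose position and type both match (cf. S405 to S407).

    paragraphs: list of (position, kind) tuples in document order, where
    position is 'upper', 'middle' or 'lower'.
    """
    return [
        i for i, (position, kind) in enumerate(paragraphs)
        if position == condition["position"] and kind == condition["kind"]
    ]


# Hypothetical stored conditions and a four-paragraph search target document.
condition_data = [
    {"structure_no": 1, "position": "middle", "kind": "body", "keywords": ["roller"]},
    {"structure_no": 2, "position": "upper", "kind": "title", "keywords": ["invoice"]},
]
target = [("upper", "title"), ("middle", "body"), ("middle", "table"), ("lower", "body")]

selected = [
    search_determination_paragraphs(target, cond)
    for cond in conditions_for_structure(1, condition_data)
]
```

An empty result for a condition corresponds to the case in which the apparatus moves on to the next search target document.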

Next, the search processing unit 113 performs a full-text search of the search determination paragraph in the search target document using the keywords held in the search condition data read in step S404 (step S408). As noted in the description of step S303, a conventional full-text search technique is used. Through this search, the search processing unit 113 checks, for example, whether a word in the text of the search determination paragraph matches a keyword, or whether the paragraph contains text in which a keyword appears with high frequency (step S409). The processing in step S409 depends on the full-text search technique used. When the full-text search shows, for example, that a word in the text of the search determination paragraph matches a keyword, the search processing unit 113 transfers the search target document containing that search determination paragraph to the search result output unit 114 as a search result (step S410), and the search result output unit 114 outputs the transferred search target document as a search result to, for example, a monitor provided in the document search apparatus 1 (step S411).
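The judgment in steps S408 and S409 can be sketched with a simple occurrence count. The specification leaves the full-text search technique open, so this sketch substitutes plain substring counting; the function names and the `min_hits` threshold are assumptions introduced only for illustration.

```python
def count_keyword_hits(text, keywords):
    """Count non-overlapping occurrences of every keyword in the paragraph text."""
    return sum(text.count(k) for k in keywords)


def should_output(paragraph_text, keywords, min_hits=1):
    """Decide whether the document holding this paragraph is a search result.

    min_hits=1 corresponds to a plain "keyword appears" test; a larger value
    stands in for a frequency-based test. The threshold is an assumption.
    """
    return count_keyword_hits(paragraph_text, keywords) >= min_hits


# Hypothetical search determination paragraph and keyword.
paragraph_text = "Worn roller; replaced roller"
hits = count_keyword_hits(paragraph_text, ["roller"])
is_result = should_output(paragraph_text, ["roller"])
```

Substring matching is used here because it works without word segmentation; a word-level comparison would need the text to be tokenized first, for example by morphological analysis.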

If, in step S406, the paragraph type of the paragraph of the search target document extracted in step S405 is not the same as the paragraph type held in the search condition data read in step S404, the search processing unit 113 notifies the search target document reading unit 112 to read another search target document from the search target document database 2. Likewise, if, in step S409, no word in the text of the search determination paragraph matches a keyword, for example, the search processing unit 113 notifies the search target document reading unit 112 to read another search target document from the search target document database 2. After a search result has been output in step S410, the search processing unit 113 also notifies the search target document reading unit 112 to read another search target document from the search target document database 2. In this way, the document search apparatus 1 performs the search process on all of the large number of search target documents recorded in the search target document database 2. As a result, the user can obtain search target documents having the desired content.

Note that a program for realizing the functions of the processing units in FIG. 1 may be recorded on a computer-readable recording medium, and the document search apparatus 1 may perform each of the processes described above by having a computer system read and execute the program recorded on that medium. The “computer system” here includes an OS and hardware such as peripheral devices. The “computer system” also includes a WWW system provided with a homepage providing environment (or display environment). The “computer-readable recording medium” refers to portable media such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, and to storage devices such as a hard disk built into a computer system. The “computer-readable recording medium” further includes media that hold a program for a certain period of time, such as the volatile memory (RAM) inside a computer system that serves as a server or client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.

The program may also be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium, or by transmission waves in a transmission medium. Here, the “transmission medium” that transmits the program refers to a medium having the function of transmitting information, such as a network (communication network) like the Internet or a communication line such as a telephone line. The program may be one that realizes only part of the functions described above. It may also be a so-called difference file (difference program), which realizes the functions described above in combination with a program already recorded in the computer system.

A block diagram showing the configuration of the document search apparatus according to an embodiment of the present invention.
A diagram showing an example of a search condition learning document.
A diagram showing a sample document.
A diagram showing document structure information.
A diagram showing an example of search condition data.
A diagram showing the processing flow for classifying document structures and recording their information.
A diagram showing the data held in the memory of the paragraph analysis unit.
A diagram showing the flow of the search condition data creation process.
A diagram showing the processing flow for creating new sample documents.
A diagram showing the processing flow when searching the search target documents.

Explanation of symbols

1 ... document search apparatus, 2 ... search target document database, 101 ... search condition learning document storage unit, 102 ... sample document storage unit, 103 ... paragraph analysis unit, 104 ... document structure analysis unit, 105 ... document structure storage unit, 106 ... search condition data creation unit, 107 ... search condition data storage unit, 108 ... keyword inclusion paragraph determination unit, 109 ... new sample document determination unit, 110 ... new sample document candidate storage unit, 111 ... new sample document designating unit, 112 ... search target document reading unit, 113 ... search processing unit, 114 ... search result output unit

Claims

A document search apparatus for searching a plurality of search target documents for a search target document similar to a sample document, the apparatus comprising:
First paragraph dividing means for dividing into paragraphs each of a plurality of representative search target documents extracted from the plurality of search target documents;
Document structure classification means for classifying the document structure of each of the representative search target documents based on the paragraph division of each of the representative search target documents;
Document structure storage means for storing information on the classified document structures for each type of document structure;
Second paragraph dividing means for dividing the sample document into paragraphs;
Sample document structure detection means for detecting the document structure type of the sample document based on the result of the paragraph division of the sample document and the document structure information stored in the document structure storage means;
Third paragraph dividing means for dividing into paragraphs an arbitrary search target document among the plurality of search target documents;
Search target document structure detection means for detecting the document structure type of the arbitrary search target document based on the result of the paragraph division by the third paragraph dividing means and the document structure information stored in the document structure storage means; and
Search target document determining means for determining the arbitrary search target document having the same document structure type as the sample document to be the search target document similar to the sample document.

A document search apparatus for searching a plurality of search target documents for a search target document similar to a sample document, the apparatus comprising:
First paragraph dividing means for dividing into paragraphs each of a plurality of representative search target documents extracted from the plurality of search target documents at a predetermined ratio;
Document structure classification means for classifying the document structure of each of the representative search target documents based on the paragraph division of each of the representative search target documents;
Document structure storage means for storing information on the classified document structures for each type of document structure;
Second paragraph dividing means for dividing the sample document into paragraphs;
Sample document structure detection means for detecting the document structure type of the sample document based on the result of the paragraph division of the sample document and the document structure information stored in the document structure storage means;
Search condition data creating means for creating search condition data that holds the paragraph position of a search content description paragraph specified by the user in the sample document, a keyword included in the search content description paragraph, a paragraph type indicating the description content of the search content description paragraph, and the document structure type of the sample document;
Search condition data storage means for storing the search condition data;
Third paragraph dividing means for dividing into paragraphs an arbitrary search target document among the plurality of search target documents;
Search target document structure detection means for detecting the document structure type of the arbitrary search target document based on the result of the paragraph division by the third paragraph dividing means and the document structure information stored in the document structure storage means;
Search condition data reading means for reading, from the search condition data storage means, the search condition data holding the document structure type of the arbitrary search target document when such search condition data is recorded in the search condition data storage means;
Paragraph specifying means for specifying a search determination paragraph in the arbitrary search target document based on the paragraph position and the paragraph type of the search content description paragraph held in the read search condition data; and
Search means for performing a full-text search of the search content description paragraph using the keyword held in the search condition data read by the search condition data reading means, and determining, based on the result of the full-text search, whether to output the arbitrary search target document as a search result.

3. The document search apparatus according to claim 2, wherein the keyword is a word having a predetermined part of speech obtained by morphological analysis of a sentence in the search content description paragraph.

The document search apparatus according to claim 2, further comprising:
Keyword inclusion paragraph determining means for determining, in the representative search target document, a keyword inclusion paragraph in which the keyword is included, based on the result of a full-text search of the representative search target document using the keyword of the search content description paragraph; and
New sample document candidate determining means for determining the representative search target document holding the keyword inclusion paragraph to be a new sample document candidate, based on a comparison between the combination of the document structure type of the representative search target document and the paragraph position of the keyword inclusion paragraph in the representative search target document, and the combination of the document structure type and the paragraph position of the search content description paragraph held in the search condition data stored in the search condition data recording means.

5. The document search apparatus according to claim 2, further comprising:
Line type determining means for determining the line type of each line in either the sample document or the representative search target document based on the description content of the line;
Paragraph head position determining means for determining a paragraph head position in either the sample document or the representative search target document based on the line type;
Paragraph determining means for determining, in the sample document or the representative search target document, the lines from the line at a paragraph head position to the line immediately before the line at the next paragraph head position to be one paragraph; and
Paragraph type determining means for determining the paragraph type based on the line types of the lines included in the paragraph.

A document search method for a document search apparatus that searches a plurality of search target documents for a search target document similar to a sample document, the method comprising:
A first paragraph dividing step of dividing into paragraphs each of a plurality of representative search target documents extracted from the plurality of search target documents at a predetermined ratio;
A document structure classification step of classifying the document structure of each of the representative search target documents based on the paragraph division of each of the representative search target documents;
A document structure recording step of recording the classified document structure information in a document structure storage unit for each type of document structure;
A second paragraph dividing step of dividing the sample document into paragraphs;
A sample document structure detection step of detecting the document structure type of the sample document based on the result of the paragraph division of the sample document and the document structure information stored in the document structure storage unit;
A search condition data creation step of creating search condition data that holds the paragraph position of a search content description paragraph specified by the user in the sample document, a keyword included in the search content description paragraph, a paragraph type indicating the description content of the search content description paragraph, and the document structure type of the sample document;
A search condition data recording step of recording the search condition data in search condition data storage means;
A third paragraph dividing step of dividing into paragraphs an arbitrary search target document among the plurality of search target documents;
A search target document structure detection step of detecting the document structure type of the arbitrary search target document based on the result of the paragraph division in the third paragraph dividing step and the document structure information stored in the document structure storage unit;
A search condition data reading step of reading, from the search condition data storage means, the search condition data holding the document structure type of the arbitrary search target document when such search condition data is recorded in the search condition data storage means;
A paragraph specifying step of specifying a search determination paragraph in the arbitrary search target document based on the paragraph position and the paragraph type of the search content description paragraph held in the read search condition data; and
A search step of performing a full-text search of the search content description paragraph using the keyword held in the search condition data read in the search condition data reading step, and determining, based on the result of the full-text search, whether to output the arbitrary search target document as a search result.

A program for causing a computer of a document search apparatus, which searches a plurality of search target documents for a search target document similar to a sample document, to execute:
First paragraph dividing processing for dividing into paragraphs each of a plurality of representative search target documents extracted from the plurality of search target documents at a predetermined ratio;
Document structure classification processing for classifying the document structure of each of the representative search target documents based on the paragraph division of each of the representative search target documents;
Document structure recording processing for recording the classified document structure information in a document structure storage unit for each type of document structure;
Second paragraph dividing processing for dividing the sample document into paragraphs;
Sample document structure detection processing for detecting the document structure type of the sample document based on the result of the paragraph division of the sample document and the document structure information stored in the document structure storage unit;
Search condition data creation processing for creating search condition data that holds the paragraph position of a search content description paragraph specified by the user in the sample document, a keyword included in the search content description paragraph, a paragraph type indicating the description content of the search content description paragraph, and the document structure type of the sample document;
Search condition data recording processing for recording the search condition data in search condition data storage means;
Third paragraph dividing processing for dividing into paragraphs an arbitrary search target document among the plurality of search target documents;
Search target document structure detection processing for detecting the document structure type of the arbitrary search target document based on the result of the paragraph division by the third paragraph dividing processing and the document structure information stored in the document structure storage unit;
Search condition data reading processing for reading, from the search condition data storage means, the search condition data holding the document structure type of the arbitrary search target document when such search condition data is recorded in the search condition data storage means;
Paragraph specifying processing for specifying a search determination paragraph in the arbitrary search target document based on the paragraph position and the paragraph type of the search content description paragraph held in the read search condition data; and
Search processing for performing a full-text search of the search content description paragraph using the keyword held in the search condition data read in the search condition data reading processing, and determining, based on the result of the full-text search, whether to output the arbitrary search target document as a search result.