JP2005196668A - Query answering device and query answering system

Info

Publication number: JP2005196668A
Application number: JP2004004491A
Authority: JP
Inventors: Yoshitaka Hamaguchi; 佳孝濱口
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2004-01-09
Filing date: 2004-01-09
Publication date: 2005-07-21

Abstract

PROBLEM TO BE SOLVED: To obtain a proper answer without impairing processing efficiency.
SOLUTION: A query answering device comprises, at a server 20 side, a searched document processing part 40 for analyzing a searched document and creating document index information for searching the searched document and word information in which a document identifier, an attribute and an appearance position are associated with each answer word used for answering the query; a search processing part 50 for acquiring a search word that is the key to the query and acquiring the identifier of the searched document fitted to the search word in reference to the word information; and an answer processing part 60 for extracting an answer word in reference to the word information on the basis of the document identifier, the search word and the kind of query. A session management part 1300, which manages the document identifier and the search word from the search processing part at the server side in correspondence with the query and outputs the search word and the document identifier to the answer extracting part, is mounted at a client 30 side.
COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a question answering device and a question answering system that obtain an answer to a question sentence from information held in advance.

Patent Document 1 discloses a question answering system that obtains an answer to a question sentence from information held in advance and returns it as a response.
According to the invention of Patent Document 1, three processes are performed to obtain an answer that matches the question: a question analysis process that morphologically analyzes the question sentence to obtain the search words that are key to the question and the question type; a document search process that searches a document set held in advance (the searched documents) for documents containing the search words; and an answer extraction process that, in the documents found by the search, assigns a score to each word matching the question type based on the search words and appearance frequency, and determines the answer word from those scores.

As an example, suppose that one of the searched documents contains the text "燃料電池について" ("About fuel cells") and that the question sentence "燃料を使う電池は何？" ("What is the battery that uses fuel?") is input.
In the question analysis process, the question sentence is morphologically analyzed and the words delimited by the analysis are acquired. That is, "燃料を使う電池は何？" is analyzed and words such as "燃料" ("fuel") and "電池" ("battery") are acquired.

The document search process uses each acquired word as a search word and checks whether the search word is contained in the searched text. To do so, the process morphologically analyzes the searched text "燃料電池について". The word "燃料電池" ("fuel cell") in the searched document is thereby split into "燃料" and "電池", and because these split words match the words acquired as search words, the searched document corresponding to the question sentence is identified.

In the answer extraction process, a score is assigned to each word of the identified searched document based on the question type, the search words, and the appearance frequency, and candidate answer words for the question sentence are extracted based on the assigned scores.
Patent Document 1: Japanese Patent Laid-Open No. 2002-132811

For the searched document containing "燃料電池について", however, a morphological analysis that keeps "燃料電池" ("fuel cell") together as a single word reflects the content of the text better than one that splits it into "燃料" ("fuel") and "電池" ("battery"). Indeed, the answer to the question sentence obtained from this searched document should be "燃料電池" ("fuel cell").

To solve this problem, the morphological analysis of the searched documents could be performed separately for the document search process and for the answer extraction process, each with a method suited to that process. In a client-server model, however, performing session management on the client side would then require the analysis results of each morphological analysis method to be transferred between the client and the server. Because the volume of this data grows in proportion to the number of documents and their length, the increased data traffic may reduce processing efficiency.

In view of the above problem, an object of the present invention is to provide a question answering device and a question answering system that obtain an appropriate answer without reducing processing efficiency.

To solve the above problem, the present invention adopts the following configurations.
<Configuration 1>
The question answering device of the present invention comprises a searched document processing unit that accepts a searched document and analyzes its morphemes, a search processing unit that acquires from a question sentence the search words that are key to the question and acquires documents matching the search words, and an answer processing unit that assigns a score to each word of the acquired documents and extracts, based on the scores, the word that answers the question sentence. The device is characterized as follows. The searched document processing unit comprises a first document analysis unit that generates document index information in which each word obtained by analyzing the morphemes of the accepted searched document is associated with the document identifier of that document, and a second document analysis unit that, based on the words obtained by the first document analysis unit, generates as appropriate words that can be the answer target of a question sentence, replaces the source words with the generated words, treats the resulting words as answer words, and generates word information in which each answer word is associated with its attribute, its appearance position in the searched document, and the document identifier of that document. The search processing unit comprises a question analysis unit that, based on the words obtained by analyzing the morphemes of the question sentence, generates as appropriate words that can be the answer target of the question sentence, adds the generated words to the words obtained from the morphological analysis, and outputs these words as search words, and a document search unit that refers to the document index information and acquires the document identifiers of documents matching the search words. The answer processing unit comprises an answer extraction unit that refers to the word information and, based on the document identifiers from the document search unit and the search words from the question analysis unit, extracts the word that answers the question sentence from the words associated with those document identifiers.

<Configuration 2>
The question analysis unit in Configuration 1 comprises a question sentence morphological analysis processing unit that applies to the question sentence the same morphological analysis as that of the first document analysis unit, a search word post-processing unit that combines the words acquired by the question sentence morphological analysis to generate compound words that can be the answer target of the question sentence and that also resolves notational variation in the text, and a search word addition processing unit that adds the compound words generated by the search word post-processing unit to the words acquired by the question sentence morphological analysis processing unit.

<Configuration 3>
The second document analysis unit in Configuration 1 comprises a word post-processing unit that, based on the words obtained by the first document analysis unit, generates compound words that can be the answer target of a question sentence and also resolves notational variation in the text, and a word replacement processing unit that replaces the source words with the compound word generated from them by the word post-processing unit.

<Configuration 4>
The question answering system of the present invention is a system in which a network is provided between a client and a server. The server comprises a searched document processing unit that accepts a searched document and analyzes its morphemes, a search processing unit that obtains from a question sentence the search words that are key to the question and acquires documents matching the search words, and an answer processing unit that assigns a score to each word of the acquired documents and extracts, based on the scores, the word that answers the question sentence. The client comprises an input unit that acquires a question sentence from a user and an output unit that outputs the answer to the question sentence. The system is characterized as follows. The searched document processing unit comprises a first document analysis unit that generates document index information in which each word obtained by analyzing the morphemes of the accepted searched document is associated with the document identifier of that document, and a second document analysis unit that, based on the words obtained by the first document analysis unit, generates as appropriate words that can be the answer target of a question sentence, replaces the source words with the generated words, treats the resulting words as answer words, and generates word information in which each answer word is associated with its attribute, its appearance position in the searched document, and the document identifier of that document. The search processing unit comprises a question analysis unit that, based on the words obtained by analyzing the morphemes of the question sentence, generates as appropriate words that can be the answer target of the question sentence, adds the generated words to the words obtained from the morphological analysis, and outputs these words as search words, and a document search unit that refers to the document index information and acquires the document identifiers of documents matching the search words. The answer processing unit comprises an answer extraction unit that refers to the word information and, based on the document identifiers from the document search unit and the search words from the question analysis unit, extracts the word that answers the question sentence from the words associated with those document identifiers. The client comprises a session management unit that manages the search words from the question analysis unit and the document identifiers from the document search unit in association with the question sentence acquired by the input unit, and outputs the search words and document identifiers to the answer extraction unit.

<Configuration 5>
The session management unit according to Configuration 4 displays the searched documents associated with the acquired document identifiers, has the user select the desired searched documents, and outputs the document identifiers associated with the selected searched documents to the answer extraction unit.

<Configuration 6>
The question analysis unit according to Configuration 4 comprises a question sentence morphological analysis processing unit that applies to the question sentence the same morphological analysis as that of the first document analysis unit, a search word post-processing unit that combines the words acquired by the question sentence morphological analysis to generate compound words that can be the answer target of the question sentence and that also resolves notational variation in the text, and a search word addition processing unit that adds the compound words generated by the search word post-processing unit to the words acquired by the question sentence morphological analysis processing unit.

<Configuration 7>
The second document analysis unit according to Configuration 4 comprises a word post-processing unit that, based on the words obtained by the first document analysis unit, generates compound words that can be the answer target of a question sentence and also resolves notational variation in the text, and a word replacement processing unit that replaces the source words with the compound word generated from them by the word post-processing unit.

According to the question answering device of the present invention, the morphological analysis performed to acquire the search words that are key to the question sentence and the morphological analysis of the searched documents use the same method. Based on the words obtained from the morphological analysis of a searched document, words that can answer a question sentence are generated and substituted for their source words to give answer words, and an answer word that matches the search words is extracted as the answer to the question sentence. Because the analysis for answer word extraction reuses the result of the analysis already performed for document search, there is no need to analyze the searched document a second time with a different method after the documents matching the search words have been found, so an appropriate answer can be obtained efficiently.

Furthermore, according to the question answering system of the present invention, the server comprises a searched document processing unit that analyzes the searched documents and creates document index information for searching them and word information in which each answer word used for answering a question sentence is associated with a document identifier, an attribute, and an appearance position; a search processing unit that acquires the search words that are key to the question sentence and acquires, by referring to the word information, the document identifiers of searched documents matching the search words; and an answer processing unit that extracts answer words by referring to the word information based on the document identifiers, the search words, and the question type. The client comprises a session management unit that manages the document identifiers and search words from the server-side search processing unit in association with the question sentence and outputs them to the answer extraction unit. With this arrangement, even though separate morphological analysis approaches are used for acquiring document identifiers and for extracting answer words, the positions of the search words within each searched document, which are part of the search processing unit's results, need not be transferred between the client and the server. Only the document identifiers and search words are communicated, so an appropriate answer can be obtained efficiently without increasing data traffic.

An embodiment of the present invention is described in detail below with reference to the drawings, taking a question answering system with a client-server configuration as an example.

As shown in FIG. 1, the question answering system 10 of the present invention includes a server group 20 and a client 30. The server group 20 and the client 30 are connected to a network (not shown) that serves as a transmission path, and data communication between them takes place over this network.

The server group 20 includes a searched document processing unit 40, a search processing unit 50, and an answer processing unit 60.
The searched document processing unit 40 includes a document input unit 200 that accepts searched documents 100, a first document analysis unit 300 that analyzes the accepted documents, and a second document analysis unit 400 that, based on the analysis results of the first document analysis unit 300, performs analysis for word editing.
The search processing unit 50 includes a document index information storage unit 500 that holds, as document index information, the words acquired by morphological analysis in the first document analysis unit 300 in association with the appearance frequency of each word and an index (document identifier) for retrieving the searched document; a question analysis unit 700 that analyzes a question sentence from the client 30 and acquires the words (word group) that are key to the question as search words; and a document search unit 800 that refers to the document index information storage unit 500 and acquires the document identifiers associated with the words corresponding to the search words.

The answer processing unit 60 includes a word information storage unit 600 that holds word information, generated from the analysis results of the second document analysis unit 400, in which the document identifier of a searched document is associated with an answer word (described later), its attribute, and its appearance position; and an answer extraction unit 900 that refers to the word information in the word information storage unit 600 and, based on the document identifiers and search words from the client, extracts an answer word as the answer to the question sentence.
The client 30 includes an input unit 1200 that accepts a question sentence 1100; a session management unit 1300 that manages exchanges with the server group 20 based on the question sentence accepted by the input unit 1200, the search words obtained by the question analysis unit 700, and the document identifiers; and an output unit 1400 that presents the answer from the answer extraction unit 900 to the user.
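The client-side bookkeeping performed by the session management unit 1300 can be pictured with a small sketch. This is only an illustration under stated assumptions: the class names `Session` and `SessionManager`, the method names, and the use of the question text as the lookup key are all hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Search words and document identifiers kept for one question (hypothetical layout)."""
    question: str
    search_words: list = field(default_factory=list)
    doc_ids: list = field(default_factory=list)


class SessionManager:
    """Hypothetical stand-in for the session management unit 1300."""

    def __init__(self):
        self._sessions = {}

    def record(self, question, search_words, doc_ids):
        # Keep the server-side results in association with the question sentence.
        self._sessions[question] = Session(question, list(search_words), list(doc_ids))

    def answer_request(self, question, selected_doc_ids=None):
        # Only search words and document identifiers are sent on to answer extraction.
        session = self._sessions[question]
        doc_ids = session.doc_ids
        if selected_doc_ids is not None:
            doc_ids = [d for d in doc_ids if d in selected_doc_ids]
        return {"search_words": session.search_words, "doc_ids": doc_ids}


manager = SessionManager()
manager.record("燃料を使う電池は何？", ["燃料", "電池", "燃料電池"], ["D1", "D2"])
print(manager.answer_request("燃料を使う電池は何？"))
print(manager.answer_request("燃料を使う電池は何？", selected_doc_ids={"D1"}))
```

The optional `selected_doc_ids` argument mirrors the case in which the user narrows the candidate documents before answer extraction.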

Each unit is described in detail below.
The searched documents 100 accepted by the document input unit 200 are, for example, documents (document groups) such as newspapers, papers, patent documents, and pages on the Web (World Wide Web). When a searched document 100 contains the information (words) that answers the question sentence 1100, the question answering system 10 can obtain an accurate answer.

When the searched document 100 is a file held on disk, the document input unit 200 reads the file designated as the searched document. When the searched document 100 is published on a Web page or similar, the document input unit 200 accesses the designated URL (Uniform Resource Locator) by means such as HTTP (Hypertext Transfer Protocol) and acquires the searched document 100. When the searched document 100 is printed on a medium such as paper, the document input unit 200 acquires the printed content using a function equivalent to OCR (Optical Character Reader).

The first document analysis unit 300 includes a morphological analysis processing unit 310, which analyzes the morphemes of the searched document 100 acquired by the document input unit 200 and acquires either all the words obtained by the analysis or only the words of specific parts of speech. Each acquired word is a word that can be searched for by the document search unit 800 described later. The morphological analysis processing unit 310 associates each word with a document identifier, which distinguishes among the multiple searched documents 100, and with the word's appearance frequency in that document. The associated information is held in the document index information storage unit 500 as document index information.
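As a rough illustration of the document index information, the sketch below builds a mapping from each word to the document identifiers in which it appears, together with its appearance frequency. The function name `build_document_index`, the identifiers `D1` and `D2`, and the pre-segmented word lists are assumptions made for the example; the patent does not specify a storage layout.

```python
from collections import Counter, defaultdict


def build_document_index(tokenized_docs):
    """Map each word to {document identifier: appearance frequency}."""
    index = defaultdict(dict)
    for doc_id, words in tokenized_docs.items():
        for word, freq in Counter(words).items():
            index[word][doc_id] = freq
    return dict(index)


# Hypothetical output of the morphological analysis, one word list per document.
tokenized_docs = {
    "D1": ["燃料", "電池", "について"],
    "D2": ["電池", "の", "寿命", "と", "電池", "の", "価格"],
}

document_index = build_document_index(tokenized_docs)
print(document_index["電池"])  # {'D1': 1, 'D2': 2}
```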

On receiving the text morphologically analyzed by the first document analysis unit 300, the second document analysis unit 400 generates, as needed, words that can serve as appropriate answers from the question answering system 10. It then associates each answer word, that is, each generated word or word obtained by the morphological analysis of the first document analysis unit 300, with a document identifier, an attribute, its appearance position in the document, and so on. The associated information is held in the word information storage unit 600 as word information.

The second document analysis unit 400 includes a word post-processing unit 410 and a word replacement processing unit 420. On receiving the document segmented into morphemes by the morphological analysis processing unit 310, the word post-processing unit 410 generates a single combined word wherever dependency relations, character-type boundaries, or similar cues indicate that several words are appropriately treated as one. For example, if morphological analysis splits the text into the words "○×電気" ("○× Electric"), "工業" ("Industry"), and "株式会社" ("Co., Ltd."), "株式会社" is likely to be a prefix or suffix of a company name. The unit therefore combines the words before or after "株式会社" that fall within the range that can form a company name, generating a word such as "○×電気工業株式会社" or "株式会社○×電気工業".
The word post-processing unit 410 also performs normalization to resolve differences known as notational variants, such as "コンピュータ" and "コンピューター" (both meaning "computer"), by unifying them to "コンピュータ".
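The compounding and normalization steps can be sketched as follows. This is a deliberately simplified stand-in: it merges any run of two or more adjacent nouns, whereas the description relies on dependency relations and character-type boundaries. The part-of-speech tags, the `NORMALIZE` table, and the function names are hypothetical.

```python
# Hypothetical table of notational variants mapped to a canonical spelling.
NORMALIZE = {"コンピューター": "コンピュータ"}


def find_compounds(tagged_words):
    """Return (start, end, text) for every run of two or more adjacent nouns."""
    compounds = []
    start = None
    # A sentinel entry closes a noun run that reaches the end of the input.
    for i, (word, pos) in enumerate(tagged_words + [("", "end")]):
        if pos == "noun":
            if start is None:
                start = i
        else:
            if start is not None and i - start >= 2:
                text = "".join(w for w, _ in tagged_words[start:i])
                compounds.append((start, i, text))
            start = None
    return compounds


def normalize(word):
    """Unify a notational variant to its canonical spelling, if one is registered."""
    return NORMALIZE.get(word, word)


tagged_words = [
    ("○×電気", "noun"),
    ("工業", "noun"),
    ("株式会社", "noun"),
    ("の", "particle"),
    ("コンピューター", "noun"),
]

print(find_compounds(tagged_words))  # [(0, 3, '○×電気工業株式会社')]
print(normalize("コンピューター"))  # コンピュータ
```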

The word replacement processing unit 420 replaces the source words with the word that the word post-processing unit 410 created from them. In the example above, it replaces the words "○×電気", "工業", and "株式会社" with the generated "○×電気工業株式会社" or "株式会社○×電気工業", taking care that no word is duplicated. The word replacement processing unit 420 treats the replaced words, together with the words that needed no replacement, as answer words, and associates each answer word with its attribute, appearance position, and document identifier to generate word information. The generated word information is held in the word information storage unit 600.
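A minimal sketch of the replacement step and the resulting word information is given below. It assumes that compound words arrive as `(start, end, text)` spans over the segmented word list, that the attribute is simply the part-of-speech tag (or the label `compound`), and that the appearance position is the index of the first source word. None of these representations is specified in the description; they are chosen only to make the example concrete.

```python
def build_word_info(doc_id, tagged_words, compounds):
    """Replace source words with their compound and emit one record per answer word."""
    span_at = {start: (end, text) for start, end, text in compounds}
    records = []
    i = 0
    while i < len(tagged_words):
        if i in span_at:
            # The compound takes the place of all its source words, so none is duplicated.
            end, text = span_at[i]
            records.append(
                {"word": text, "attribute": "compound", "position": i, "doc_id": doc_id}
            )
            i = end
        else:
            word, pos = tagged_words[i]
            records.append(
                {"word": word, "attribute": pos, "position": i, "doc_id": doc_id}
            )
            i += 1
    return records


tagged_words = [("燃料", "noun"), ("電池", "noun"), ("について", "particle")]
compounds = [(0, 2, "燃料電池")]

word_info = build_word_info("D1", tagged_words, compounds)
for record in word_info:
    print(record)
```

Here the answer words for the document are "燃料電池" and "について"; the source words "燃料" and "電池" no longer appear on their own.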

The question analysis unit 700 extracts, as search words, the words that are key to the question from the user's question sentence 1100 accepted by the input unit 1200. These search words are words (a word group) of the kind contained in both the document index information of the document index information storage unit 500 and the word information of the word information storage unit 600. The question analysis unit 700 may be implemented on either the server group 20 side or the client 30 side. This embodiment describes an example in which it is implemented on the server group 20 side.

The question analysis unit 700 includes a question sentence morphological analysis processing unit 710, a search word post-processing unit 720, and a search word addition processing unit 730. The question sentence morphological analysis processing unit 710 analyzes the morphemes of the question sentence 1100 acquired by the input unit 1200 and acquires either all the words obtained by the analysis or only the words of specific parts of speech. That is, the question sentence morphological analysis processing unit 710 has the same function as the morphological analysis processing unit 310 described above.

On receiving the question sentence segmented into morphemes by the question sentence morphological analysis processing unit 710, the search word post-processing unit 720 generates a single combined word wherever dependency relations, character-type boundaries, or similar cues indicate that several words are appropriately treated as one. That is, the search word post-processing unit 720 has the same function as the word post-processing unit 410 described above.

The search word addition processing unit 730 adds the words generated by the search word post-processing unit 720 to the words from which they were created. For example, when the search word post-processing unit 720 generates "燃料電池" ("fuel cell") from "燃料" ("fuel") and "電池" ("battery"), the search word addition processing unit 730 adds "燃料電池" to "燃料" and "電池". It then outputs the word group "燃料", "電池", and "燃料電池" as the search words to the document search unit 800 and the session management unit 1300.
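On the question side the compound word is added to the segmented words instead of replacing them, which a few lines can illustrate. The function name `add_compounds` is hypothetical.

```python
def add_compounds(words, compounds):
    """Append generated compound words to the original words, keeping the originals."""
    search_words = list(words)
    for compound in compounds:
        if compound not in search_words:
            search_words.append(compound)
    return search_words


search_words = add_compounds(["燃料", "電池"], ["燃料電池"])
print(search_words)  # ['燃料', '電池', '燃料電池']
```

Keeping "燃料" and "電池" lets the search words match the document index, which holds the segmented words, while "燃料電池" can match the answer words in the word information.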

On acquiring the search words from the search word addition processing unit 730, the document search unit 800 refers to the document index information in the document index information storage unit 500 and searches for words corresponding to the search words. This search can use a conventionally known general technique based on, for example, the frequency of occurrence across the documents as a whole, the frequency of occurrence in each sentence of a document, and the frequency of occurrence in the question sentence. On acquiring the document identifiers associated with the words found by the search, the document search unit 800 outputs those document identifiers to the session management unit 1300.

Because the question sentence morpheme analysis processing unit 710 of the question analysis unit 700 has the same function as the morpheme analysis processing unit 310, the words (search words) obtained from the question analysis unit 700 may include words that the morpheme analysis processing unit 310 obtains by the same analysis. That is, the morpheme analysis processing unit 310, which analyzes the searched document 100, and the question sentence morpheme analysis processing unit 710, which analyzes the question sentence 1100, share the same morphological analysis function. When the searched document 100 and the question sentence 1100 contain the same word, the question analysis unit 700 therefore outputs search words that include a word also obtained by the morpheme analysis processing unit 310.

In addition, because the words generated by the search word post-processing unit 720 are added by the search word addition processing unit 730, the search words output by the question analysis unit 700 include words that the morpheme analysis processing unit 310 does not output, that is, words not contained in the document index information storage unit 500. Such a word has an occurrence frequency of 0 in the document index information, and this frequency can be used to run a document search on which the word has no influence. As a result, the document search unit 800 can search for documents without being affected by the words added by the search word addition processing unit 730, while the answer extraction unit 900, described later, can perform its search taking those added words into account.
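A minimal sketch of why an added compound word leaves the document search unchanged, assuming a simple frequency-sum score. The scoring function is an assumption for illustration (the patent only says a conventionally known frequency-based technique is used), and the index counts are illustrative values read off the example documents D1 and D2.

```python
# Illustrative index: word -> {document identifier: occurrence frequency}.
document_index = {
    "燃料": {"D1": 2, "D2": 1},
    "電池": {"D1": 2, "D2": 1},
    "開発": {"D1": 1, "D2": 1},
}


def term_frequency(index, word, doc_id):
    """Frequency of word in doc_id; 0 when the word is not indexed at all."""
    return index.get(word, {}).get(doc_id, 0)


def score_document(index, search_words, doc_id):
    """Hypothetical score: the sum of the search words' frequencies."""
    return sum(term_frequency(index, w, doc_id) for w in search_words)


without_compound = score_document(document_index, ["燃料", "電池"], "D1")
with_compound = score_document(document_index, ["燃料", "電池", "燃料電池"], "D1")
```

Because the compound "燃料電池" is absent from the index, it contributes a frequency of 0 and both scores are equal.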

The session management unit 1300 manages, in association with one another, the document identifiers associated with the words found by the document search unit 800, the search words from the question analysis unit 700, and the question sentence 1100 acquired by the input unit 1200. For example, when a document identifier obtained by the document search unit 800 from the search words is a Web URL, the session management unit 1300 accesses that URL and displays its content on a display input unit (not shown). It manages the question sentence 1100, the search words, and the document identifiers so that it can accept the user's input selecting the desired identifier.

The session management unit 1300 is provided in the client 30. Because complex logical operations over the multiple document identifiers and multiple search words arising from multiple question sentences 1100 are performed on the client, the server group 20 does not need to perform per-user management.

On acquiring from the session management unit 1300 the identifiers of the documents from which an answer is to be extracted, the question type, and the search words, the answer extraction unit 900 refers to the word information storage unit 600 and extracts from the word information the words whose attribute matches the question type. The question type is, for example, a person name, a place name, or a technology name, and can be obtained by analyzing the question sentence 1100 with a conventionally known method. Alternatively, the questioner may specify it on the input terminal, or, in a system that answers only one predetermined question type (for example, person names only), that type alone may simply be answered.

The answer extraction unit 900 also determines the importance of each answer from the positional relationship, within the document, between the search words from the session management unit 1300 and the answer candidate words, and extracts the answers most relevant to the question sentence 1100. The search words may include words not contained in the word information storage unit 600, because a word that is replaced in the word replacement processing unit 420 is not stored there. However, such a search word is not used in evaluating the positional relationship with the answer candidate words, so it does not affect answer extraction in the answer extraction unit 900.

On receiving the answers obtained by the answer extraction unit 900, the output unit 1400 presents them with the user's convenience in mind, for example by displaying each answer together with the corresponding question sentence managed by the session management unit 1300.

Next, the operation of the question answering system 10 is described in detail using concrete examples of the searched document 100 and the question sentence 1100.
As shown in FIG. 2, two example documents, D1 and D2, are used as the searched document 100. D1 reads "○× Taro: 'On the development of fuel cells' Fuel cells using hydrogen and oxygen ...", and D2 reads "A battery that generates electricity using hydrogen and oxygen as fuel was developed by ○× Taro (professor at ● University)."

First, the operation of the question answering system 10 is described up to the point where D1 and D2, as the searched document 100, are held in the document index information storage unit 500 and the word information storage unit 600.
When documents D1 and D2 acquired by the document input unit 200 are sent to the morpheme analysis processing unit 310 of the first document analysis unit 300, the morpheme analysis processing unit 310 segments them into morphemes. Taking document D1 as an example, its text "○× Taro: 'On the development of fuel cells' Fuel cells using hydrogen and oxygen ..." is divided at its morpheme boundaries, for example between "○×" and "Taro".

At this point, as shown in FIG. 3, the morpheme analysis processing unit 310 determines the part of speech of each segmented word and acquires the independent (content) words. That is, it acquires "○×", "Taro", "fuel", "battery", "development", "hydrogen", and "oxygen". For convenience of explanation, this embodiment acquires only independent words, but the invention is not limited to this.

The morpheme analysis processing unit 310 then counts how many times each acquired word appears in the document and uses that count as the word's occurrence frequency. In D1, "fuel" and "battery" each occur twice, and every other word occurs once. The unit associates each word with its occurrence frequency and with the document identifier "D1", and has the document index information storage unit 500 hold the result as document index information. After performing the same processing on document D2, the morpheme analysis processing unit 310 generates the document index information shown in FIG. 4 and outputs it to the document index information storage unit 500 for storage.
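The index construction described here can be sketched with a per-document word count. The nested-dictionary layout and the name `build_document_index` are assumptions for illustration; the actual layout of FIG. 4 is not reproduced in this text.

```python
from collections import Counter


def build_document_index(tokenized_docs):
    """Map each word to {document identifier: occurrence frequency}.

    tokenized_docs maps a document identifier to the list of independent
    words obtained by morphological analysis (hypothetical input format).
    """
    index = {}
    for doc_id, words in tokenized_docs.items():
        for word, freq in Counter(words).items():
            index.setdefault(word, {})[doc_id] = freq
    return index


# Independent words of D1 in order of appearance; "燃料" and "電池" occur twice.
d1_words = ["○×", "太郎", "燃料", "電池", "開発", "水素", "酸素", "燃料", "電池"]
index = build_document_index({"D1": d1_words})
```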

The morpheme analysis processing unit 310 also outputs the result of the morphological analysis to the word post-processing unit 410 of the second analysis unit 400. On acquiring the words from the morpheme analysis processing unit 310, the word post-processing unit 410 combines them to generate words that are suitable as answers. For example, having acquired "○×", "Taro", "fuel", "battery", "development", "hydrogen", and "oxygen" from the analysis of document D1, it generates "○× Taro" from "○×" and "Taro" on the basis of the "family name + given name" relationship, and generates "fuel cell" from "fuel" and "battery". In other words, the word post-processing unit 410 creates a compound noun from "noun" + "noun".
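As a rough sketch of the "noun" + "noun" rule, the following merges runs of adjacent nouns into a compound. The `(surface, part_of_speech)` token format and the part-of-speech labels are assumptions; a real system would take them from a morphological analyzer, and the "family name + given name" rule would need name-specific tags that are not modeled here.

```python
def generate_compounds(tokens):
    """Return (compound, parts) for every run of two or more adjacent nouns."""
    compounds = []
    run = []
    # A sentinel token flushes a noun run that reaches the end of the input.
    for surface, pos in tokens + [("", "end")]:
        if pos == "noun":
            run.append(surface)
            continue
        if len(run) >= 2:
            compounds.append(("".join(run), list(run)))
        run = []
    return compounds


# Hypothetical analysis of "水素と酸素による燃料電池は" from document D1.
tokens = [
    ("水素", "noun"), ("と", "particle"), ("酸素", "noun"),
    ("に", "particle"), ("よる", "verb"),
    ("燃料", "noun"), ("電池", "noun"), ("は", "particle"),
]
compounds = generate_compounds(tokens)
```

Single nouns such as "水素" are left alone because the intervening particle ends the run.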

The word post-processing unit 410 may also decide the range over which words are compounded from the dependency indicated by particles, for example the pattern "... による ... は" ("... by ... [topic]") in the phrase "(...) fuel cells using oxygen ..." in document D1. Alternatively, for cases such as noun + "battery", it may hold knowledge of battery types in advance and use that knowledge to create compound words.

When the word replacement processing unit 420 acquires the words generated by the word post-processing unit 410, such as "○× Taro" and "fuel cell", together with the original words "○×", "Taro", "fuel", "battery", "development", "hydrogen", and "oxygen", it replaces the original words with the compound words. That is, the original words "○×" and "Taro" are replaced with "○× Taro", and "fuel" and "battery" are replaced with "fuel cell". The unit thereby acquires "○× Taro", "fuel cell", "development", "hydrogen", and "oxygen" as answer words.
After performing the same processing on document D2, the word replacement processing unit 420 outputs to the word information storage unit 600, for storage, the word information shown in FIG. 5, in which each answer word is associated with its attribute, its position of occurrence in the document, and the document identifier.
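The replacement step can be sketched as a left-to-right scan that swaps each run of source words for its compound. The function name and the `(compound, parts)` pairs are hypothetical, and the attribute and position fields of FIG. 5 are left out because their exact form is not given in this text.

```python
def replace_with_compounds(words, compounds):
    """Replace each occurrence of a compound's source words with the compound."""
    result = []
    i = 0
    while i < len(words):
        for compound, parts in compounds:
            if words[i:i + len(parts)] == parts:
                result.append(compound)
                i += len(parts)
                break
        else:
            # No compound starts at this position; keep the original word.
            result.append(words[i])
            i += 1
    return result


original_words = ["○×", "太郎", "燃料", "電池", "開発", "水素", "酸素"]
compound_rules = [("○×太郎", ["○×", "太郎"]), ("燃料電池", ["燃料", "電池"])]
answer_words = replace_with_compounds(original_words, compound_rules)
```

After replacement the separate words "燃料" and "電池" no longer appear among D1's answer words, which is why a search word may be missing from the word information.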

Next, the operation of the question answering system 10 is described from the point where the user enters the question sentence "developer of the fuel cell" (燃料電池の開発者), which asks for a person's name, at the input unit 1200 of the client 30, to the point where the answer is obtained.
The question sentence entered at the input unit 1200 of the client 30 is sent via the network to the question analysis unit 700 and the session management unit 1300 of the client 30. At this time, the question type acquired by the input unit 1200, "asks for a person's name", is also sent to the session management unit 1300.

In the question analysis unit 700, the question sentence morpheme analysis processing unit 710 first performs morphological analysis on the question sentence. As shown in FIG. 6, it segments the question sentence "developer of the fuel cell" into morphemes and acquires the independent words. This analysis uses the same method as the morpheme analysis processing unit 310 described above.

Having acquired the words obtained by the question sentence morpheme analysis processing unit 710, the search word post-processing unit 720 generates compound words from them as appropriate, using the same method as the word post-processing unit 410 described above. That is, "fuel cell" is generated from "fuel" and "battery", and "developer" (開発者) is generated from "development" (開発) and "person" (者). The search word post-processing unit 720 then performs normalization to resolve variations in notation.

The search word addition processing unit 730 then adds the words generated by the search word post-processing unit 720 to the word group acquired by the question sentence morpheme analysis processing unit 710. That is, it adds "fuel cell" and "developer" to "fuel", "battery", "development", and "person". The resulting word group is output as the search words to the document search unit 800 and the session management unit 1300.

Having acquired the search words from the question analysis unit 700, the document search unit 800 refers to the document index information storage unit 500 and checks whether words corresponding to the search words are contained in the document index information shown in FIG. 4. If a word matches a search word, it outputs the document identifiers associated with that word to the session management unit 1300. Here, the document search unit 800 determines that the search words "fuel", "battery", and "development" are contained in the index, acquires the associated document identifiers D1 and D2, and outputs them to the session management unit 1300.
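The lookup in this step amounts to collecting the document identifiers of every indexed word that matches a search word. The sketch below assumes the same word-to-document mapping used for illustration earlier; the frequencies are illustrative values read off D1 and D2, and `search_documents` is a hypothetical name.

```python
def search_documents(index, search_words):
    """Return the search words found in the index and their document identifiers."""
    matched_words = [w for w in search_words if w in index]
    doc_ids = sorted({doc_id for w in matched_words for doc_id in index[w]})
    return matched_words, doc_ids


# Illustrative excerpt of the document index for D1 and D2.
document_index = {
    "燃料": {"D1": 2, "D2": 1},
    "電池": {"D1": 2, "D2": 1},
    "開発": {"D1": 1, "D2": 1},
    "水素": {"D1": 1, "D2": 1},
    "酸素": {"D1": 1, "D2": 1},
}
search_words = ["燃料", "電池", "開発", "者", "燃料電池", "開発者"]
matched_words, doc_ids = search_documents(document_index, search_words)
```

The added compounds "燃料電池" and "開発者" simply find no entry, so D2 is still retrieved through its separate words "燃料" and "電池".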

The session management unit 1300 displays the question sentence together with the acquired document identifiers on a display input device (not shown), and the user selects the desired document identifier from what is displayed. If, for example, a URL is associated with a document identifier, the session management unit 1300 accesses that URL and displays all or part of the Web page's content on the display input device, so that the user can choose a document identifier on the basis of the displayed content. The document identifier output by the session management unit 1300 is therefore one that the user has freely selected as likely to contain the word that answers the question sentence (the answer word).

The document identifier selected by the user, the search words from the question analysis unit 700, and the question type are then output from the session management unit 1300 via the network to the answer extraction unit 900.

In session management in a conventional question answering system, the results of morphological analysis are exchanged between the server and the client in order to acquire document identifiers that match the search words at the heart of the question sentence. That is, information on every word segmented by the morphological analysis is communicated between the server and the client. Furthermore, after the document identifiers are acquired, the searched documents associated with them are morphologically analyzed again to extract the answer, and those results are also exchanged between the server and the client. Consequently, the longer the text of the searched documents and the question sentence, the greater the data traffic and the lower the processing efficiency.

In the session management of the present invention, by contrast, there is no need to perform morphological analysis separately for the document identifier acquisition process and the answer word extraction process, nor to transfer information on every word segmented by the morphological analysis. Only the document identifiers and the search words need to be communicated, so the data traffic is reduced compared with the conventional approach.
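To make the reduced payload concrete, the following sketch serializes the only items the session management unit needs to send onward: the selected document identifiers, the search words, and the question type. The `AnswerRequest` class, its field names, the `"person_name"` label, and the use of JSON are all assumptions for illustration; the patent does not specify a wire format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AnswerRequest:
    """Hypothetical message from session management to answer extraction."""
    document_ids: list
    search_words: list
    question_type: str


request = AnswerRequest(
    document_ids=["D1", "D2"],
    search_words=["燃料", "電池", "開発", "者", "燃料電池", "開発者"],
    question_type="person_name",
)
# ensure_ascii=False keeps the Japanese search words readable in the payload.
payload = json.dumps(asdict(request), ensure_ascii=False)
```

The size of this message depends on the number of search words and identifiers, not on the length of the searched documents.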

Having acquired the document identifier, the search words, and the question type, the answer extraction unit 900 refers to the word information shown in FIG. 5 and, from the answer words associated with the document identifier, extracts "○× Taro" at W11 and W12, which corresponds to the question type, as the answer. From the same answer words it also extracts those that match the search words, "fuel cell", "development", "fuel", and "battery", namely R11, R12, R13, R15, R18, R19, and R20 in FIG. 5. The answer extraction unit 900 assigns a score to each extracted answer word by a conventionally known method based on, for example, its attribute, the positional relationships of occurrences in the document, and the question type. The scored answer words are sent from the answer extraction unit 900 to the output unit 1400.
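The patent leaves the scoring method to conventionally known techniques, so the following is only a hedged sketch of one possibility: candidates whose attribute matches the question type are scored by their proximity to matched search words in the same document. The record layout, the attribute labels, the positions, and the `1 / (1 + distance)` weighting are all assumptions, not the content of FIG. 5.

```python
def extract_answers(word_info, doc_ids, search_words, question_type):
    """Score candidates of the requested attribute by proximity to search words."""
    in_scope = [r for r in word_info if r["doc_id"] in doc_ids]
    candidates = [r for r in in_scope if r["attribute"] == question_type]
    anchors = [r for r in in_scope if r["word"] in search_words]
    scored = []
    for cand in candidates:
        # Closer search-word occurrences in the same document weigh more.
        score = sum(
            1.0 / (1 + abs(cand["position"] - a["position"]))
            for a in anchors
            if a["doc_id"] == cand["doc_id"]
        )
        scored.append((cand["word"], cand["doc_id"], score))
    return sorted(scored, key=lambda item: item[2], reverse=True)


def record(word, attribute, doc_id, position):
    return {"word": word, "attribute": attribute, "doc_id": doc_id, "position": position}


# Hypothetical word information for D1 and D2 (positions are word offsets).
word_info = [
    record("○×太郎", "person", "D1", 0), record("燃料電池", "noun", "D1", 1),
    record("開発", "noun", "D1", 2), record("水素", "noun", "D1", 3),
    record("酸素", "noun", "D1", 4), record("燃料電池", "noun", "D1", 5),
    record("水素", "noun", "D2", 0), record("酸素", "noun", "D2", 1),
    record("燃料", "noun", "D2", 2), record("電池", "noun", "D2", 3),
    record("○×太郎", "person", "D2", 4), record("開発", "noun", "D2", 5),
]
search_words = ["燃料", "電池", "開発", "者", "燃料電池", "開発者"]
answers = extract_answers(word_info, ["D1", "D2"], search_words, "person")
```

Here the compound "燃料電池" serves as an anchor in D1 while the separate words "燃料" and "電池" serve as anchors in D2, which is the reason both forms are kept among the search words.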

Having acquired the scored answer words, the output unit 1400 displays them, for example, in descending order of score. It may display the question sentence and other information together with the answer words.

As described above, with the question answering system 10 of the present invention, the document search unit 800 can acquire the identifier of a document containing the key word of the question sentence, "fuel cell", both for document D1, which contains "fuel cell" as a single unit, and for document D2, which contains "fuel" and "battery" as separate words. The answer extraction unit 900 can then extract an answer word appropriate to the question sentence on the basis of the acquired document identifiers, the search words, and the question type.

Next, a case is described in which the question sentence is, for example, "what Professor ○× Taro developed" (○×太郎先生が開発したもの) and the question type is noun.
The question sentence is morphologically analyzed by the question sentence morpheme analysis processing unit 710 of the question analysis unit 700, and the words "○×", "Taro", "teacher" (先生), "development", and "thing" (もの) are acquired as independent words. The search word post-processing unit 720 generates the compound word "○× Taro" from "○×" and "Taro". On acquiring the generated compound word, the search word addition processing unit 730 generates the word group (search words) consisting of "○×", "Taro", "teacher", "development", "thing", and "○× Taro".

The document search unit 800 refers to the document index information storage unit 500 and, in the document index information shown in FIG. 4, acquires the document identifiers associated with words that match the search words. That is, it acquires D1 and D2 and outputs them to the session management unit 1300. Once the session management unit 1300 has acquired the document identifiers from the document search unit 800, the search words from the question analysis unit 700, and the question sentence and question type from the input unit 1200, it outputs the document identifiers, the search words, and the question type to the answer extraction unit 900.

The answer extraction unit 900 refers to the word information storage unit 600 and extracts answer words on the basis of the acquired document identifiers, search words, and question type. That is, from the answer words associated with the document identifiers in the word information, it extracts those matching the question type, noun: "fuel cell", "hydrogen", "oxygen", "fuel", and "battery", namely R11, R13, R14, R15, R16, R17, R18, and R19 in FIG. 5. The answer extraction unit 900 can thus extract "fuel cell", in which "fuel" and "battery" are combined, as an answer to the question sentence.

The answer extraction unit 900 assigns a score to each extracted answer word by a conventionally known method based on, for example, its attribute, the positional relationships of occurrences in the document, and the question type, and outputs the scored answer words to the output unit 1400 as answers to the question sentence.
In this embodiment, words that can be both a noun and a verb are often unsuitable as answers to a question sentence, so such answer words are excluded from extraction. Instead of excluding unsuitable answer words, the answer extraction unit 900 may assign them a low score.
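The two options described above, excluding a word that is both noun and verb or merely lowering its score, can be sketched as follows. The part-of-speech lexicon, the treatment of "開発" as noun and verb, and the penalty factor are assumptions for illustration only.

```python
def apply_ambiguity_rule(scored_words, pos_lexicon, mode="exclude", penalty=0.1):
    """Drop or down-weight words that can be both a noun and a verb.

    mode="exclude" removes such words; any other mode multiplies their
    score by the (hypothetical) penalty factor.
    """
    result = []
    for word, score in scored_words:
        ambiguous = {"noun", "verb"} <= pos_lexicon.get(word, set())
        if ambiguous and mode == "exclude":
            continue
        result.append((word, score * penalty if ambiguous else score))
    return result


# Hypothetical lexicon: "開発" is assumed to act as both noun and verb stem.
pos_lexicon = {"開発": {"noun", "verb"}, "燃料電池": {"noun"}, "水素": {"noun"}}
scored = [("燃料電池", 2.0), ("開発", 1.0), ("水素", 0.5)]
excluded = apply_ambiguity_rule(scored, pos_lexicon, mode="exclude")
penalized = apply_ambiguity_rule(scored, pos_lexicon, mode="penalize")
```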

As explained above, in the question answering system 10 of the present invention, the first document analysis unit 300 morphologically analyzes the searched document 100 to create document index information, and the second analysis unit 400 creates compound words and performs replacement with them to create word information. The question analysis unit 700 morphologically analyzes the question sentence 1100, creates compound words, and adds them to form the search words. The document search unit 800 uses those search words to acquire the document identifiers associated with words in the document index information, and the answer extraction unit 900 extracts from the answer words associated with those document identifiers on the basis of the question type and the search words. As a result, even when the text of a searched document does not directly contain the key word of the question sentence, the identifier of a document that is a candidate for answering the question can be acquired, and an appropriate answer word can be extracted from the word information associated with that identifier.

Although the embodiment has been described with session management performed on the client 30 side, session management may instead be performed on the server group 20 side.

FIG. 1 is a block diagram of the question answering system of the present invention.
FIG. 2 is a diagram showing an example of the text of searched documents.
FIG. 3 is a diagram showing an example of morphological analysis of a searched document.
FIG. 4 is a diagram showing document index information.
FIG. 5 is a diagram showing word information.
FIG. 6 is a diagram showing an example of morphological analysis of a question sentence.

Explanation of symbols

10 Question answering system
20 Server group
30 Client
40 Searched document processing unit
50 Search processing unit
60 Answer processing unit
100 Searched document
200 Document input unit
300 First document analysis unit
310 Morpheme analysis processing unit
400 Second analysis unit
410 Word post-processing unit
420 Word replacement processing unit
500 Document index information storage unit
600 Word information storage unit
700 Question analysis unit
710 Question sentence morpheme analysis processing unit
720 Search word post-processing unit
730 Search word addition processing unit
800 Document search unit
900 Answer extraction unit
1100 Question sentence
1200 Input unit
1300 Session management unit
1400 Output unit

Claims

A question answering device comprising: a searched document processing unit that accepts a searched document and analyzes morphemes of the searched document; a search processing unit that acquires, from a question sentence, a search word that is the key to the content of the question and acquires a document that matches the search word; and an answer processing unit that assigns a score to each word of the acquired document and extracts, based on the score, a word that becomes an answer to the question sentence, wherein
The searched document processing unit includes:
A first document analysis unit that generates document index information in which each word obtained by morphological analysis of the accepted searched document is associated with a document identifier of the searched document; and
A second document analysis unit that, based on the words obtained by the first document analysis unit, generates as appropriate words that can be answer targets of the question sentence, replaces the generated words with the original words, and, using these words as answer words, generates word information that associates each answer word with an attribute, an appearance position in the searched document, and a document identifier of the searched document;
The search processing unit includes:
A question analysis unit that, based on the words obtained by morphological analysis of the question sentence, generates as appropriate words that can be answer targets of the question sentence, adds the generated words to the words obtained by the morphological analysis, and outputs the resulting words as search words; and
A document search unit that refers to the document index information and obtains a document identifier of a document that matches the search words; and
The answer processing unit includes:
An answer extraction unit that refers to the word information and, based on the document identifier from the document search unit and the search words from the question analysis unit, extracts a word that becomes an answer to the question sentence from the words associated with the document identifier.
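
The data flow of the device claimed above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the patented implementation: whitespace splitting stands in for morphological analysis, the attribute labels ("PERSON", "PLACE") and all function names are hypothetical, and the proximity-based scoring rule is one plausible choice, since the claim only states that a score is assigned to each word.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WordInfo:
    """One entry of the word information (hypothetical layout)."""
    word: str
    attribute: str  # e.g. "PERSON" or "PLACE"; labels are assumptions
    position: int   # token offset within the searched document
    doc_id: str


def tokenize(text):
    # Stand-in for morphological analysis: plain whitespace split.
    return text.split()


def build_index(docs):
    """Document index information: word -> set of document identifiers."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def build_word_info(docs, attributes):
    """Word information for every token that has a known attribute."""
    infos = []
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            if word in attributes:
                infos.append(WordInfo(word, attributes[word], pos, doc_id))
    return infos


def search(index, search_words):
    """Identifiers of documents that contain all search words."""
    hits = [index.get(w, set()) for w in search_words]
    return set.intersection(*hits) if hits else set()


def extract_answer(infos, docs, doc_ids, search_words, question_type):
    """Score candidate answer words by closeness to the search words.

    The scoring rule (sum of 1 / (1 + distance)) is an assumption.
    """
    scores = defaultdict(float)
    for info in infos:
        if info.doc_id not in doc_ids or info.attribute != question_type:
            continue
        for pos, token in enumerate(tokenize(docs[info.doc_id])):
            if token in search_words:
                scores[info.word] += 1.0 / (1 + abs(pos - info.position))
    return max(scores, key=scores.get) if scores else None
```

Because the index and the word information are built once when documents are accepted, answering a question only requires lookups in these two structures, which is the division of work the claim describes.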

The question answering device according to claim 1, wherein the question analysis unit includes:
A question sentence morpheme analysis processing unit that performs, on the question sentence, the same morphological analysis as that performed in the first document analysis unit;
A search word post-processing unit that compounds the words acquired by the question sentence morpheme analysis processing unit to generate compound words that can be answer targets of the question sentence, and eliminates notational variation; and
A search word addition processing unit that adds the compound words generated by the search word post-processing unit to the words acquired by the question sentence morpheme analysis processing unit.
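
The post-processing and addition steps described above can be sketched as follows. This is a hedged illustration only: the rule of joining runs of adjacent nouns into a compound word, the use of Unicode NFKC normalization plus lowercasing to absorb notational variation, the part-of-speech tags, and the function names `normalize`, `make_compounds`, and `build_search_words` are all assumptions that do not come from the claims.

```python
import unicodedata


def normalize(word):
    # Fold width variants (e.g. full-width Latin letters) and letter case.
    return unicodedata.normalize("NFKC", word).lower()


def make_compounds(tagged, joiner=""):
    """Join each run of two or more adjacent nouns into one compound word.

    `tagged` is a list of (word, part_of_speech) pairs, as a morphological
    analyzer might return; the "NOUN" tag name is an assumption.
    """
    compounds, run = [], []
    for word, pos in list(tagged) + [("", "END")]:  # sentinel flushes last run
        if pos == "NOUN":
            run.append(word)
            continue
        if len(run) > 1:
            compounds.append(joiner.join(run))
        run = []
    return compounds


def build_search_words(tagged):
    """Original nouns plus generated compounds, normalized and de-duplicated."""
    words = [normalize(w) for w, pos in tagged if pos == "NOUN"]
    compounds = [normalize(c) for c in make_compounds(tagged)]
    return list(dict.fromkeys(words + compounds))  # keeps first-seen order
```

Adding the compound words to the original words, rather than replacing them, lets the document search match either the individual words or the compound form.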

The question answering device according to claim 1, wherein the second document analysis unit includes:
A word post-processing unit that generates, based on the words obtained by the first document analysis unit, compound words that can be answer targets of the question sentence, and eliminates notational variation; and
A word replacement processing unit that replaces the compound word generated by the word post-processing unit with a compound source word.

A question answering system in which a client and a server are connected via a network, wherein
The server includes a searched document processing unit that accepts a searched document and analyzes morphemes of the searched document, a search processing unit that acquires, from a question sentence, a search word that is the key to the content of the question and acquires a document that matches the search word, and an answer processing unit that assigns a score to each word of the acquired document and extracts, based on the score, a word that becomes an answer to the question sentence,
The client includes an input unit that obtains a question sentence from a user and an output unit that outputs an answer to the question sentence,
The searched document processing unit includes:
A first document analysis unit that generates document index information in which each word obtained by morphological analysis of the accepted searched document is associated with a document identifier of the searched document; and
A second document analysis unit that, based on the words obtained by the first document analysis unit, generates as appropriate words that can be answer targets of the question sentence, replaces the generated words with the original words, and, using these words as answer words, generates word information that associates each answer word with an attribute, an appearance position in the searched document, and a document identifier of the searched document;
The search processing unit includes:
A question analysis unit that, based on the words obtained by morphological analysis of the question sentence, generates as appropriate words that can be answer targets of the question sentence, adds the generated words to the words obtained by the morphological analysis, and outputs the resulting words as search words; and
A document search unit that refers to the document index information and obtains a document identifier of a document that matches the search words;
The answer processing unit includes:
An answer extraction unit that refers to the word information and, based on the document identifier from the document search unit and the search words from the question analysis unit, extracts a word that becomes an answer to the question sentence from the words associated with the document identifier; and
The client further includes a session management unit that manages the search words from the question analysis unit and the document identifier from the document search unit in association with the question sentence acquired by the input unit, and outputs the search words and the document identifier to the answer extraction unit.

The question answering system according to claim 4, wherein the session management unit displays the searched documents associated with the acquired document identifiers, causes the user to select a desired searched document, and outputs the document identifier associated with the selected searched document to the answer extraction unit.
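
The role of the session management unit can be sketched in Python as below. This is a minimal sketch under stated assumptions: the class and method names (`SessionManager`, `open`, `select_documents`, `extraction_request`) are hypothetical, sessions are kept in memory on the client, and the network exchange with the server is omitted.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """State kept per question sentence (hypothetical layout)."""
    question: str
    search_words: list
    doc_ids: list


class SessionManager:
    """Keeps search words and document identifiers tied to each question."""

    def __init__(self):
        self._sessions = {}
        self._next_id = 0

    def open(self, question, search_words, doc_ids):
        """Store the search results for a question; return a session id."""
        sid = self._next_id
        self._next_id += 1
        self._sessions[sid] = Session(question, list(search_words), list(doc_ids))
        return sid

    def select_documents(self, sid, chosen):
        """Narrow the session to the documents the user selected."""
        session = self._sessions[sid]
        unknown = set(chosen) - set(session.doc_ids)
        if unknown:
            raise ValueError(f"not part of this session: {sorted(unknown)}")
        session.doc_ids = [d for d in session.doc_ids if d in chosen]

    def extraction_request(self, sid):
        """What would be passed on to the answer extraction unit."""
        session = self._sessions[sid]
        return list(session.search_words), list(session.doc_ids)
```

Holding this state per question means the search processing and answer processing units do not need to remember anything between the search step and the extraction step.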

The question answering system according to claim 4, wherein the question analysis unit includes:
A question sentence morpheme analysis processing unit that performs, on the question sentence, the same morphological analysis as that performed in the first document analysis unit;
A search word post-processing unit that compounds the words acquired by the question sentence morpheme analysis processing unit to generate compound words that can be answer targets of the question sentence, and eliminates notational variation; and
A search word addition processing unit that adds the compound words generated by the search word post-processing unit to the words acquired by the question sentence morpheme analysis processing unit.

The question answering system according to claim 4, wherein the second document analysis unit includes:
A word post-processing unit that generates, based on the words obtained by the first document analysis unit, compound words that can be answer targets of the question sentence, and eliminates notational variation; and
A word replacement processing unit that replaces the compound word generated by the word post-processing unit with a compound source word.