JPWO2009113494A1

JPWO2009113494A1 - Question answering system capable of descriptive answers using WWW as information source

Info

Publication number: JPWO2009113494A1
Application number: JP2010502807A
Authority: JP
Inventors: 辰則森; 佐藤　充; 充佐藤; 円香石下
Original assignee: Yokohama National University NUC
Current assignee: Yokohama National University NUC
Priority date: 2008-03-10
Filing date: 2009-03-09
Publication date: 2011-07-21
Anticipated expiration: 2029-03-09
Also published as: WO2009113494A1; JP5461388B2

Abstract

This invention concerns technology that accepts a question written in ordinary language, extracts answer candidates from documents on the WWW, and presents them to the user. Its object is to reflect the formal correlation between question and answer. A question sentence 1001 is turned into a question-sentence formalized word string 1002 by a formalization process that converts function words, interrogatives, and predetermined words that tend to be the focus of a question (such as "reason" or "meaning") into their surface expressions, and converts the other content words into their part-of-speech types. From this string, a question-sentence format main part 1003, consisting of the interrogative and a predetermined number of words before and after it, is extracted. The reference question sentences in a question-answer example collection are converted in the same way, and the suitability of the format is judged from the similarity of the reference question-sentence format main part 1004. The reference answer sentences in the collection are also formalized, and their degree of formal correlation with the question sentence is obtained.
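The formalization step in the abstract can be sketched as follows. This is a minimal illustration, assuming the sentence has already been split into (surface, part-of-speech) pairs by some morphological analyzer. The tag names, the `FORM_POS` and `FOCUS_WORDS` sets, and the function name `formalize` are hypothetical and not taken from the patent.

```python
# Hypothetical tag names; a real morphological analyzer has its own tag set.
FORM_POS = {"particle", "auxiliary", "interrogative"}  # function words, interrogatives
FOCUS_WORDS = {"理由", "方法", "意味", "違い"}           # assumed list of focus words


def formalize(tokens):
    """Map (surface, pos) pairs to a formalized word string.

    Following the abstract: words that carry the format keep their surface
    expression; all other content words are generalized to their POS type.
    """
    out = []
    for surface, pos in tokens:
        if pos in FORM_POS or surface in FOCUS_WORDS:
            out.append(surface)
        else:
            out.append(f"<{pos}>")
    return out


# 「空はなぜ青いのか」, tagged by hand for illustration only.
question = [("空", "noun"), ("は", "particle"), ("なぜ", "interrogative"),
            ("青い", "adjective"), ("の", "particle"), ("か", "particle")]
formalized_question = formalize(question)
print(formalized_question)
```

Two questions that differ only in their content words map to the same formalized string, which is what allows questions to be compared by form rather than by topic.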

Description

The present invention relates to a technique for accepting a question written in ordinary language, extracting answer candidates from documents on the WWW, and presenting them to the user.

Question answering systems that use the WWW as an information source have been studied previously. There are factoid-type systems, which handle questions answered by short expressions such as person names, place names, and quantities, and non-factoid-type systems, which handle questions answered by long descriptions such as definitions and reasons. This document focuses on the non-factoid type.

The appropriateness of a text as an answer candidate in non-factoid question answering is said to be measurable from two viewpoints: "Is its content relevant to the question sentence?" (viewpoint 1) and "Is the manner of answering (the description style) appropriate for the question type?" (viewpoint 2). Here, the question type is the kind of question being asked ("definition", "method", "reason", and so on). A description style is a manner of writing that contains expressions suited to what is being described. For example, a description of a "reason" contains expressions such as 「〜からである。」 ("it is because ...") or 「〜ために、…」 ("in order to ..., ...").

In the common approach, the question type is estimated first, and type-specific processing is then performed, particularly for viewpoint 2. This approach suffers from limited accuracy in question type estimation and from the labor of manually preparing, for each question type, the clue expressions used to judge viewpoint 2.

In contrast, Non-Patent Document 1 proposes a learning-based non-factoid question answering method that answers without estimating the question type. It acquires description styles by using as training data the large number of question-answer examples found on a community question answering site where people answer the questions.

Non-Patent Document 2 proposes a non-factoid question answering method that does not depend on the question type: from a set of question-answer examples on FAQ sites, it computes the probability that an answer is "rewritten" into a question. Note that in both methods the set of question-answer examples, such as FAQs, is not the information source from which answers are extracted. The information source is separate and may be, for example, digitized newspaper articles or documents on the WWW.

Non-Patent Document 1: 水野淳太 (Junta Mizuno) and one other, "Analysis of real-world questions and study of an answer type determination method for question answering targeting arbitrary answers" (in Japanese), Proceedings of the 13th Annual Meeting of the Association for Natural Language Processing (2007), The Association for Natural Language Processing, March 2007, pp. 1002-1005.
Non-Patent Document 2: Radu Soricut and one other, "Automatic Question Answering Using the Web", Journal of Information Retrieval, Special Issue on Web Information Retrieval, November 2006, Vol. 9, pp. 191-206.

The method of Non-Patent Document 1 has the advantage that the question type need not be estimated explicitly and that clue expressions are learned automatically from the set of question-answer examples. However, it lacks flexibility in selecting answers. By its methodology, the text is cut into units of a predetermined size, such as paragraphs. These units are ranked by viewpoint 1 to form answer candidates, each candidate is judged under viewpoint 2 as to whether it constitutes an answer, and the candidates are then re-ranked by viewpoint 2. Since the extent of an answer in question answering is usually not fixed but varies, this method may yield only imperfect answers that are too short or too long. Moreover, because the re-ranking sorts by viewpoint 2 without regard to viewpoint 1, the importance of a candidate that derives from its content relevance (viewpoint 1) is not sufficiently reflected.

The method of Non-Patent Document 2 likewise requires the size of the answer text to be decided in advance, and it also requires the answer length to be estimated separately from the question length. Furthermore, because it handles measure 1 and measure 2 (the measures corresponding to viewpoints 1 and 2) together as word rewriting probabilities, the learned information mixes content-related and description-style-related information. Accuracy is therefore expected to fall unless the expressions appearing in the question-answer examples available for training have sufficient coverage.

In the present invention, the large collection of question-answer examples on a question answering community site is used only to judge the appropriateness of the description style (viewpoint 2). A separate measure for viewpoint 1 is prepared and combined with it, enabling non-factoid question answering for any question type. The appropriateness of the description style is not judged by a trained model. Instead, the question side of the question-answer example collection is searched with the user's question on the basis of description-style similarity, and information about the description style of answers is obtained dynamically from the corresponding answer examples. This configuration has two benefits: (i) the measures for viewpoint 1 and viewpoint 2 can be defined independently and yet integrated into a single evaluation measure that considers both at once, and (ii) when more question-answer examples become available, no retraining is needed; the new examples are simply registered.

The question answering system according to the present invention is a question answering system that receives a question sentence, extracts sentences suited as answers to the question sentence from a group of documents to be searched, and outputs them as response sentences, and is characterized by having the following elements:
(1) A question sentence input unit that inputs a question sentence.
(2) A question sentence formalization unit that converts the input question sentence into a question-sentence formalized word string by a formalization process. The formalization process divides a sentence into words and analyzes the part-of-speech type and surface expression of each word; determines a word to be a word relating to the question-response format when it is a function word, an interrogative, or a predetermined word that tends to be the focus of a question, and otherwise determines it to be a word relating to the question-response content; converts words relating to the question-response format into their part-of-speech types and words relating to the question-response content into their surface expressions; and produces a formalized word string whose units are the converted part-of-speech types or surface expressions.
(3) A question sentence format main part extraction unit that extracts a question-sentence format main part from the question-sentence formalized word string by a question sentence format main part extraction process, which extracts a formalized word string of a predetermined number of words containing the interrogative at a predetermined position.
(4) A reference response example storage unit that stores a plurality of reference response examples, each consisting of a pair of a reference question sentence and a reference answer sentence.
(5) A reference response example formalization unit that, for each reference response example in the reference response example storage unit, converts the reference question sentence into a reference question-sentence formalized word string by the formalization process and converts the reference answer sentence into a reference answer-sentence formalized word string by the formalization process.
(6) A response example format storage unit that stores a plurality of pairs of a converted reference question-sentence formalized word string and reference answer-sentence formalized word string, each in association with a reference response ID.
(7) A similar-format question sentence extraction unit that, for each reference question-sentence formalized word string in the response example format storage unit, extracts a reference question-sentence format main part by the question sentence format main part extraction process, compares it with the question-sentence format main part, and, when the two are identical or similar, identifies the reference response ID of the reference question-sentence formalized word string from which that main part was extracted as the reference response ID of a similar-format question sentence, that is, a question sentence whose format resembles that of the input question sentence.
(8) A similar-format question sentence set storage unit that stores the group of reference response IDs of the identified similar-format question sentences as a similar-format question sentence set.
(9) A response format element correlation calculation unit that acquires from the response example format storage unit the reference answer-sentence formalized word string corresponding to the reference response ID of each similar-format question sentence; extracts in order, from each acquired string, response format elements, which are formalized word strings of a predetermined number of words smaller than the number of words in said formalized word string; for each response format element, searches whether it is contained in each reference answer-sentence formalized word string in the response example format storage unit and stores the group of reference response IDs of the strings that contain it as a response-format-element-containing answer sentence set; and, using at least the number of reference response IDs contained in both the similar-format question sentence set and the response-format-element-containing answer sentence set, the number of reference response IDs contained in the similar-format question sentence set, and the number of reference response IDs contained in the response-format-element-containing answer sentence set, calculates a question format correlation degree, which indicates the degree to which the response format element is related in format to the similar-format question sentences, based on the probability that a reference answer-sentence formalized word string containing the response format element is paired with a similar-format question sentence.
(10) A response format element correlation table that stores the question format correlation degree calculated for each response format element.
(11) A related word generation unit that extracts keywords, which are content words, from the question sentence; searches the group of documents to be searched using the keywords as conditions; and, based on the frequencies of the words in the retrieved documents, extracts related words that are related in content to the question sentence and calculates the relevance degree of each related word.
(12) A related word table that stores the relevance degree calculated for each related word.
(13) A related document search unit that extracts keywords, which are content words, from the question sentence and searches the group of documents to be searched for related documents using the keywords as conditions.
(14) A related document storage unit that stores the retrieved related documents with a sentence number associated with each related sentence contained in them.
(15) A sentence score calculation unit that, for each related sentence, converts the related sentence into a related-sentence formalized word string by the formalization process; extracts the response format elements in order from that string; acquires the question format correlation degree of each response format element from the response format element correlation table; acquires the relevance degree, as a related word, of each word in the related sentence from the related word table; and, based on the acquired question format correlation degrees and relevance degrees, calculates a sentence score indicating the suitability of the sentence as an answer to the question sentence.
(16) A sentence score table that stores the sentence score of each related sentence in association with its sentence number.
(17) An answer candidate extraction unit that extracts, as answer candidates, the sentence numbers of related sentences whose sentence scores indicate high suitability.
(18) A response sentence output unit that outputs, as response sentences, the related sentences identified by the sentence numbers of the answer candidates.
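Element (15) combines two looked-up quantities into one sentence score, but this section does not state the combining formula. The sketch below is therefore only an assumption: it takes response format elements to be word bigrams (`n=2`) and mixes the summed style correlations and the summed word relevance with a weight `alpha`. The function name, parameters, and table contents are all hypothetical.

```python
def sentence_score(formalized, words, correlation, relevance, n=2, alpha=0.5):
    """Score one related sentence from its style and content evidence.

    formalized  -- formalized word string of the sentence (list of str)
    words       -- surface words of the same sentence (list of str)
    correlation -- dict: response format element (tuple) -> correlation degree
    relevance   -- dict: related word (str) -> relevance degree
    The weighted sum is an assumed combination, not the patent's formula.
    """
    elements = [tuple(formalized[i:i + n]) for i in range(len(formalized) - n + 1)]
    style = sum(correlation.get(e, 0.0) for e in elements)
    content = sum(relevance.get(w, 0.0) for w in words)
    return alpha * style + (1 - alpha) * content


# Toy tables standing in for the correlation table and the related word table.
correlation_table = {("から", "だ"): 2.0}
related_word_table = {"散乱": 1.5, "空": 0.5}

score = sentence_score(
    formalized=["<noun>", "は", "<noun>", "から", "だ"],
    words=["空", "は", "散乱", "から", "だ"],
    correlation=correlation_table,
    relevance=related_word_table,
)
print(score)
```

Keeping the two sums separate until the last line mirrors the stated design goal of defining the style measure and the content measure independently before integrating them.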

Furthermore, at least one of "reason" (理由), "method" (方法), "meaning" (意味), or "difference" (違い) is used as the predetermined word that tends to be the focus of a question.

Furthermore, the formalization process also determines a word to be a word relating to the question-response format when it is one of the predetermined verbs and adjectives that appear frequently in the reference response examples.

Furthermore, the formalized word string extracted by the question sentence format main part extraction process is a formalized word string of seven words in total, consisting of the interrogative at the center and the three words before and after it.
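The seven-word window can be sketched as below. The padding token used when the interrogative sits near a sentence boundary is an assumption (this section does not say how short contexts are handled), and the function name `extract_main_part` is hypothetical.

```python
def extract_main_part(formalized, interrogatives, before=3, after=3, pad="<pad>"):
    """Return the window around the first interrogative, or None if absent.

    With the default arguments the window holds 3 + 1 + 3 = 7 tokens.
    Positions outside the sentence are filled with `pad` (an assumption).
    """
    for i, tok in enumerate(formalized):
        if tok in interrogatives:
            padded = [pad] * before + list(formalized) + [pad] * after
            center = i + before
            return padded[center - before:center + after + 1]
    return None


main_part = extract_main_part(
    ["<noun>", "は", "なぜ", "<adjective>", "の", "か"],
    interrogatives={"なぜ", "どう", "何"},
)
print(main_part)
```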

Furthermore, the similar-format question sentence extraction unit determines that the two are similar only when the interrogatives contained in the reference question-sentence format main part and the question-sentence format main part match.
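A gated comparison of this kind might look like the following sketch. The gate on matching interrogatives comes from the text above; the similarity itself (fraction of aligned positions that agree) is an assumption chosen for illustration, as are the function name and the example data.

```python
def main_part_similarity(query_part, ref_part, interrogatives):
    """Position-wise agreement of two main parts, gated on the interrogative.

    Returns 0.0 whenever the interrogatives differ, so such a pair can never
    be judged similar. The agreement ratio is an assumed similarity measure.
    """
    q_wh = [t for t in query_part if t in interrogatives]
    r_wh = [t for t in ref_part if t in interrogatives]
    if not q_wh or q_wh != r_wh:
        return 0.0
    matches = sum(q == r for q, r in zip(query_part, ref_part))
    return matches / max(len(query_part), len(ref_part))


WH = {"なぜ", "どう"}
query = ["<pad>", "<noun>", "は", "なぜ", "<adjective>", "の", "か"]
same_wh = ["<noun>", "<noun>", "は", "なぜ", "<verb>", "の", "か"]
other_wh = ["<pad>", "<noun>", "は", "どう", "<adjective>", "の", "か"]

sim_same = main_part_similarity(query, same_wh, WH)    # 5 of 7 positions agree
sim_other = main_part_similarity(query, other_wh, WH)  # gate closes: 0.0
print(sim_same, sim_other)
```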

Furthermore, the response format element correlation calculation unit uses the square root of the chi-square value as the question format correlation degree, which indicates the degree to which a response format element is related in format to the similar-format question sentences.
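One way to obtain such a value is the standard chi-square statistic for a 2x2 contingency table built from the ID counts. The sketch below assumes that the total number of reference response examples (`n_total`) is also available, and the function and parameter names are hypothetical.

```python
import math


def sqrt_chi_square(n_both, n_similar, n_element, n_total):
    """Square root of the 2x2 chi-square statistic.

    n_both    -- IDs in both the similar-question set and the element set
    n_similar -- IDs in the similar-format question sentence set
    n_element -- IDs in the element-containing answer sentence set
    n_total   -- all reference response IDs (assumed to be known)
    """
    a = n_both
    b = n_similar - n_both
    c = n_element - n_both
    d = n_total - n_similar - n_element + n_both
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a margin is empty, so no association can be measured
    chi2 = n_total * (a * d - b * c) ** 2 / denom
    return math.sqrt(chi2)


associated = sqrt_chi_square(n_both=30, n_similar=50, n_element=40, n_total=1000)
independent = sqrt_chi_square(n_both=2, n_similar=20, n_element=10, n_total=100)
print(associated, independent)
```

The statistic is non-negative, so by itself it does not distinguish an element that is over-represented among the similar questions from one that is under-represented.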

Furthermore, the response format element correlation calculation unit uses the Dice coefficient as the question format correlation degree, which indicates the degree to which a response format element is related in format to the similar-format question sentences.
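The Dice coefficient of two ID sets is twice the size of their intersection divided by the sum of their sizes. A minimal sketch, with hypothetical names and toy ID sets:

```python
def dice_coefficient(similar_ids, element_ids):
    """Dice coefficient 2|S ∩ A| / (|S| + |A|) of two sets of reference response IDs."""
    total = len(similar_ids) + len(element_ids)
    if total == 0:
        return 0.0
    return 2 * len(similar_ids & element_ids) / total


similar_ids = {1, 2, 3, 4}   # IDs of similar-format question sentences
element_ids = {3, 4, 5}      # IDs whose answers contain the response format element
dice = dice_coefficient(similar_ids, element_ids)
print(dice)  # 2 * 2 / (4 + 3)
```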

Furthermore, the response format element correlation calculation unit uses mutual information as the question format correlation degree, which indicates the degree to which a response format element is related in format to the similar-format question sentences.
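This section does not say which variant of mutual information is meant. The sketch below assumes the pointwise form, log2 of P(S, A) / (P(S) P(A)), estimated from ID counts; the total count `n_total` and all names are assumptions.

```python
import math


def pointwise_mutual_information(n_both, n_similar, n_element, n_total):
    """Pointwise mutual information (base 2) estimated from ID counts.

    Returns negative infinity when the two sets never co-occur.
    """
    if n_both == 0:
        return float("-inf")
    return math.log2(n_both * n_total / (n_similar * n_element))


pmi_associated = pointwise_mutual_information(30, 50, 40, 1000)   # log2(15)
pmi_independent = pointwise_mutual_information(2, 20, 10, 100)    # log2(1)
print(pmi_associated, pmi_independent)
```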

Furthermore, the answer candidate extraction unit takes the sentence scores in the order in which the related sentences appear in the related document and selects, as an answer candidate, the sentence number of each related sentence whose sentence score is a local maximum.

Furthermore, the answer candidate extraction unit also includes in the answer candidates the sentence numbers of the related sentences before and after the local maximum whose sentence scores exceed a predetermined ratio of that local maximum.
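The two selection rules above (local maxima, then neighbours above a fraction of the peak) can be sketched as follows. The ratio value 0.8, the contiguous outward expansion, the plateau handling, and the assumption of non-negative scores are illustrative choices rather than details given in this section.

```python
def extract_candidates(scores, ratio=0.8):
    """Return sorted sentence indices chosen as answer candidates.

    A sentence is a peak if its score is above its left neighbour and not
    below its right neighbour. Around each peak, adjacent sentences are
    added while their score stays above ratio * peak score.
    """
    selected = set()
    n = len(scores)
    for i, s in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("-inf")
        right = scores[i + 1] if i < n - 1 else float("-inf")
        if s > left and s >= right:
            selected.add(i)
            threshold = ratio * s
            j = i - 1
            while j >= 0 and scores[j] > threshold:
                selected.add(j)
                j -= 1
            j = i + 1
            while j < n and scores[j] > threshold:
                selected.add(j)
                j += 1
    return sorted(selected)


candidates = extract_candidates([0.1, 0.5, 0.9, 0.8, 0.2, 0.3, 0.1])
print(candidates)  # [2, 3, 5]
```

Because neighbouring sentences join a peak only when their own scores are high, the extracted answer can span one sentence or several, which addresses the fixed-size limitation described earlier for the prior methods.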

The program according to the present invention is for a computer that serves as a question answering system, the system receiving a question sentence, extracting sentences suited as answers to the question sentence from a group of documents to be searched, and outputting them as response sentences, and having:
a reference response example storage unit that stores a plurality of reference response examples, each consisting of a pair of a reference question sentence and a reference answer sentence;
a response example format storage unit for storing a plurality of pairs of a reference question-sentence formalized word string and a reference answer-sentence formalized word string, each in association with a reference response ID;
a similar-format question sentence set storage unit for storing the group of reference response IDs of similar-format question sentences as a similar-format question sentence set;
a response format element correlation table for storing the question format correlation degree calculated for each response format element;
a related word table for storing the relevance degree calculated for each related word;
a related document storage unit for storing related documents with a sentence number associated with each related sentence contained in them; and
a sentence score table for storing the sentence score of each related sentence in association with its sentence number.
The program is characterized by causing the computer to execute the following procedures:
(1) A question sentence input procedure that inputs a question sentence.
(2) A question sentence formalization procedure that converts the input question sentence into a question-sentence formalized word string by a formalization process. The formalization process divides a sentence into words and analyzes the part-of-speech type and surface expression of each word; determines a word to be a word relating to the question-response format when it is a function word, an interrogative, or a predetermined word that tends to be the focus of a question, and otherwise determines it to be a word relating to the question-response content; converts words relating to the question-response format into their part-of-speech types and words relating to the question-response content into their surface expressions; and produces a formalized word string whose units are the converted part-of-speech types or surface expressions.
(3) A question sentence format main part extraction procedure that extracts a question-sentence format main part from the question-sentence formalized word string by a question sentence format main part extraction process, which extracts a formalized word string of a predetermined number of words containing the interrogative at a predetermined position.
(4) A reference response example formalization procedure that, for each reference response example in the reference response example storage unit, converts the reference question sentence into a reference question-sentence formalized word string by the formalization process and converts the reference answer sentence into a reference answer-sentence formalized word string by the formalization process.
(5) A similar-format question sentence extraction procedure that, for each reference question-sentence formalized word string in the response example format storage unit, extracts a reference question-sentence format main part by the question sentence format main part extraction process, compares it with the question-sentence format main part, and, when the two are identical or similar, identifies the reference response ID of the reference question-sentence formalized word string from which that main part was extracted as the reference response ID of a similar-format question sentence, that is, a question sentence whose format resembles that of the input question sentence.
(6) A response format element correlation calculation procedure that acquires from the response example format storage unit the reference answer-sentence formalized word string corresponding to the reference response ID of each similar-format question sentence; extracts in order, from each acquired string, response format elements, which are formalized word strings of a predetermined number of words smaller than the number of words in said formalized word string; for each response format element, searches whether it is contained in each reference answer-sentence formalized word string in the response example format storage unit and stores the group of reference response IDs of the strings that contain it as a response-format-element-containing answer sentence set; and, using at least the number of reference response IDs contained in both the similar-format question sentence set and the response-format-element-containing answer sentence set, the number of reference response IDs contained in the similar-format question sentence set, and the number of reference response IDs contained in the response-format-element-containing answer sentence set, calculates a question format correlation degree, which indicates the degree to which the response format element is related in format to the similar-format question sentences, based on the probability that a reference answer-sentence formalized word string containing the response format element is paired with a similar-format question sentence.
(7) A related word generation procedure that extracts keywords, which are content words, from the question sentence; searches the group of documents to be searched using the keywords as conditions; and, based on the frequencies of the words in the retrieved documents, extracts related words that are related in content to the question sentence and calculates the relevance degree of each related word.
(8) A related document search procedure that extracts keywords, which are content words, from the question sentence and searches the group of documents to be searched for related documents using the keywords as conditions.
(9) A sentence score calculation procedure that, for each related sentence, converts the related sentence into a related-sentence formalized word string by the formalization process; extracts the response format elements in order from that string; acquires the question format correlation degree of each response format element from the response format element correlation table; acquires the relevance degree, as a related word, of each word in the related sentence from the related word table; and, based on the acquired question format correlation degrees and relevance degrees, calculates a sentence score indicating the suitability of the sentence as an answer to the question sentence.
(10) An answer candidate extraction procedure that extracts, as answer candidates, the sentence numbers of related sentences whose sentence scores indicate high suitability.
(11) A response sentence output procedure that outputs, as response sentences, the related sentences identified by the sentence numbers of the answer candidates.

The question answering system according to the present invention is characterized by having:
a reference sentence formalization unit that treats a pair of a reference question sentence and a reference answer sentence as a reference sentence and performs, on at least the reference question sentence of the reference sentence, a formalization process that generalizes the description style;
a reference sentence storage unit that stores the formalized reference sentences produced by the reference sentence formalization unit;
an input question sentence formalization unit that performs a formalization process that generalizes the description style of an input question sentence;
a reference answer sentence extraction unit that searches for formalized reference sentences whose format is similar to that of the formalized input question sentence produced by the input question sentence formalization unit, and extracts from the reference sentence storage unit the reference answer sentences contained in those formalized reference sentences;
a description style evaluation unit that evaluates the description-style compatibility between the reference answer sentences and retrieved Web documents, which are Web documents obtained by searching with the input question sentence on a Web search engine;
a relevance evaluation unit that evaluates the content relevance between the retrieved Web documents and the input question sentence;
a score processing unit that performs scoring on retrieved Web documents that the description style evaluation unit has evaluated as compatible in description style with the reference answer sentences and that the relevance evaluation unit has evaluated as related to the content of the input question sentence; and
an answer sentence output unit that outputs an answer sentence to the input question sentence based on the score.

The program according to the present invention causes a computer serving as a question answering system to execute:
a reference sentence formalization procedure that treats a pair of a reference question sentence and a reference answer sentence as a reference sentence and performs a formalization process that generalizes the description style of at least the reference question sentence of the reference sentence;
a reference sentence storage procedure that stores the formalized reference sentence produced in the reference sentence formalization procedure;
an input question sentence formalization procedure that performs a formalization process that generalizes the description style of an input question sentence;
a reference answer sentence extraction procedure that searches for a formalized reference sentence whose form is similar to that of the formalized input question sentence produced in the input question sentence formalization procedure, and extracts the reference answer sentence contained in that formalized reference sentence;
a description style evaluation procedure that evaluates the conformity of description style between the reference answer sentence and a retrieved Web document, which is a Web document obtained by searching for the input question sentence with a Web search engine;
a relevance evaluation procedure that evaluates the relevance of content between the retrieved Web document and the input question sentence;
a score processing procedure that scores a retrieved Web document that the description style evaluation procedure has evaluated as conforming in description style to the reference answer sentence and that the relevance evaluation procedure has evaluated as relevant to the content of the input question sentence; and
an answer sentence output procedure that outputs an answer sentence to the input question sentence based on the score.

According to the present invention, the large collection of question and answer examples available on question-answering community sites is used only to judge the appropriateness of the description style (viewpoint 2), and a separate measure for viewpoint 1 is prepared and combined with it, so that non-factoid questions of any type can be answered. Furthermore, the appropriateness of the description style under viewpoint 2 is not judged by a learning-based method. Instead, the question side of the question and answer collection is searched with the question given by the user on the basis of description style similarity, and information about the description style of the answer is acquired dynamically from the corresponding answer examples. The measures for viewpoint 1 and viewpoint 2 can therefore be defined independently and still be integrated into a single evaluation measure that takes both into account at the same time. When more question and answer examples become available, there is no need to retrain; the new examples are simply registered.

Embodiment 1.
First, the operation of preparing reference response examples is described. The reference response examples are learning data for question answering by the question answering system; for example, an existing collection of questions and their corresponding answers is used.

FIG. 1 shows the configuration for the reference response example preparation process. The question answering system has: a reference response example generation unit 101, which generates reference response examples as learning data from a collection of question and answer examples; a reference response example storage unit 102, which stores each generated reference response example (a pair of a reference question sentence and a reference answer sentence) in association with a reference response ID; a reference response example formalization unit 103, which formalizes the reference question sentence and reference answer sentence of each reference response example according to a predetermined procedure; and a reference response example format storage unit 104, which stores each pair of a reference question sentence formalized word string and a reference answer sentence formalized word string in association with the reference response ID.

FIG. 2 shows the flow of the reference response example preparation process. This example uses a collection of questions and answers exchanged between users of a Web community service. Accordingly, in the reference response example generation process (S201) performed by the reference response example generation unit 101, when several answers correspond to one question, the answer that the questioner selected as the best answer is used as the reference answer sentence. In addition, only question and answer pairs in which the question consists of a single sentence and the answer contains no URL are selected, and each selected pair is stored in the reference response example storage unit 102 as a reference question sentence and a reference answer sentence in association with a reference response ID.
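The selection rule of S201 can be sketched as below. This is a minimal illustration, not the implementation of the generation unit 101: the `Answer` record, the input shape (`threads`), the URL pattern, the set of sentence-ending marks, and the choice to accept a lone answer when no best answer is flagged are all assumptions made for the example.

```python
import re
from dataclasses import dataclass

# Assumption: a URL is anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"https?://|www\.", re.IGNORECASE)
# Assumption: these marks end a sentence.
SENTENCE_END_CHARS = "。．.？?！!"


@dataclass
class Answer:
    text: str
    is_best: bool = False


def is_single_sentence(text):
    """True if no sentence-ending mark occurs before the final one."""
    core = text.strip().rstrip(SENTENCE_END_CHARS)
    return bool(core) and not any(ch in core for ch in SENTENCE_END_CHARS)


def build_reference_examples(threads):
    """threads: iterable of (question_text, list_of_Answer).

    Returns {reference_response_id: (reference_question, reference_answer)}.
    """
    examples = {}
    next_id = 1
    for question, answers in threads:
        if len(answers) == 1:
            chosen = answers[0]  # assumption: a lone answer is used as is
        else:
            # Several answers: keep the one the questioner marked as best.
            chosen = next((a for a in answers if a.is_best), None)
        if chosen is None:
            continue
        if not is_single_sentence(question):
            continue  # the question must be a single sentence
        if URL_PATTERN.search(chosen.text):
            continue  # the answer must not contain a URL
        examples[next_id] = (question, chosen.text)
        next_id += 1
    return examples
```

Each surviving pair receives a sequential ID here; the source only requires that a reference response ID be associated with each pair.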

FIG. 3 shows a configuration example of the reference response example storage unit. One record is provided for each reference response, and the items reference response ID 351, reference question sentence 352, and reference answer sentence 353 are stored in association with one another.

In the reference response example formalization process (S202) performed by the reference response example formalization unit 103, each question and answer is converted into information that discards its meaning as content and retains only its formal character as a question or an answer (called a formalized word string).

FIG. 4 shows the flow of the reference response example formalization process. The following processing is repeated for each reference response example in the reference response example storage unit 102 (S401). The reference question sentence is subjected to the question sentence formalization process (FIG. 6) to obtain a reference question sentence formalized word string (S402), and the reference answer sentence is subjected to the answer sentence formalization process (FIG. 6) to obtain a reference answer sentence formalized word string (S403). Both are stored in the reference response example format storage unit 104 in association with the reference response ID. This is done for all reference response examples (S404).

FIG. 5 shows a configuration example of the reference response example format storage unit. One record is provided for each reference response, and the items reference response ID 551, reference question sentence formalized word string 552, and reference answer sentence formalized word string 553 are stored in association with one another. As shown, a formalized word string is a sequence of words in which each word related to form is represented by its surface expression (reading) and each word related to content is represented by its part-of-speech type.

The question sentence formalization process and the answer sentence formalization process are now described in detail. FIG. 6 shows the flow of the question sentence formalization process and the answer sentence formalization process. In this example the two processes are identical. First, the target sentence (a reference question sentence or reference answer sentence, or a question sentence or related sentence described later) is morphologically analyzed to obtain the part-of-speech type and surface expression of each word (S601). Then, for each word (S602), the characteristic of the word is determined by the question response characteristic determination process (S603, FIG. 7). If the word is determined to be a word related to the question response form, its surface expression is appended to the target sentence formalized word string (a reference question sentence formalized word string or reference answer sentence formalized word string, or a question sentence formalized word string or related sentence formalized word string described later) (S604). If the word is determined to be a word related to the question response content, its part-of-speech type is appended to the target sentence formalized word string in the same way (S605). A delimiter symbol is placed between words so that word boundaries can be identified. Symbols such as punctuation marks are also treated as words. These steps are performed for all words (S606).
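The loop of FIG. 6 can be sketched as follows, using the mapping given in the abstract and in the description of FIG. 5: words related to form keep their surface reading, and words related to content are replaced by their part-of-speech type. The input is assumed to be the output of some morphological analyzer, already reduced to `(reading, pos)` pairs; the `is_format_word` predicate, the angle-bracket notation for part-of-speech types, and the `_` delimiter are illustrative choices, not taken from the source.

```python
def formalize(morphemes, is_format_word, delimiter="_"):
    """Convert analyzed words into a formalized word string.

    morphemes: list of (reading, pos) pairs, one per word (S601 output).
    is_format_word: callable (reading, pos) -> bool, standing in for S603.
    """
    tokens = []
    for reading, pos in morphemes:               # S602: loop over words
        if is_format_word(reading, pos):         # S603: characteristic check
            tokens.append(reading)               # form word: keep the reading
        else:
            tokens.append("<" + pos + ">")       # content word: keep POS only
    return delimiter.join(tokens)                # delimiter marks word units
```

With a toy predicate that treats particles, auxiliaries, `nani`, and `riyuu` as form words, the ten-word sequence `nihon ga make ta riyuu wa nani desu ka 。` becomes `<noun>_ga_<verb>_ta_riyuu_wa_nani_desu_ka_<symbol>`.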

The question response characteristic determination process (S603) mentioned above is now described. FIG. 7 shows the flow of the question response characteristic determination process. It is determined whether the part of speech of the word is a function word such as a particle or an auxiliary verb (S701); if so, the word is determined to be a word related to the question response form (S706). It is also determined whether the word is an interrogative (S702). Pronouns that serve as interrogatives include ナニ (nani), ドコ (doko), ダレ (dare), ナン (nan), ドチラ (dochira), ドレ (dore), ドッチ (dotchi), イツ (itsu), ドナタ (donata), イクツ (ikutsu), ドッカ (dokka), イズレ (izure), and ナアニ (naani). Adnominals that serve as interrogatives include ドノ (dono), ドンナ (donna), ドウイウ (douiu), and イカナル (ikanaru). Adverbs that serve as interrogatives include ドウ (dou), ナゼ (naze), ドウシテ (doushite), イクラ (ikura), and イツノマニ (itsunomani). Others include ッテナ (ttena) and ナニモノ (nanimono). The determination is made from the combination of part of speech and surface expression. If the word is an interrogative, it is determined to be a word related to the question response form (S706). Even for a content word, it is determined whether the word is one of the predetermined words that tend to be the focus of a question (S703); if so, it is determined to be a word related to the question response form (S706). Examples of words that tend to be the focus of a question are 理由 (reason), 方法 (method), 意味 (meaning), and 違い (difference). Verbs and adjectives that appear frequently in the reference response examples are also identified in advance, and it is determined whether the word is one of these predetermined frequent words (S704); if so, it is determined to be a word related to the question response form (S706). All other content words are determined to be words related to the question response content (S706). Step S704 may be omitted.
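The decision sequence of FIG. 7 can be sketched as a single predicate. The word lists below are only the examples named in the text (and only a subset of the interrogatives); the English part-of-speech labels, the function name, and the `frequent_words` parameter are assumptions for illustration, and a real system would take its labels from the morphological analyzer in use.

```python
# Assumed POS labels; a real analyzer would supply its own tag set.
FUNCTION_WORD_POS = {"particle", "auxiliary verb"}

# Interrogatives are identified by POS and reading together (S702).
INTERROGATIVES = {
    "pronoun": {"ナニ", "ドコ", "ダレ", "ナン", "ドチラ", "ドレ", "イツ"},
    "adnominal": {"ドノ", "ドンナ", "ドウイウ", "イカナル"},
    "adverb": {"ドウ", "ナゼ", "ドウシテ", "イクラ"},
}

# Content words that tend to be the focus of a question (S703).
FOCUS_WORDS = {"理由", "方法", "意味", "違い"}


def is_format_word(surface, reading, pos, frequent_words=frozenset()):
    """Return True if the word relates to question response form."""
    if pos in FUNCTION_WORD_POS:                   # S701: function word
        return True
    if reading in INTERROGATIVES.get(pos, ()):     # S702: interrogative
        return True
    if surface in FOCUS_WORDS:                     # S703: focus word
        return True
    # S704 (optional): frequent verbs and adjectives chosen in advance.
    if pos in ("verb", "adjective") and surface in frequent_words:
        return True
    return False                                   # other content words
```

Passing an empty `frequent_words` set corresponds to omitting step S704.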

By the operations described above, the question answering system prepares the reference response examples in advance. Next, the actual question answering operation is described.

FIG. 8 shows the flow of the question answering process. As shown in the figure, the following processes are performed in order: a question sentence input process (S801), which inputs a question sentence; a question sentence formalization process (S802), which formalizes the question sentence; a question sentence format main part extraction process (S803), which extracts from the formalized question sentence the main part of its form as a question; a similar-form question sentence extraction process (S804), which extracts from the reference response examples the reference question sentences whose main part is similar in form (identical or similar); a response format element correlation degree calculation process (S805), which formalizes the reference answer sentences corresponding to the similar reference question sentences and calculates, for each response format element consisting of a word string contained in a formalized reference answer sentence, its degree of correlation with the similar reference question sentences; a related word generation process (S806), which generates related words that are related in content to the question sentence; a related document search process (S807), which searches for related documents that are related in content to the question sentence; a sentence score calculation process (S808), which calculates, for each sentence contained in the related documents (called a related sentence), a sentence score expressing its suitability as a response to the question sentence; a solution candidate extraction process (S809), which extracts the candidate range to serve as the response solution based on the sentence scores; and a response sentence output process (S810), which outputs the extracted solution candidates as a response sentence.

In the following, these operations are described in two parts: the first half, from the question sentence input process (S801) to the response format element correlation degree calculation process (S805), and the second half, from the related word generation process (S806) to the response sentence output process (S810).

FIG. 9 shows the configuration for the processing from question sentence input to response format element correlation degree calculation. In addition to the reference response example format storage unit 104 described above, the question answering system has: a question sentence input unit 901, which performs the question sentence input process (S801); a question sentence storage unit 902, which stores the input question sentence; a question sentence formalization unit 903, which performs the question sentence formalization process (S802); a question sentence format storage unit 904, which stores the question sentence formalized word string; a question sentence format main part extraction unit 905, which performs the question sentence format main part extraction process (S803); a question sentence format main part storage unit 906, which stores the extracted question sentence format main part; a similar-form question sentence extraction unit 907, which performs the similar-form question sentence extraction process (S804); a similar-form question sentence set storage unit 908, which stores as a set the reference response IDs of the extracted similar-form reference question sentences; a response format element correlation degree calculation unit 909, which performs the response format element correlation degree calculation process (S805); a response format element-containing answer sentence set storage unit 910, which stores as a set the reference response IDs of the reference answer sentences that contain a response format element; and a response format element correlation degree table 911, which stores the question format correlation degree, that is, the degree to which a response format element is formally related to the similar-form question sentences.

An outline is now given of the procedure for extracting, from the reference question sentences in the reference response example storage unit 102, the reference question sentences whose form is similar to that of the question sentence. FIG. 10 shows an example of comparing a question sentence with reference question sentences. As described above, each reference question sentence 1006 is converted in advance into a reference question sentence formalized word string 1005 by the question sentence formalization process. When a question sentence 1001 is input, it is formalized in the same way to obtain a question sentence formalized word string 1002. From the question sentence formalized word string 1002, the question sentence format main part 1003 is then extracted; this is the main part of the form of the question sentence, centered on the interrogative (ナニ in this example). For each reference question sentence formalized word string 1005, the reference question sentence format main part 1004 is extracted in the same way, and the two are compared. If the comparison yields an exact match, the question sentence forms are judged to be identical. If the interrogatives match and the rest matches only partially, the forms are judged to be similar according to the degree of matching.

In the question sentence input process (S801), the question sentence entered by an operator's action or the like is stored in the question sentence storage unit 902. In the question sentence formalization process (S802), the question sentence stored in the question sentence storage unit 902 is formalized by the question sentence formalization process described above (FIG. 6) to generate a question sentence formalized word string, which is stored in the question sentence format storage unit 904.

Next, the question sentence format main part extraction process (S803) is described in detail. FIG. 11 shows the flow of the question sentence format main part extraction process. First, the interrogative in the question sentence formalized word string is identified (S1101). As described above, an interrogative can be identified by a predetermined part-of-speech type and surface expression. Then a word string of seven words, consisting of the interrogative, the three words before it, and the three words after it, is extracted and used as the question sentence format main part (S1102). In this example, the predetermined number of words before the interrogative is 3 and the predetermined number of words after the interrogative is 3.
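The window extraction of S1101 and S1102 can be sketched as below. The source does not say what happens when fewer than three words precede or follow the interrogative, or when a sentence contains several interrogatives; padding with a placeholder token and using the first interrogative found are assumptions made for this example, as are the function and constant names.

```python
PAD = "<pad>"  # assumption: placeholder for positions past the sentence edge


def extract_main_part(tokens, interrogatives, before=3, after=3):
    """Return the window around the first interrogative, or None.

    tokens: formalized word string split into words.
    interrogatives: set of readings that count as interrogatives.
    """
    for i, token in enumerate(tokens):
        if token in interrogatives:                  # S1101: find interrogative
            left = tokens[max(0, i - before):i]
            right = tokens[i + 1:i + 1 + after]
            # Pad so the interrogative always sits at the center position.
            left = [PAD] * (before - len(left)) + left
            right = right + [PAD] * (after - len(right))
            return left + [token] + right            # S1102: 7-word window
    return None                                      # no interrogative found
```

Applied to `<noun> ga <verb> ta riyuu wa nani desu ka <symbol>` with `nani` as the interrogative, the result is `ta riyuu wa nani desu ka <symbol>`, which has the same shape as the main part shown at 1003 in FIG. 10.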

Next, the similar-form question sentence extraction process (S804) is described in detail. FIG. 12 shows the flow of the similar-form question sentence extraction process. The following processing is repeated for each reference response example in the reference response example format storage unit 104 (S1201). First, the reference question sentence format main part is extracted from the reference question sentence formalized word string of the reference response example (S1202). The extraction procedure is the same as in the question sentence format main part extraction process described above (S803, FIG. 11). The extracted reference question sentence format main part is then compared with the question sentence format main part stored in the question sentence format main part storage unit 906. If the interrogatives at the center do not match (S1203), the similarity is set to 0 (S1205) and the pair is treated as not similar. If the interrogatives at the center match (S1203), the number of positions at which the words match is counted and used as the similarity (S1204). That is, each position where both the position and the word agree counts as 1, and the counts for the six positions other than the center are summed. Here, two words are treated as matching if their surface expressions agree when the words are surface expressions, or if their part-of-speech types agree when the words are part-of-speech types. When all reference response examples have been processed (S1206), similar-form question sentences are selected in descending order of the counted similarity (S1207). Possible methods are to set a selection count (1, or 2 or more) in advance and select until that count is reached, or to set in advance a lower limit on the similarity used as the selection criterion (7, the total number of words, or 6 or less) and select the similar-form question sentences at or above that lower limit. The reference response IDs that identify the selected similar-form question sentences are stored in the similar-form question sentence set storage unit 908. The set of similar-form question sentences can thus be handled using reference response IDs as identifiers.
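The position-wise comparison of S1203 to S1205 and the selection of S1207 can be sketched as follows. The function names, the dictionary keyed by reference response ID, and the decision to expose both selection methods (`top_k` and `min_similarity`) as optional parameters of one function are assumptions for illustration. Tokens are compared as plain strings, which covers both cases in the text as long as surface expressions and part-of-speech types are written distinctly.

```python
def similarity(query_part, reference_part):
    """Count matching positions, excluding the central interrogative."""
    center = len(query_part) // 2
    if query_part[center] != reference_part[center]:
        return 0                                    # S1205: not similar
    return sum(                                     # S1204: count matches
        1
        for i, (q, r) in enumerate(zip(query_part, reference_part))
        if i != center and q == r
    )


def select_similar(query_part, reference_parts, top_k=None, min_similarity=1):
    """Return reference response IDs in descending order of similarity.

    reference_parts: {reference_response_id: main_part_token_list}.
    """
    scored = [
        (similarity(query_part, part), ref_id)
        for ref_id, part in reference_parts.items()
    ]
    scored = [(s, ref_id) for s, ref_id in scored if s >= min_similarity]
    scored.sort(key=lambda item: (-item[0], item[1]))   # S1207: best first
    if top_k is not None:
        scored = scored[:top_k]
    return [ref_id for _, ref_id in scored]
```

Ties are broken by ascending ID here only to make the ordering deterministic; the source does not specify a tie-breaking rule.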

Next, the response format element correlation degree calculation process (S805) is described in detail. FIG. 13 shows the flow of the response format element correlation degree calculation process. The reference response ID of a similar-form question sentence stored in the similar-form question sentence set storage unit 908 is read, and the reference answer sentence formalized word string corresponding to that reference response ID is read from the reference response example format storage unit 104. From that reference answer sentence formalized word string, each pair of two consecutive words is extracted in order and used as a response format element. This is done for every similar-form question sentence stored in the similar-form question sentence set storage unit 908 (S1301). A two-word pair that duplicates one already extracted may be omitted. A response format element is a formalized word string in a unit smaller than the answer sentence formalized word string. In this example, the number of words is 2. The following processing is then repeated for each extracted response format element (S1302).
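The extraction in S1301 amounts to collecting consecutive word pairs (bigrams) from each formalized answer. A minimal sketch, with a hypothetical function name and with duplicate pairs dropped as the text permits:

```python
def response_format_elements(tokens):
    """Return the distinct consecutive two-word pairs, in first-seen order.

    tokens: a reference answer sentence formalized word string, split
    into words.
    """
    seen = set()
    elements = []
    for pair in zip(tokens, tokens[1:]):   # every two consecutive words
        if pair not in seen:               # duplicates may be omitted
            seen.add(pair)
            elements.append(pair)
    return elements
```

To cover several similar-form question sentences, the same function can be applied to each of their answers and the results merged.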

First, the reference response example format storage unit 104 is searched for reference answer sentence formalized word strings that contain the response format element, and their reference response IDs are stored in the response format element-containing answer sentence set storage unit 910 as the response format element-containing answer sentence set (S1303).

Next, the degree of correlation between the response format element and the similar-form question sentences is obtained by a chi-square test. The chi-square value for this case is calculated as follows.
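The expression itself is not reproduced in this text. Reconstructed from the term-by-term description of S1304 and S1305 (four denominator terms, four terms inside the numerator parentheses, the squared difference multiplied by n), it would read as below, where |X| denotes the number of elements of X and the overline denotes the complement within the set of all reference response examples:

```latex
\chi^2 =
\frac{ n \left( |A \cap B| \, |\overline{A} \cap \overline{B}|
              - |\overline{A} \cap B| \, |A \cap \overline{B}| \right)^{2} }
     { |A| \; |\overline{A}| \; |B| \; |\overline{B}| }
```

This is the usual chi-square statistic for a 2 x 2 contingency table without continuity correction.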

Here, n is the total number of reference response examples, A is the set of reference response examples whose questions are similar-form question sentences, and B is the set of reference response examples whose answers contain the response format element. The correlation between the two is obtained from the frequency with which similar-form question sentences and response format element-containing answer sentences co-occur.

In the processing, the number of elements of each set appearing in the expression is calculated first (S1304). The total number of reference response examples n is obtained by counting the reference response IDs in the reference response example format storage unit 104. For the terms of the denominator: the number of reference response IDs contained in the similar-form question sentence set storage unit 908 is counted as the number of elements of set A and used as the first denominator term; the number of elements of set A is subtracted from n to give the number of elements of the complement of A, used as the second denominator term; the number of reference response IDs contained in the response format element-containing answer sentence set storage unit 910 is counted as the number of elements of set B and used as the third denominator term; and the number of elements of set B is subtracted from n to give the number of elements of the complement of B, used as the fourth denominator term. For the terms inside the parentheses of the numerator: the number of reference response IDs contained in both the similar-form question sentence set storage unit 908 and the response format element-containing answer sentence set storage unit 910 is counted as the number of elements of the intersection of A and B and used as the first term; the number of reference response IDs contained in neither storage unit is counted as the number of elements of the intersection of the complement of A and the complement of B and used as the second term; the number of reference response IDs not contained in storage unit 908 but contained in storage unit 910 is counted as the number of elements of the intersection of the complement of A and B and used as the third term; and the number of reference response IDs contained in storage unit 908 but not in storage unit 910 is counted as the number of elements of the intersection of A and the complement of B and used as the fourth term.

Next, the chi-square value is calculated from the number of elements of each set and the total number of reference response examples (S1305). First, the first, second, third, and fourth denominator terms are multiplied together to obtain the denominator. Next, the first and second terms inside the numerator parentheses are multiplied to obtain the leading product, the third and fourth terms inside the numerator parentheses are multiplied to obtain the trailing product, the difference between the leading product and the trailing product is taken, and the square of that difference is multiplied by the total number of reference response examples n to obtain the numerator. Finally, the numerator is divided by the denominator to give the chi-square value. The square root of the chi-square value is also calculated and stored in association with the response format element as its question format correlation degree (S1306). This processing is performed for all response format elements (S1307).
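Steps S1304 to S1306 can be sketched with Python sets of reference response IDs. The function name and argument names are hypothetical, and returning zero when any of the four denominator terms is empty is an assumption; the source does not say how that case is handled.

```python
import math


def question_format_correlation(all_ids, similar_ids, containing_ids):
    """Return (chi_square, question_format_correlation_degree).

    all_ids: IDs of all reference response examples (n = len(all_ids)).
    similar_ids: set A, IDs of the similar-form question sentences.
    containing_ids: set B, IDs whose answers contain the element.
    """
    universe = set(all_ids)
    n = len(universe)
    a = set(similar_ids) & universe
    b = set(containing_ids) & universe
    not_a = universe - a
    not_b = universe - b

    # S1304: denominator terms |A|, |not A|, |B|, |not B|.
    denominator = len(a) * len(not_a) * len(b) * len(not_b)
    if denominator == 0:
        return 0.0, 0.0  # assumption: undefined case treated as no correlation

    # S1305: leading and trailing products inside the numerator parentheses.
    leading = len(a & b) * len(not_a & not_b)
    trailing = len(not_a & b) * len(a & not_b)
    chi_square = n * (leading - trailing) ** 2 / denominator

    # S1306: the square root is stored as the correlation degree.
    return chi_square, math.sqrt(chi_square)
```

For ten examples with A = {1, 2, 3, 4} and B = {1, 2, 3, 5}, the denominator is 4 x 6 x 4 x 6 = 576 and the numerator is 10 x (3 x 5 - 1 x 1)^2 = 1960, giving a chi-square value of about 3.40.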

The above processing generates the response format element correlation degree table 911. FIG. 14 shows the response format element correlation degree table. One record is provided per response format element, and each record stores the response format element 1451, the chi-square value 1452, and the question format correlation degree 1453 in association with one another. In this example the chi-square value is also stored, but it can be omitted.

The example of FIG. 14 shows that, for the question sentence format main part タ＿リユウ＿ハ＿ナニ＿デス＿カ＿＜記号，句点，＊，＊＞ shown at 1003 in FIG. 10, response format elements 1451 such as タ＿リユウ, タ＿カラ, and リユウ＿ハ have a high formal correlation.

Next, the second half of the operation shown in FIG. 8, from the related word generation process (S806) to the response sentence output process (S810), is described.

FIG. 15 shows the configuration for the processing from related word generation to response sentence output. In addition to the question sentence storage unit 902 and the response format element correlation degree table 911 described above, the question answering system has: a related word generation unit 1501 that performs the related word generation process (S806); a related word table 1502 that stores related words whose content is related to the question sentence, together with their content relatedness; a related document search unit 1503 that performs the related document search process (S807); a related document storage unit 1504 that stores related documents whose content is related to the question sentence; a sentence score calculation unit 1505 that performs the sentence score calculation process (S808); a sentence score table 1506 that stores, for each related sentence in the related documents, a sentence score indicating its suitability as a response to the question sentence; a solution candidate extraction unit 1507 that performs the solution candidate extraction process (S809); a solution candidate storage unit 1508 that stores the candidate ranges judged to be response solutions on the basis of the sentence scores; and a response sentence output unit 1509 that performs the response sentence output process (S810).

Next, the related word generation process (S806) is described in detail. FIG. 16 shows the related word generation processing flow. First, a plurality of keywords are extracted from the question sentence stored in the question sentence storage unit 902. In this example, verb and adjective keywords, including compound words, are extracted from the question sentence (S1601). Queries are then generated by combining the keywords in turn. In this example, each query is an AND condition over a combination of three keywords. The following processing is repeated for each query (S1602). The query is submitted to a Web search, and a snippet set summarizing the search results is obtained (S1603). The question answering system is connected to the Internet and is configured so that it can search Web sites, HTML documents, and other content through a Web search site. Words contained in the snippet set (content words only; hereinafter called related word candidates) are then extracted, the content relatedness of each related word candidate is calculated, and the related words are identified. The formula for the content relatedness in this example is as follows.

$$T(w_j) = \max_i \frac{\mathrm{freq}(w_j, i)}{n_i}$$

In the formula, w_j is each word (each related word candidate), q_i is each query, n_i is the number of snippets obtained for query q_i, freq(w_j, i) is the snippet frequency of word w_j in the snippet set obtained for query q_i, and T(w_j) is the content relatedness of word w_j.

The following processing is repeated for each related word candidate (S1604). The frequency of the related word candidate in the snippet set is calculated, and this snippet frequency is divided by the number of snippets to give the normalized snippet frequency (S1605). The normalized snippet frequency is obtained for all related word candidates (S1606). Once this processing has been performed for all planned keyword combinations (S1607), the normalized snippet frequencies of each related word candidate are compared across queries, and the largest one is taken as its content relatedness. If the content relatedness is equal to or greater than a predetermined threshold, the related word candidate is judged to be a related word, and the related word and its relatedness are stored in the related word table 1502 (S1608). A form in which all related word candidates are treated as related words, without the threshold judgment, is also effective.
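Steps S1604 to S1608 amount to taking, for each word, the maximum over queries of freq(w_j, i) / n_i. The following is a minimal sketch under stated assumptions: the function name and input layout are hypothetical, each snippet is assumed to be already reduced to a list of content words, and "snippet frequency" is assumed to mean the number of snippets that contain the word.

```python
from collections import Counter


def related_words(snippets_per_query, threshold=0.0):
    """Map each related word to its content relatedness T(w).

    snippets_per_query: one entry per query; each entry is a list of
    snippets, and each snippet is a list of content words.
    """
    relatedness = {}
    for snippets in snippets_per_query:
        n_i = len(snippets)
        if n_i == 0:
            continue  # a query with no results contributes nothing
        # Count, per word, how many snippets of this query contain it.
        freq = Counter(word for snippet in snippets for word in set(snippet))
        for word, count in freq.items():
            normalized = count / n_i
            # Keep the largest normalized snippet frequency over all queries.
            relatedness[word] = max(relatedness.get(word, 0.0), normalized)
    # Threshold judgment of S1608; threshold=0.0 keeps every candidate.
    return {w: t for w, t in relatedness.items() if t >= threshold}
```

Leaving the threshold at 0.0 corresponds to the variant in which every related word candidate is treated as a related word.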

FIG. 17 shows a configuration example of the related word table. One record is provided per related word, and each record stores the related word 1751 and the relatedness 1752 in association with each other.

The example of FIG. 17 shows that, for the question sentence shown at 1001 in FIG. 10, "What is the reason that the 'Gusuku Sites and Related Properties of the Kingdom of Ryukyu' were registered as a World Heritage site?", related words 1751 such as 2000, 沖縄 (Okinawa), and 文化 (culture) are highly related in content.

Next, the related document search process (S807) is described in detail. FIG. 18 shows the related document search processing flow. First, a plurality of keywords are extracted from the question sentence stored in the question sentence storage unit 902. In this example, verb and adjective keywords, including compound words, are extracted from the question sentence (S1801). Queries are then generated by combining the keywords in turn. In this example, each query is an AND condition over a combination of a predetermined number of keywords. The following processing is repeated for each query (S1802). The query is submitted to a Web search, and the URL of each document in the search results is obtained (S1803). The HTML document is downloaded from each URL (S1804) and converted to plain text to form a related document (S1805). When the predetermined keyword combinations have been processed (S1806), each related sentence contained in each related document is assigned a sentence number that runs through the whole of the related document storage unit 1504, and the sentences are stored in order (S1807).
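The query generation and the sentence numbering of S1807 can be illustrated as follows. This is only a sketch: the function names are hypothetical, the " AND " query syntax is an assumption about the search site, and the Web search, download, and HTML-to-text steps are deliberately left out.

```python
from itertools import combinations


def build_and_queries(keywords, size=3):
    """Build one AND query per combination of `size` keywords."""
    return [" AND ".join(combo) for combo in combinations(keywords, size)]


def number_sentences(documents):
    """Assign sentence numbers that run across all related documents.

    documents: list of related documents, each a list of sentences
    (already converted to plain text and split into sentences).
    """
    numbered = {}
    number = 0
    for sentences in documents:
        for sentence in sentences:
            numbered[number] = sentence
            number += 1
    return numbered
```

Numbering sentences across the whole store, rather than per document, is what lets a later step compare each sentence score with those of its neighbours by sentence number alone.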

Next, the sentence score calculation process (S808) is described. For each related sentence, this process judges its suitability as a response, taking into account both its content relatedness and its formal correlation with the question sentence.

FIG. 19 shows the sentence score calculation processing flow. The related sentences contained in each related document stored in the related document storage unit 1504 are identified in order, and the following processing is repeated for each related sentence (S1901). The content evaluation term calculation process (S1902) evaluates the sentence on the basis of content relatedness, and the format evaluation term calculation process (S1903) evaluates it on the basis of formal relatedness. The formula for the sentence score, which serves as the overall evaluation index, is as follows.

$$\mathrm{Score}(S_i) = \frac{1}{\log(1 + |S_i|)} \left( \sum_{j=1}^{n} T(w_{ij}) \right)^{\alpha} \left( \sum_{k=1}^{m} \sqrt{\chi^2(b_{ik})} \right)^{1-\alpha}$$

In the formula, S_i is each related sentence, |S_i| is the number of words in S_i, w_ij is each word contained in related sentence S_i, n is the number of distinct words w_ij in related sentence S_i, b_ik is each response format element (in this example, two formalized words) contained in related sentence S_i, and m is the number of distinct response format elements b_ik in related sentence S_i. T is the relatedness of a related word, as before; the square root of the chi-square value is the question format correlation degree of a response format element; α is the evaluation weight between 0 and 1; and Score(S_i) is the sentence score of related sentence S_i.

To obtain an accurate suitability, the evaluation value must be divided by the sentence length so that the density of related words and response format elements in the related sentence is taken into account. The number of words could simply be used as the sentence length. In this example, however, because short sentences are often inappropriate as answers, the logarithm of the number of words is used as the length in order to lower the suitability of short sentences.

For this purpose, the logarithm of the length of the related sentence (1 + the number of words in the related sentence) is calculated (S1904), the content evaluation term and the format evaluation term are multiplied together, and the product is divided by the logarithm of the related sentence length to obtain the sentence score (S1905).

Here, the content evaluation term calculation process (S1902) and the format evaluation term calculation process (S1903) mentioned above are described in detail.

FIG. 20 shows the content evaluation term calculation processing flow. First, the words (content words only) contained in the related sentence are identified (S2001), and the relatedness of each word as a related word is obtained from the related word table 1502. The relatedness values of all the words are added to obtain the relatedness sum (S2002). The relatedness sum is then raised to the power α to give the content evaluation term value (S2003). The constant α used in this power calculation takes a value from 0 to 1 and expresses the weight of the content evaluation within the overall evaluation. The larger the value, the greater the weight given to content. For example, a value of 0 yields a sentence score based on the format evaluation alone, and a value of 1 yields a sentence score based on the content evaluation alone. The α value may be set in advance for the question answering system and read from storage, or it may be set by the operator when a response sentence is generated for a question sentence. In either case, the system has an evaluation weight input unit for entering the evaluation weight value and an evaluation weight storage unit for storing it, and this process reads the stored evaluation weight value and uses it as α in the power calculation.

FIG. 21 shows the format evaluation term calculation processing flow. The related sentence is subjected to the answer sentence formalization process (FIG. 6) in the same way as described above, giving a formalized related sentence word string (S2101). Each pair of consecutive words is then extracted in order from the formalized related sentence word string and used as a response format element (S2102). The question format correlation degree of each response format element is obtained from the response format element correlation degree table 911, and the question format correlation degrees of all the response format elements are added to obtain the question format correlation degree sum (S2103). The question format correlation degree sum is then raised to the power 1−α (the format weight) to give the format evaluation term value (S2104).
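The content term (S2001 to S2003), the format term (S2101 to S2104), and the length normalization (S1904, S1905) combine into one sentence score. The sketch below is illustrative only: the function and parameter names are hypothetical, the natural logarithm is assumed because the base is not stated, the sentence length is assumed to be the number of formalized words, and words or bigrams missing from the tables are assumed to contribute 0.

```python
import math


def sentence_score(content_words, formalized_words, relatedness,
                   correlation, alpha=0.5):
    """Score one related sentence.

    content_words: content words of the related sentence.
    formalized_words: the formalized word string of the same sentence.
    relatedness: dict word -> T(w), as in the related word table.
    correlation: dict (word, word) -> question format correlation degree.
    alpha: evaluation weight in [0, 1]; larger values favour content.
    """
    # Content evaluation term: sum of relatedness over distinct words.
    content_sum = sum(relatedness.get(w, 0.0) for w in set(content_words))
    # Response format elements: distinct pairs of consecutive words.
    bigrams = set(zip(formalized_words, formalized_words[1:]))
    format_sum = sum(correlation.get(b, 0.0) for b in bigrams)
    # Length term: log(1 + number of words).
    length = math.log(1 + len(formalized_words))
    if length == 0:
        return 0.0  # empty sentence; returning 0 is an assumption
    return (content_sum ** alpha) * (format_sum ** (1 - alpha)) / length
```

With alpha set to 1 the format term is raised to the power 0 and drops out, so the score depends on content alone; alpha set to 0 has the opposite effect.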

As shown in FIG. 19, the sentence score obtained is stored in the sentence score table 1506 in association with the sentence number (S1906), and the process ends when all related sentences have been processed (S1907).

This yields a sentence score for each related sentence, which is stored in the sentence score table 1506. FIG. 22 shows an example of the sentence score distribution. As the figure shows, there are regions where the sentence score is high. Next, the solution candidate extraction process (S809) extracts these regions as solution candidates 2201, 2202, and 2203.

FIG. 23 shows the solution candidate extraction processing flow. The following processing is repeated for each related sentence, following the sentence numbers in the sentence score table 1506 (S2301). It is determined whether the sentence score is a local maximum (S2302). A score is treated as a local maximum when it is larger than the sentence scores of the sentences immediately before and after it. If it is a local maximum, the sentence number of the related sentence is stored, in association with the local maximum sentence score, as both the first sentence number and the last sentence number of a solution candidate in the solution candidate storage unit 1508 (S2303). The solution candidate storage unit 1508 is configured to store, for each solution candidate, the first sentence number and the last sentence number that delimit the candidate range, together with the local maximum sentence score. Next, working backward one sentence at a time, it is determined whether the sentence score of the preceding related sentence is at least 1/2 of the local maximum. If it is, the first sentence number of the solution candidate is updated to the sentence number of that preceding sentence (S2304). As soon as a score is below 1/2 of the local maximum, this step ends. Likewise, working forward one sentence at a time, it is determined whether the sentence score of the following related sentence is at least 1/2 of the local maximum. If it is, the last sentence number of the solution candidate is updated to the sentence number of that following sentence (S2305). As soon as a score is below 1/2 of the local maximum, this step ends. The above processing is performed for all sentences (S2306). The value 1/2 is an example of a predetermined ratio.
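The peak detection and range extension of S2301 to S2306 can be sketched as below. The function name and the return layout are hypothetical, and the treatment of the first and last sentences (a missing neighbour is treated as lower than any score) is an assumption, since the text does not cover the boundaries.

```python
def extract_candidates(scores, ratio=0.5):
    """Return a list of (first_number, last_number, peak_score).

    scores: sentence scores indexed by sentence number.
    ratio: the predetermined ratio (1/2 in the example of the text).
    """
    candidates = []
    for i, score in enumerate(scores):
        # A missing neighbour counts as lower than any score (assumption).
        before = scores[i - 1] if i > 0 else float("-inf")
        after = scores[i + 1] if i + 1 < len(scores) else float("-inf")
        if not (score > before and score > after):
            continue  # not a local maximum
        first = last = i
        # Extend backward while the score stays at or above ratio * peak.
        while first > 0 and scores[first - 1] >= ratio * score:
            first -= 1
        # Extend forward under the same condition.
        while last + 1 < len(scores) and scores[last + 1] >= ratio * score:
            last += 1
        candidates.append((first, last, score))
    return candidates
```

Sorting the returned list by peak score in descending order gives the order in which candidates would be considered for output.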

Finally, the response sentence output process (S810) is described in detail. FIG. 24 shows the response sentence output processing flow. Solution candidates are identified in descending order of their local maximum sentence scores (S2401), and the related sentences from the first sentence number to the last sentence number of the solution candidate are obtained from the related document storage unit 1504 (S2402). The group of related sentences is then output as a response sentence (S2403). If the number of response sentences to be output is specified, the processing is repeated that number of times (S2404). If the amount of response text to be output is specified, the processing is repeated until that amount is reached. Alternatively, if response sentences are switched on the operator's instruction, the above processing is performed according to the instruction.

FIG. 25 shows examples of correct and incorrect answers. Correct answer 2501 and correct answer 2502 are examples of response sentences obtained for the question sentence 1001 of FIG. 10. Incorrect answer 2503 is an example that did not become a response sentence because its format evaluation was low.

As FIG. 25 shows, taking formal suitability into account also improves the suitability of the content.

Embodiment 2.
In the embodiment described above, seven words, consisting of the interrogative and the three words before and after it, were extracted as the question sentence format main part, but a different number of words can also be used.

The three preceding words are an example in which the predetermined number of words before the interrogative is 3; this number can instead be 1, 2, 4, 5, or 6 or more. The three following words are an example in which the predetermined number of words after the interrogative is 3; this number can likewise be 1, 2, 4, 5, or 6 or more. The predetermined numbers of preceding and following words need not be the same and may differ.

Embodiment 3.
In the embodiment described above, two words were extracted as a response format element, but a different number of words can also be used. A response format element can also consist of three, four, or five or more consecutive words.

The number of words in a response format element is effective as long as it is smaller than the number of words in the question sentence format main part.

Embodiment 4.
In the examples above, a Web search was performed using the hypertext system provided on the Internet as the information source, and snippets and related documents were obtained as search results. In other words, the World Wide Web (WWW) was the search target.

However, the present invention is also effective when the search targets another database. An effective response is obtained by substituting the summaries and documents obtained from the other database for the snippets and related documents described above.

Embodiment 5.
In the similar format question sentence extraction process described above (FIG. 12), the number of positions at which the words match was used as the similarity, but the similarity between question sentence format main parts may be calculated by other criteria.

For example, it is also effective to calculate the edit distance between the question sentence format main parts, treated as word strings, and to use the edit distance as the similarity. It is also effective to calculate the edit distance between the word strings before the interrogative, calculate the edit distance between the word strings after the interrogative, and use the sum of the two edit distances as the similarity.

The edit distance is a value that indicates how much two character strings differ. It is calculated as the minimum number of operations (deleting, inserting, or substituting a character) needed to transform one character string into the other. In this example, the edit distance is calculated over the formalized word strings described above instead of over character strings.
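A word-level edit distance of this kind can be sketched with the standard dynamic-programming (Levenshtein) recurrence. The function names, and the convention of passing the index of the interrogative in each word string, are hypothetical and used only for illustration.

```python
def edit_distance(a, b):
    """Minimum number of deletions, insertions, and substitutions
    needed to turn sequence a into sequence b (items may be words)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def split_edit_distance(a, a_q, b, b_q):
    """Sum of the edit distances before and after the interrogative.

    a_q and b_q are the indices of the interrogative in a and b.
    """
    return (edit_distance(a[:a_q], b[:b_q])
            + edit_distance(a[a_q + 1:], b[b_q + 1:]))
```

Because an edit distance grows as the two word strings diverge, a smaller value indicates a closer match, so any threshold or ranking that consumes it has to be oriented accordingly.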

Embodiment 6.
In the similar format question sentence extraction process described above (FIG. 12), the number of positions at which the words match was used as the similarity, but it is also effective to weight each position.

For example, a similarity that emphasizes the neighborhood of the interrogative can be obtained by adding a large value for a match at a position close to the interrogative and a small value for a match at a position far from it.

Conversely, a similarity that de-emphasizes the neighborhood of the interrogative can be obtained by adding a small value for a match at a position close to the interrogative and a large value for a match at a position far from it.
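Both weighting directions can be expressed with one function that takes the weight as a function of the distance from the interrogative. This is a sketch only: the function name, the alignment of the two main parts on a shared interrogative index, and the example weight functions are all assumptions.

```python
def weighted_similarity(a, b, q_index, weight):
    """Position-weighted match count of two aligned main parts.

    a, b: question sentence format main parts of equal length, aligned
    so that the interrogative sits at q_index in both.
    weight: function mapping the distance from the interrogative to
    the value added for a match at that position.
    """
    return sum(weight(abs(i - q_index))
               for i, (x, y) in enumerate(zip(a, b))
               if x == y)


def near_weight(distance):
    """Emphasize the neighborhood of the interrogative (example only)."""
    return 1 / (1 + distance)


def far_weight(distance):
    """De-emphasize the neighborhood of the interrogative (example only)."""
    return 1 + distance
```

Passing a weight function that always returns 1 reduces this to the plain match count used in the basic embodiment.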

Embodiment 7.
When calculating the similarity, weighting may also be applied according to the type of word. For example, it is effective to set a weight for each word type determined in S701 to S704 of FIG. 7 and, when the words match, to add that weight to obtain the similarity. One possibility is to set a large weight for words that tend to be the focus of a question.

Embodiment 8.
In the embodiment described above, the response format element correlation degree calculation process (FIG. 13) used the chi-square test to calculate the question format correlation degree of a response format element with respect to the question sentence, but the question format correlation degree can also be calculated by other criteria.

Taking the reference response examples as the universal set, the number of elements can be counted both for set A, the group of reference response IDs contained in the reference response example format storage unit 104, and for set B, the group of reference response IDs contained in the response format element-containing answer sentence set storage unit 910. Therefore, for example, the Dice coefficient can be used as the question format correlation degree.

In that case, the number of reference response IDs contained in the similar format question sentence set storage unit 908 is counted as the number of elements of set A and taken as the first denominator term value, the number of reference response IDs contained in the response format element-containing answer sentence set storage unit 910 is counted as the number of elements of set B and taken as the second denominator term value, and the two denominator term values are added to obtain the denominator value. The number of reference response IDs contained in both the similar format question sentence set storage unit 908 and the reference response example format storage unit 104 is counted as the number of elements of the intersection of set A and set B, and this count is multiplied by 2 to give the numerator value. The Dice coefficient is then calculated by dividing the numerator value by the denominator value and is used as the question format correlation degree.
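The Dice coefficient described here is 2|A ∩ B| / (|A| + |B|). A minimal sketch, assuming the two groups of reference response IDs are available as Python sets (the function name is hypothetical):

```python
def dice_coefficient(set_a, set_b):
    """Dice coefficient of two sets of reference response IDs."""
    denominator = len(set_a) + len(set_b)
    if denominator == 0:
        return 0.0  # both sets empty; returning 0 is an assumption
    return 2 * len(set_a & set_b) / denominator
```

The value ranges from 0 for disjoint sets to 1 for identical sets, and unlike the chi-square value it does not depend on the total number of reference response examples.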

Embodiment 9.
Mutual information can also be used as the question format correlation degree.

In that case, the number of reference response IDs contained in the similar format question sentence set storage unit 908 is counted as the number of elements of set A and taken as the first denominator term value, the number of reference response IDs contained in the response format element-containing answer sentence set storage unit 910 is counted as the number of elements of set B and taken as the second denominator term value, and the two denominator term values are multiplied to obtain the denominator value. The number of reference response IDs contained in both the similar format question sentence set storage unit 908 and the reference response example format storage unit 104 is counted as the number of elements of the intersection of set A and set B, and this count is multiplied by the total number of reference response examples to give the numerator value. The numerator value is divided by the denominator value, and the base-2 logarithm of the quotient is calculated to obtain the mutual information, which is used as the question format correlation degree.
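The quantity described is log2(n · |A ∩ B| / (|A| · |B|)), the pointwise form of mutual information. A minimal sketch, assuming the ID groups are Python sets and n is the total number of reference response examples; the function name is hypothetical, and returning negative infinity when the logarithm is undefined is an assumption, since the text does not cover that case.

```python
import math


def mutual_information(set_a, set_b, n):
    """Base-2 log of n * |A and B| / (|A| * |B|)."""
    joint = len(set_a & set_b)
    if joint == 0 or not set_a or not set_b:
        # The logarithm is undefined here; -inf is an assumed convention.
        return float("-inf")
    return math.log2(joint * n / (len(set_a) * len(set_b)))
```

A value of 0 means the two sets co-occur exactly as often as independence would predict; positive values indicate that the response format element appears with the similar format questions more often than chance.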

Each of these methods of calculating the question format correlation degree computes the degree to which a response format element is formally related to the similar format question sentences, on the basis of the probability that a formalized reference answer word string containing that response format element is paired with a similar format question sentence.

Embodiment 10.
A question answering system can be realized using a Web search engine. FIG. 26 shows the configuration of a question answering system that uses a Web search engine. The question answering system has a reference sentence formalization unit 2601, a reference sentence storage unit 2602, an input question sentence formalization unit 2603, a reference answer sentence extraction unit 2604, a Web search request unit 2605, a description style evaluation unit 2606, a relevance evaluation unit 2607, a score processing unit 2608, and an answer sentence output unit 2609.

FIG. 27 shows the processing flow of the question answering system that uses a Web search engine. In the reference sentence formalization process (S2701), the reference sentence formalization unit 2601 treats each pair of a reference question sentence and a reference answer sentence as a reference sentence and applies a formalization process that generalizes the description style to at least the reference question sentence. The reference sentence storage unit 2602 stores the formalized reference sentences produced by the reference sentence formalization unit 2601. In the input question sentence formalization process (S2702), the input question sentence formalization unit 2603 applies a formalization process that generalizes the description style of the input question sentence. In the reference answer sentence extraction process (S2703), the reference answer sentence extraction unit 2604 searches for formalized reference sentences whose format is similar to that of the formalized input question sentence produced by the input question sentence formalization unit 2603, and extracts the reference answer sentences contained in those formalized reference sentences from the reference sentence storage unit 2602.

In the Web search request process (S2704), the Web search request unit 2605 requests the Web search engine to search for documents on the Web using the input question sentence as the condition, and obtains retrieved Web documents as the result. In the description style evaluation process (S2705), the description style evaluation unit 2606 evaluates how well the description style of the retrieved Web documents matches that of the reference answer sentences. In the relevance evaluation process (S2706), the relevance evaluation unit 2607 evaluates the relevance of content between the retrieved Web documents and the input question sentence.

In the score processing (S2707), the score processing unit 2608 assigns scores to the search Web documents that the description style evaluation unit 2606 has evaluated as matching the reference answer sentence in description style and that the relevance evaluation unit 2607 has evaluated as related to the content of the input question sentence. Then, in the answer sentence output process (S2708), the answer sentence output unit 2609 obtains the answer sentence for the input question sentence from the score processing unit 2608 and outputs it.
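
The flow of steps S2704 to S2708 can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the implementation described here: the names `answer_question`, `search_web`, `style_fit`, and `relevance` are hypothetical, the evaluators are passed in as callables because this passage does not fix how they are computed, and combining the two evaluations by simple addition with zero thresholds is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScoredDocument:
    text: str
    score: float


def answer_question(
    question: str,
    reference_answers: List[str],
    search_web: Callable[[str], List[str]],
    style_fit: Callable[[str, List[str]], float],
    relevance: Callable[[str, str], float],
    style_threshold: float = 0.0,      # hypothetical cut-off
    relevance_threshold: float = 0.0,  # hypothetical cut-off
) -> List[ScoredDocument]:
    """Outline of S2704-S2708: search, evaluate, score, rank."""
    results = []
    for doc in search_web(question):            # S2704: Web search request
        style = style_fit(doc, reference_answers)  # S2705: description style
        rel = relevance(doc, question)             # S2706: content relevance
        # S2707: score only documents that pass both evaluations.
        # Adding the two values is an assumption of this sketch.
        if style > style_threshold and rel > relevance_threshold:
            results.append(ScoredDocument(doc, style + rel))
    # S2708: best-scoring documents first.
    return sorted(results, key=lambda d: d.score, reverse=True)
```

Passing the evaluators as parameters keeps the sketch independent of any particular search engine or scoring formula; the first element of the returned list plays the role of the answer sentence that is output.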

The question answering system is a computer, and each element can execute its processing by means of a program. The program can also be stored in a storage medium and read from the storage medium by the computer.

The hardware configuration of the question answering system will now be described. FIG. 28 is a diagram showing the hardware configuration of the question answering system. An arithmetic device 2801, a data storage device 2802, a memory 2803, a communication interface 2804, a data input device 2805, and a data output device 2806 are connected to a bus. The data storage device 2802 is, for example, a ROM (Read Only Memory) or a hard disk. The memory 2803 is typically a RAM (Random Access Memory). The program is normally stored in the data storage device 2802; once loaded into the memory 2803, it is read sequentially by the arithmetic device 2801 and executed. The communication interface 2804 is used for communication via a network. The data input device 2805 is used for inputting data. The data output device 2806 is used for outputting data.

FIG. 1 is a diagram showing the configuration related to the reference response example preparation process.
FIG. 2 is a diagram showing the reference response example preparation process flow.
FIG. 3 is a diagram showing a configuration example of the reference response example storage unit.
FIG. 4 is a diagram showing the reference response example formalization process flow.
FIG. 5 is a diagram showing a configuration example of the reference response example format storage unit.
FIG. 6 is a diagram showing the question sentence formalization process / answer sentence formalization process flow.
FIG. 7 is a diagram showing the question response characteristic determination process flow.
FIG. 8 is a diagram showing the question answering process flow.
FIG. 9 is a diagram showing the configuration related to the processing from question sentence input to response format element correlation degree calculation.
FIG. 10 is a diagram showing a comparison example of a question sentence and a reference question sentence.
FIG. 11 is a diagram showing the question sentence format main part extraction process flow.
FIG. 12 is a diagram showing the similar format question sentence extraction process flow.
FIG. 13 is a diagram showing the response format element correlation degree calculation process flow.
FIG. 14 is a diagram showing the response format element correlation degree table.
FIG. 15 is a diagram showing the configuration related to the processing from related word generation to response sentence output.
FIG. 16 is a diagram showing the related word generation process flow.
FIG. 17 is a diagram showing a configuration example of the related word table.
FIG. 18 is a diagram showing the related document search process flow.
FIG. 19 is a diagram showing the sentence score calculation process flow.
FIG. 20 is a diagram showing the content evaluation term calculation process flow.
FIG. 21 is a diagram showing the format evaluation term calculation process flow.
FIG. 22 is a diagram showing an example of the distribution of sentence scores.
FIG. 23 is a diagram showing the solution candidate extraction process flow.
FIG. 24 is a diagram showing the response sentence output process flow.
FIG. 25 is a diagram showing examples of correct and incorrect answers.
FIG. 26 is a diagram showing the configuration of the question answering system that uses a Web search engine.
FIG. 27 is a diagram showing the processing flow of the question answering system that uses a Web search engine.
FIG. 28 is a diagram showing the hardware configuration of the question answering system.

Explanation of symbols

101 reference response example generation unit, 102 reference response example storage unit, 103 reference response example formatting unit, 104 reference response example format storage unit, 901 question sentence input unit, 902 question sentence storage unit, 903 question sentence formatting unit, 904 question sentence format storage unit, 905 question sentence format main part extraction unit, 906 question sentence format main part storage unit, 907 similar format question sentence extraction unit, 908 similar format question sentence set storage unit, 909 response format element correlation degree calculation unit, 910 response format element-containing answer sentence set storage unit, 911 response format element correlation degree table, 1501 related word generation unit, 1502 related word table, 1503 related document search unit, 1504 related document storage unit, 1505 sentence score calculation unit, 1506 sentence score table, 1507 solution candidate extraction unit, 1508 solution candidate storage unit, 1509 response sentence output unit, 2601 reference sentence formatting unit, 2602 reference sentence storage unit, 2603 input question sentence formatting unit, 2604 reference answer sentence extraction unit, 2605 Web search request unit, 2606 description style evaluation unit, 2607 relevance evaluation unit, 2608 score processing unit, 2609 answer sentence output unit.

Claims

A question answering system that receives a question sentence, extracts a sentence suitable as an answer to the question sentence from a document group to be searched, and outputs the sentence as a response sentence, the question answering system comprising the following elements:
(1) a question sentence input unit that inputs a question sentence;
(2) a question sentence formatting unit that converts the input question sentence into a question sentence formalized word string by a formalization process, the formalization process dividing a sentence into words, analyzing the part-of-speech type and the surface expression of each word, determining that a word is a word related to the question response format if it is a function word, an interrogative word, or a predetermined word that is likely to be the focus of a question, and otherwise determining that the word is a word related to the question response content, converting the words related to the question response format into surface expressions and the words related to the question response content into part-of-speech types, and forming a formalized word string whose units are the converted surface expressions or part-of-speech types;
(3) a question sentence format main part extraction unit that extracts a question sentence format main part from the question sentence formalized word string by a question sentence format main part extraction process, which extracts a formalized word string of a predetermined number of words containing the interrogative word at a predetermined position;
(4) a reference response example storage unit that stores a plurality of reference response examples, each consisting of a pair of a reference question sentence and a reference answer sentence;
(5) a reference response example formatting unit that, for each reference response example in the reference response example storage unit, converts the reference question sentence into a reference question sentence formalized word string by the formalization process and converts the reference answer sentence into a reference answer sentence formalized word string by the formalization process;
(6) a response example format storage unit that stores a plurality of pairs of the converted reference question sentence formalized word string and reference answer sentence formalized word string in association with reference response IDs;
(7) a similar format question sentence extraction unit that, for each reference question sentence formalized word string in the response example format storage unit, extracts a reference question sentence format main part by the question sentence format main part extraction process, compares it with the question sentence format main part, and, when the comparison result is identical or similar, identifies the reference response ID of the reference question sentence formalized word string from which that reference question sentence format main part was extracted as a reference response ID of a similar format question sentence whose format is similar to that of the input question sentence;
(8) a similar format question sentence set storage unit that stores the identified reference response IDs of the similar format question sentences as a similar format question sentence set;
(9) a response format element correlation degree calculation unit that acquires, from the response example format storage unit, the reference answer sentence formalized word string corresponding to the reference response ID of each similar format question sentence; extracts response format elements, each being a formalized word string of a predetermined number of words smaller than the number of words in the formalized word string, in order from the acquired reference answer sentence formalized word string; for each response format element, searches whether the response format element is contained in each reference answer sentence formalized word string in the response example format storage unit and stores the reference response IDs of the reference answer sentence formalized word strings containing the response format element as a response format element-containing answer sentence set; and, using at least the number of reference response IDs contained in both the similar format question sentence set and the response format element-containing answer sentence set, the number of reference response IDs contained in the similar format question sentence set, and the number of reference response IDs contained in the response format element-containing answer sentence set, calculates a question format correlation degree indicating the degree to which the response format element is related, as a format, to the similar format question sentences, based on the probability that a similar format question sentence is paired with a reference answer sentence formalized word string containing the response format element;
(10) a response format element correlation degree table that stores the question format correlation degree calculated for each response format element;
(11) a related word generation unit that extracts keywords that are content words from the question sentence, searches the document group to be searched for documents using the keywords as the condition, extracts related words that are related in content to the question sentence based on the appearance frequency of the words contained in the retrieved document group, and calculates the relevance degree of each related word;
(12) a related word table that stores the relevance degree calculated for each related word;
(13) a related document search unit that extracts keywords that are content words from the question sentence and searches the document group to be searched for related documents using the keywords as the condition;
(14) a related document storage unit that stores the retrieved related documents with a sentence number associated with each related sentence contained in the related documents;
(15) a sentence score calculation unit that, for each related sentence, converts the related sentence into a related sentence formalized word string by the formalization process, extracts response format elements in order from the related sentence formalized word string, acquires the question format correlation degree of each response format element from the response format element correlation degree table, acquires from the related word table the relevance degree, as a related word, of each word contained in the related sentence, and calculates a sentence score indicating suitability as a solution to the question sentence based on the acquired question format correlation degree of each response format element and the acquired relevance degree of each word;
(16) a sentence score table that stores the sentence score of each related sentence in association with its sentence number;
(17) a solution candidate extraction unit that extracts, as solution candidates, the sentence numbers of the related sentences whose sentence scores indicate high suitability; and
(18) a response sentence output unit that outputs the related sentences identified by the sentence numbers of the solution candidates as response sentences.

The question answering system according to claim 1, wherein at least one of "reason", "method", "meaning", or "difference" is used as the predetermined word that is likely to be a focus of the question.

The question answering system according to claim 1, wherein the formalization process also determines that a word is a word related to the question response format when the word is a predetermined verb or adjective having a high appearance frequency in the reference response examples.

The question answering system according to claim 1, wherein the formalized word string extracted by the question sentence format main part extraction process is a formalized word string of seven words in total, consisting of the interrogative word and the three words before and after it.
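
The formalization process and the seven-word main part can be illustrated with a short Python sketch. It follows the mapping given in the abstract (words related to the question response format keep their surface expression, and the other content words are replaced by their part-of-speech type). Everything else is an assumption of the sketch: the input is taken to be already segmented into (surface, part-of-speech) pairs, the part-of-speech labels in `FORMAT_POS` are hypothetical, and the English tokens merely stand in for the output of a morphological analyzer.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (surface expression, part-of-speech label)

# Focus words named in the claims; English glosses used for illustration.
FOCUS_WORDS = {"reason", "method", "meaning", "difference"}
# Hypothetical part-of-speech labels treated as format-related.
FORMAT_POS = {"particle", "auxiliary", "interrogative"}


def formalize(tokens: List[Token]) -> List[str]:
    """Keep format-related words as surface text; map the rest to <pos>."""
    out = []
    for surface, pos in tokens:
        if pos in FORMAT_POS or surface in FOCUS_WORDS:
            out.append(surface)
        else:
            out.append(f"<{pos}>")
    return out


def main_part(tokens: List[Token], width: int = 3) -> Optional[List[str]]:
    """Return the interrogative plus up to `width` words on each side.

    With width=3 this yields at most seven formalized words. Returns None
    when the sentence contains no interrogative word.
    """
    formalized = formalize(tokens)
    for i, (_, pos) in enumerate(tokens):
        if pos == "interrogative":
            return formalized[max(0, i - width): i + width + 1]
    return None
```

For example, `formalize([("sky", "noun"), ("is", "auxiliary"), ("why", "interrogative"), ("blue", "adjective")])` gives `["<noun>", "is", "why", "<adjective>"]`, so two questions that differ only in their content words yield the same formalized string.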

The question answering system according to claim 1, wherein the similar format question sentence extraction unit determines that the reference question sentence format main part and the question sentence included in the question sentence format main part are similar to each other.

The question answering system according to claim 1, wherein the response format element correlation degree calculation unit uses the square root of a chi-square value as the question format correlation degree indicating the degree to which a response format element is related, as a format, to the similar format question sentences.

The question answering system according to claim 1, wherein the response format element correlation degree calculation unit uses a Dice coefficient as the question format correlation degree indicating the degree to which a response format element is related, as a format, to the similar format question sentences.

The question answering system according to claim 1, wherein the response format element correlation degree calculation unit uses mutual information as the question format correlation degree indicating the degree to which a response format element is related, as a format, to the similar format question sentences.
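
The three correlation measures named above can be computed from the set sizes mentioned in claim 1. The following Python sketch uses the standard textbook definitions and is not taken from this document. Its assumptions: `n_both` is the number of reference response IDs in both sets, `n_q` the size of the similar format question sentence set, `n_e` the size of the response format element-containing answer sentence set, and `n_total` the total number of reference response examples (a quantity this sketch assumes is available). Reading "mutual information" as pointwise mutual information is also an assumption.

```python
import math


def dice(n_both: int, n_q: int, n_e: int) -> float:
    """Dice coefficient: 2|Q and E| / (|Q| + |E|)."""
    return 2.0 * n_both / (n_q + n_e) if (n_q + n_e) else 0.0


def sqrt_chi_square(n_both: int, n_q: int, n_e: int, n_total: int) -> float:
    """Square root of the chi-square statistic of the 2x2 contingency table."""
    a = n_both                           # in Q and in E
    b = n_q - n_both                     # in Q only
    c = n_e - n_both                     # in E only
    d = n_total - n_q - n_e + n_both     # in neither
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return math.sqrt(n_total * (a * d - b * c) ** 2 / denom)


def pointwise_mutual_information(
    n_both: int, n_q: int, n_e: int, n_total: int
) -> float:
    """log2( P(Q, E) / (P(Q) * P(E)) ), estimated from counts."""
    if n_both == 0 or n_q == 0 or n_e == 0:
        return float("-inf")
    return math.log2(n_total * n_both / (n_q * n_e))
```

All three grow when a response format element co-occurs with the similar format question sentences more often than chance would predict, which is the property the question format correlation degree is meant to capture.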

The question answering system according to claim 1, wherein the solution candidate extraction unit takes, as a solution candidate, the sentence number of a related sentence whose sentence score is a local maximum among the sentence scores arranged in the order of the related sentences in the related document.

The question answering system according to claim 9, wherein the solution candidate extraction unit also includes, as solution candidates, the sentence numbers of the related sentences before and after the local maximum whose sentence scores exceed a predetermined ratio of the local maximum value.
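
The selection of local maxima and their neighbors can be sketched in Python as follows. This is an illustrative sketch only: the function name `extract_candidates`, the default `ratio` of 0.8, the use of list indices as sentence numbers, the assumption that scores are non-negative, and the handling of ties (the first sentence of a plateau is treated as the peak) are all choices made for the example and are not specified in this document.

```python
from typing import List


def extract_candidates(scores: List[float], ratio: float = 0.8) -> List[int]:
    """Return sentence numbers (indices) of local maxima and of adjacent
    sentences whose score exceeds `ratio` times the local maximum."""
    selected = set()
    n = len(scores)
    for i, s in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("-inf")
        right = scores[i + 1] if i < n - 1 else float("-inf")
        if s > left and s >= right:          # local maximum
            selected.add(i)
            threshold = s * ratio
            j = i - 1
            while j >= 0 and scores[j] > threshold:   # extend backward
                selected.add(j)
                j -= 1
            j = i + 1
            while j < n and scores[j] > threshold:    # extend forward
                selected.add(j)
                j += 1
    return sorted(selected)
```

With scores `[0.1, 0.5, 1.0, 0.9, 0.2, 0.3, 0.1]`, sentences 2 and 5 are local maxima and sentence 3 exceeds 80% of the peak at sentence 2, so the sketch selects sentences 2, 3, and 5.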

A question answering system that receives a question sentence, extracts a sentence suitable as an answer to the question sentence from a document group to be searched, and outputs the sentence as a response sentence, the question answering system having:
a reference response example storage unit that stores a plurality of reference response examples, each consisting of a pair of a reference question sentence and a reference answer sentence;
a response example format storage unit that stores a plurality of pairs of a reference question sentence formalized word string and a reference answer sentence formalized word string in association with reference response IDs;
a similar format question sentence set storage unit that stores the reference response IDs of similar format question sentences as a similar format question sentence set;
a response format element correlation degree table that stores the question format correlation degree calculated for each response format element;
a related word table that stores the relevance degree calculated for each related word;
a related document storage unit that stores related documents with a sentence number associated with each related sentence contained in the related documents; and
a sentence score table that stores the sentence score of each related sentence in association with its sentence number;
with the following procedures being executed:
(1) a question sentence input procedure that inputs a question sentence;
(2) a question sentence formalization procedure that converts the input question sentence into a question sentence formalized word string by a formalization process, the formalization process dividing a sentence into words, analyzing the part-of-speech type and the surface expression of each word, determining that a word is a word related to the question response format if it is a function word, an interrogative word, or a predetermined word that is likely to be the focus of a question, and otherwise determining that the word is a word related to the question response content, converting the words related to the question response format into surface expressions and the words related to the question response content into part-of-speech types, and forming a formalized word string whose units are the converted surface expressions or part-of-speech types;
(3) a question sentence format main part extraction procedure that extracts a question sentence format main part from the question sentence formalized word string by a question sentence format main part extraction process, which extracts a formalized word string of a predetermined number of words containing the interrogative word at a predetermined position;
(4) a reference response example formalization procedure that, for each reference response example in the reference response example storage unit, converts the reference question sentence into a reference question sentence formalized word string by the formalization process and converts the reference answer sentence into a reference answer sentence formalized word string by the formalization process;
(5) a similar format question sentence extraction procedure that, for each reference question sentence formalized word string in the response example format storage unit, extracts a reference question sentence format main part by the question sentence format main part extraction process, compares it with the question sentence format main part, and, when the comparison result is identical or similar, identifies the reference response ID of the reference question sentence formalized word string from which that reference question sentence format main part was extracted as a reference response ID of a similar format question sentence whose format is similar to that of the input question sentence;
(6) a response format element correlation degree calculation procedure that acquires, from the response example format storage unit, the reference answer sentence formalized word string corresponding to the reference response ID of each similar format question sentence; extracts response format elements, each being a formalized word string of a predetermined number of words smaller than the number of words in the formalized word string, in order from the acquired reference answer sentence formalized word string; for each response format element, searches whether the response format element is contained in each reference answer sentence formalized word string in the response example format storage unit and stores the reference response IDs of the reference answer sentence formalized word strings containing the response format element as a response format element-containing answer sentence set; and, using at least the number of reference response IDs contained in both the similar format question sentence set and the response format element-containing answer sentence set, the number of reference response IDs contained in the similar format question sentence set, and the number of reference response IDs contained in the response format element-containing answer sentence set, calculates a question format correlation degree indicating the degree to which the response format element is related, as a format, to the similar format question sentences, based on the probability that a similar format question sentence is paired with a reference answer sentence formalized word string containing the response format element;
(7) a related word generation procedure that extracts keywords that are content words from the question sentence, searches the document group to be searched for documents using the keywords as the condition, extracts related words that are related in content to the question sentence based on the appearance frequency of the words contained in the retrieved document group, and calculates the relevance degree of each related word;
(8) a related document search procedure that extracts keywords that are content words from the question sentence and searches the document group to be searched for related documents using the keywords as the condition;
(9) a sentence score calculation procedure that, for each related sentence, converts the related sentence into a related sentence formalized word string by the formalization process, extracts response format elements in order from the related sentence formalized word string, acquires the question format correlation degree of each response format element from the response format element correlation degree table, acquires from the related word table the relevance degree, as a related word, of each word contained in the related sentence, and calculates a sentence score indicating suitability as a solution to the question sentence based on the acquired question format correlation degree of each response format element and the acquired relevance degree of each word;
(10) a solution candidate extraction procedure that extracts, as solution candidates, the sentence numbers of the related sentences whose sentence scores indicate high suitability; and
(11) a response sentence output procedure that outputs the related sentences identified by the sentence numbers of the solution candidates as response sentences.

A question answering system comprising:
a reference sentence formatting unit that treats a pair of a reference question sentence and a reference answer sentence as a reference sentence and performs a formalization process that generalizes the description style of at least the reference question sentence of the reference sentence;
a reference sentence storage unit that stores the formalized reference sentence produced by the reference sentence formatting unit;
an input question sentence formatting unit that performs a formalization process that generalizes the description style of an input question sentence;
a reference answer sentence extraction unit that searches for the formalized reference sentence whose format is similar to that of the formalized input question sentence produced by the input question sentence formatting unit, and extracts the reference answer sentence contained in that formalized reference sentence from the reference sentence storage unit;
a description style evaluation unit that evaluates the compatibility in description style between the reference answer sentence and a search Web document, which is a Web document obtained as a result of searching with the input question sentence using a Web search engine;
a relevance evaluation unit that evaluates the relevance in content between the search Web document and the input question sentence;
a score processing unit that assigns a score to a search Web document that the description style evaluation unit has evaluated as compatible in description style with the reference answer sentence and that the relevance evaluation unit has evaluated as related to the content of the input question sentence; and
an answer sentence output unit that outputs an answer sentence to the input question sentence based on the score.

A program that causes a computer serving as a question answering system to execute:
a reference sentence formalization procedure that treats a pair of a reference question sentence and a reference answer sentence as a reference sentence and performs a formalization process that generalizes the description style of at least the reference question sentence of the reference sentence;
a reference sentence storage procedure that stores the formalized reference sentence produced in the reference sentence formalization procedure;
an input question sentence formalization procedure that performs a formalization process that generalizes the description style of an input question sentence;
a reference answer sentence extraction procedure that searches for the formalized reference sentence whose format is similar to that of the formalized input question sentence produced in the input question sentence formalization procedure, and extracts the reference answer sentence contained in that formalized reference sentence;
a description style evaluation procedure that evaluates the compatibility in description style between the reference answer sentence and a search Web document, which is a Web document obtained as a result of searching with the input question sentence using a Web search engine;
a relevance evaluation procedure that evaluates the relevance in content between the search Web document and the input question sentence;
a score processing procedure that assigns a score to a search Web document that the description style evaluation procedure has evaluated as compatible in description style with the reference answer sentence and that the relevance evaluation procedure has evaluated as related to the content of the input question sentence; and
an answer sentence output procedure that outputs an answer sentence to the input question sentence based on the score.