JPWO2009113494A1 - Question answering system capable of descriptive answers using WWW as information source - Google Patents


Info

Publication number
JPWO2009113494A1
Authority
JP
Japan
Prior art keywords
sentence
question
response
format
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2010502807A
Other languages
Japanese (ja)
Other versions
JP5461388B2 (en)
Inventor
辰則 森
佐藤 充
円香 石下
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yokohama National University NUC
Original Assignee
Yokohama National University NUC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yokohama National University NUC
Priority to JP2010502807A (patent JP5461388B2)
Publication of JPWO2009113494A1
Application granted
Publication of JP5461388B2
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a technique that accepts a question written in ordinary language, extracts answer candidates from documents on the WWW, and presents them to the user; its object is to reflect the formal correlation between questions and answers. A formalization process turns question sentence 1001 into formalized question word string 1002: function words, interrogatives, and predetermined words that tend to be the focus of a question, such as "reason" and "meaning", are converted to their surface expressions, and all other content words are converted to their part-of-speech types. From this string, question form main part 1003, consisting of the interrogative and a predetermined number of words before and after it, is extracted. The reference questions in a collection of question-answer examples are converted in the same way, and the suitability of their form is judged from the similarity of reference question form main part 1004. The reference answers in the collection are also formalized, and their formal correlation with the question is obtained.
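
The formalization step described in the abstract can be illustrated with a minimal Python sketch. It assumes the sentence has already been split into (surface, part-of-speech) pairs by some morphological analyzer. The tag names, the romanized example tokens, and the names `FORM_POS`, `FOCUS_WORDS`, and `formalize` are hypothetical and do not come from the patent. The sketch follows the mapping stated in the abstract: form-bearing words keep their surface expression, and other content words are reduced to their part-of-speech type.

```python
# Hypothetical tag set and focus-word list; the patent does not fix these names.
FORM_POS = {"particle", "auxiliary", "interrogative"}
FOCUS_WORDS = {"riyuu", "houhou", "imi", "chigai"}  # reason, method, meaning, difference


def formalize(tokens):
    """Map (surface, pos) pairs to a formalized word string (a list of strings)."""
    formalized = []
    for surface, pos in tokens:
        if pos in FORM_POS or surface in FOCUS_WORDS:
            formalized.append(surface)      # form-bearing word: keep the surface expression
        else:
            formalized.append(f"<{pos}>")   # content word: keep only the part-of-speech type
    return formalized


# "sora wa naze aoi no ka" (why is the sky blue?), already tokenized by assumption
question = [("sora", "noun"), ("wa", "particle"), ("naze", "interrogative"),
            ("aoi", "adjective"), ("no", "particle"), ("ka", "particle")]
formalized_question = formalize(question)
```

For this input, `formalized_question` is `['<noun>', 'wa', 'naze', '<adjective>', 'no', 'ka']`: the topic of the question is abstracted away while the phrasing that signals the question type is preserved.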

Description

The present invention relates to a technique that accepts a question written in ordinary language, extracts answer candidates from documents on the WWW, and presents them to the user.

Question answering systems that use the WWW as an information source have been studied for some time. There are factoid systems, which handle questions answered by short expressions such as person names, place names, and quantities, and non-factoid systems, which handle questions answered by long descriptions such as definitions and reasons. This document focuses on the non-factoid type.

The suitability of a text as an answer candidate in non-factoid question answering is said to be measurable from two viewpoints: whether its content is relevant to the question (viewpoint 1), and whether its manner of answering, or description style, is appropriate for the question type (viewpoint 2). The question type is the kind of question being asked ("definition", "method", "reason", and so on). A description style is a manner of writing that contains expressions suited to the type. For a "reason", for example, these are expressions such as 「〜からである。」 ("it is because ...") and 「〜ために、…」 ("in order to ..., ...").

In the usual approach, the question type is estimated first and type-specific processing is then applied, particularly for viewpoint 2. This raises two problems: the accuracy of question type estimation, and the labor of manually preparing, for each question type, the clue expressions used to judge viewpoint 2.

In response, Non-Patent Document 1 proposes a learning-based non-factoid question answering method that answers without estimating the question type. It acquires description styles by using as training data the large number of question-answer examples found on community question answering sites, where people answer by hand.

Non-Patent Document 2 proposes a non-factoid question answering method that does not depend on the question type: it computes, from a set of question-answer examples taken from FAQ sites, the probability that an answer is "rewritten" into a question. Note that in both methods the set of question-answer examples, such as an FAQ, is not the information source from which answers are extracted. The information source is separate and may be digitized newspaper articles or documents on the WWW.

Non-Patent Document 1: Junta Mizuno and one other, "Analysis of real-world questions and a study of answer type determination methods for question answering targeting arbitrary answers", Proceedings of the 13th Annual Meeting of the Association for Natural Language Processing (2007), The Association for Natural Language Processing, March 2007, pp. 1002-1005
Non-Patent Document 2: Radu Soricut and one other, "Automatic Question Answering Using the Web", Journal of Information Retrieval, Special Issue on Web Information Retrieval, November 2006, Vol. 9, pp. 191-206

The method of Non-Patent Document 1 has the advantages that the question type need not be estimated explicitly and that clue expressions are learned automatically from the set of question-answer examples. It lacks flexibility in answer selection, however. By design, it cuts text into units of a predetermined size such as paragraphs, ranks them by viewpoint 1 to obtain answer candidates, then judges by viewpoint 2 whether each candidate is an answer and re-ranks the candidates by viewpoint 2. Since the extent of an answer is usually not fixed and varies from question to question, this method may return incomplete answers that are too short or too long. In addition, because re-ranking sorts by viewpoint 2 regardless of viewpoint 1, the importance of a candidate that comes from its content relevance (viewpoint 1) is not sufficiently reflected.

The method of Non-Patent Document 2 also requires the size of the answer text to be fixed in advance, and the answer length must be estimated separately from the question length. Furthermore, because measure 1 and measure 2 are handled together as word rewriting probabilities, the learned information mixes what concerns content with what concerns description style. Accuracy is therefore expected to fall unless the question-answer examples available for training cover the relevant expressions exhaustively.

The present invention uses the large collection of question-answer examples on community question answering sites only to judge the appropriateness of description style (viewpoint 2), and combines it with a separately prepared measure for viewpoint 1 to perform non-factoid question answering for arbitrary question types. The appropriateness of description style is not judged by a learned model. Instead, the question side of the collection is searched with the user's question on the basis of description-style similarity, and information on the description style of answers is obtained dynamically from the corresponding answer examples. This configuration has two effects: i) the measures for viewpoint 1 and viewpoint 2 can be defined independently and still be integrated into a single evaluation measure that considers both at once; ii) when more question-answer examples become available, no retraining is needed and the new examples are simply registered.

The question answering system according to the present invention receives a question sentence, extracts sentences suitable as answers to it from a document group to be searched, and outputs them as response sentences. It is characterized by having the following elements.
(1) A question input unit that receives the question sentence.
(2) A question formalization unit that converts the input question sentence into a formalized question word string by a formalization process. The formalization process divides a sentence into words and analyzes the part-of-speech type and surface expression of each word; judges a word to be a word related to question-answer form if it is a function word, an interrogative, or a predetermined word that tends to be the focus of a question, and otherwise judges it to be a word related to question-answer content; converts words related to question-answer form to part-of-speech types and words related to question-answer content to surface expressions; and produces a formalized word string whose units are the converted part-of-speech types or surface expressions.
(3) A question form main part extraction unit that extracts a question form main part from the formalized question word string by a main part extraction process, which takes out a formalized word string of a predetermined number of words containing the interrogative at a predetermined position.
(4) A reference response example storage unit that stores a plurality of reference response examples, each a pair of a reference question and a reference answer.
(5) A reference response example formalization unit that, for each reference response example in the reference response example storage unit, converts the reference question into a formalized reference question word string and the reference answer into a formalized reference answer word string by the formalization process.
(6) A response example form storage unit that stores a plurality of pairs of a formalized reference question word string and a formalized reference answer word string, each pair associated with a reference response ID.
(7) A similar-form question extraction unit that, for each formalized reference question word string in the response example form storage unit, extracts a reference question form main part by the main part extraction process and compares it with the question form main part; when the two are identical or similar, it identifies the reference response ID of that formalized reference question word string as the reference response ID of a similar-form question, that is, a question whose form resembles that of the input question.
(8) A similar-form question set storage unit that stores the group of reference response IDs of the identified similar-form questions as the similar-form question set.
(9) A response form element correlation calculation unit that obtains from the response example form storage unit the formalized reference answer word string corresponding to the reference response ID of each similar-form question; extracts from it, in order, response form elements, which are formalized word strings of a predetermined number of words smaller than the number of words in the formalized word string; for each response form element, searches whether it is contained in each formalized reference answer word string in the response example form storage unit and stores the group of reference response IDs of the strings that contain it as the set of answers containing the response form element; and, using at least the number of reference response IDs found in both the similar-form question set and the set of answers containing the response form element, the number in the similar-form question set, and the number in the set of answers containing the response form element, calculates a question form correlation that indicates how strongly the response form element is related in form to the similar-form questions, based on the probability that a formalized reference answer word string containing the element is paired with a similar-form question.
(10) A response form element correlation table that stores the question form correlation calculated for each response form element.
(11) A related word generation unit that extracts keywords, which are content words, from the question sentence, searches the document group for documents using the keywords as conditions, and, based on the frequency of the words in the retrieved documents, extracts related words that are related in content to the question sentence and calculates the relevance of each related word.
(12) A related word table that stores the relevance calculated for each related word.
(13) A related document search unit that extracts keywords, which are content words, from the question sentence and searches the document group for related documents using the keywords as conditions.
(14) A related document storage unit that stores the retrieved related documents with a sentence number associated with each related sentence they contain.
(15) A sentence score calculation unit that, for each related sentence, converts the sentence into a formalized related sentence word string by the formalization process, extracts the response form elements from it in order, obtains the question form correlation of each response form element from the response form element correlation table, obtains the relevance of each word of the sentence as a related word from the related word table, and, based on these question form correlations and relevances, calculates a sentence score indicating the suitability of the sentence as an answer to the question.
(16) A sentence score table that stores the sentence score of each related sentence in association with its sentence number.
(17) An answer candidate extraction unit that extracts, as answer candidates, the sentence numbers of related sentences whose sentence scores indicate high suitability.
(18) A response output unit that outputs the related sentences identified by the sentence numbers of the answer candidates as response sentences.
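
The counts that element (9) feeds into the correlation measure can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: response form elements are taken to be word n-grams with a fixed length `n=2` (the claim only requires a predetermined length shorter than the word string), and the names `ngrams` and `element_counts` and the toy data are hypothetical.

```python
def ngrams(words, n):
    """All contiguous n-word sequences of a formalized word string, in order."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def element_counts(answers, similar_ids, n=2):
    """answers: reference response ID -> formalized reference answer (list of words).
    similar_ids: IDs whose reference questions resemble the input question in form.
    Returns element -> (in both sets, similar-form set size, containing set size, total).
    """
    similar = set(similar_ids)
    # Set of reference response IDs whose answer contains each element.
    containing = {}
    for rid, words in answers.items():
        for element in set(ngrams(words, n)):
            containing.setdefault(element, set()).add(rid)
    # Only elements drawn from answers to similar-form questions are scored.
    counts = {}
    for rid in similar:
        for element in ngrams(answers[rid], n):
            ids = containing[element]
            counts[element] = (len(ids & similar), len(similar), len(ids), len(answers))
    return counts


answers = {
    1: ["<noun>", "kara", "da"],
    2: ["<noun>", "kara", "da"],
    3: ["<noun>", "kara", "<noun>"],
}
counts = element_counts(answers, similar_ids=[1, 2])
```

Here `counts[("kara", "da")]` is `(2, 2, 2, 3)`: the element occurs in both answers to similar-form questions and nowhere else, which is the pattern a correlation measure should reward. The total number of reference responses is carried along because most association measures need it, although the claim lists only the first three counts explicitly.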

Further, at least one of "reason", "method", "meaning", and "difference" is used as the predetermined word that tends to be the focus of a question.

Further, the formalization process also judges a word to be a word related to question-answer form when it is one of the predetermined verbs and adjectives that occur frequently in the reference response examples.

Further, the formalized word string taken out by the question form main part extraction process covers a total of seven words: the interrogative at the center and the three words before and after it.
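
A minimal sketch of this seven-word window, assuming the formalized word string is a list of strings and the interrogatives are known as a set. The function name `extract_main_part`, the example words, and the choice to truncate the window at sentence boundaries are assumptions; the text does not say how shorter sentences are handled.

```python
def extract_main_part(formalized, interrogatives, window=3):
    """Return the first interrogative with up to `window` words on each side."""
    for i, word in enumerate(formalized):
        if word in interrogatives:
            start = max(0, i - window)   # truncate at the sentence start (assumption)
            return formalized[start:i + window + 1]
    return []                            # no interrogative found


words = ["<noun>", "no", "<noun>", "wa", "naze", "<adjective>", "no", "desu", "ka"]
main_part = extract_main_part(words, {"naze", "nani", "dou"})
```

With the default `window=3`, `main_part` is `['no', '<noun>', 'wa', 'naze', '<adjective>', 'no', 'desu']`, seven words centered on the interrogative.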

Further, the similar-form question extraction unit judges the two main parts to be similar only when the interrogative in the reference question form main part matches the interrogative in the question form main part.

Further, the response form element correlation calculation unit uses the square root of the chi-square value as the question form correlation, which indicates how strongly a response form element is related in form to the similar-form questions.
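
The text names the chi-square statistic without spelling out the table behind it. The sketch below assumes the standard 2x2 contingency table over reference response IDs (similar-form question or not, answer contains the element or not). The function name and the `total` argument, the number of all reference responses, are assumptions added to make the table complete.

```python
import math


def sqrt_chi_square(both, similar, containing, total):
    """Square root of the chi-square statistic of a 2x2 contingency table."""
    a = both                   # similar-form question, answer contains the element
    b = similar - both         # similar-form question, element absent
    c = containing - both      # other question, answer contains the element
    d = total - similar - c    # other question, element absent
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:             # an empty margin gives no evidence either way
        return 0.0
    chi2 = total * (a * d - b * c) ** 2 / denom
    return math.sqrt(chi2)


strong = sqrt_chi_square(both=2, similar=2, containing=2, total=4)
none = sqrt_chi_square(both=1, similar=2, containing=2, total=4)
```

`strong` evaluates to 2.0 (the element appears exactly in the answers to similar-form questions) and `none` to 0.0 (the element is independent of question form). The statistic is unsigned, so a system relying on it alone cannot tell attraction from avoidance.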

Further, the response form element correlation calculation unit uses the Dice coefficient as the question form correlation, which indicates how strongly a response form element is related in form to the similar-form questions.
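
As an illustration, the Dice coefficient of the similar-form question set and the set of answers containing the element can be computed from three counts. The function name and the handling of two empty sets are assumptions.

```python
def dice(both, similar, containing):
    """Dice coefficient 2|A and B| / (|A| + |B|) of two ID sets, given their sizes."""
    if similar + containing == 0:   # both sets empty: treat as no association
        return 0.0
    return 2 * both / (similar + containing)


half = dice(both=2, similar=4, containing=4)
full = dice(both=3, similar=3, containing=3)
```

`half` is 0.5 and `full` is 1.0. Unlike the chi-square variant, this measure needs no total count, so it stays unchanged when unrelated examples are added to the collection.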

Further, the response form element correlation calculation unit uses mutual information as the question form correlation, which indicates how strongly a response form element is related in form to the similar-form questions.
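
The text gives no formula for the mutual information. The sketch below assumes the pointwise form commonly used for term association, log2 of P(A and B) / (P(A) P(B)), estimated from counts over reference response IDs. The function name, the `total` argument, and the value returned when the sets never co-occur are all assumptions.

```python
import math


def pointwise_mutual_information(both, similar, containing, total):
    """log2( P(similar and containing) / (P(similar) * P(containing)) )."""
    if both == 0:
        return float("-inf")   # never co-occur; callers may prefer to clip this
    return math.log2(both * total / (similar * containing))


attracted = pointwise_mutual_information(both=2, similar=2, containing=2, total=4)
independent = pointwise_mutual_information(both=1, similar=2, containing=2, total=4)
```

`attracted` is 1.0 and `independent` is 0.0. Pointwise mutual information tends to overrate rare elements, which is one reason an implementer might prefer one of the other two measures.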

Further, the answer candidate extraction unit takes the sentence scores in the order in which the related sentences appear in a related document and selects as answer candidates the sentence numbers of the related sentences whose scores are local maxima.

Further, the answer candidate extraction unit also includes in the answer candidates the sentence numbers of the preceding and following related sentences whose scores exceed a predetermined fraction of the local maximum.
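
The two selection rules above, local maxima plus neighbors above a fraction of the peak, can be sketched together. The sketch assumes non-negative scores, extends each candidate only over contiguous neighbors, and uses a hypothetical `ratio=0.8`; the text says only "a predetermined fraction". The function name is also an assumption.

```python
def extract_candidates(scores, ratio=0.8):
    """scores: sentence scores in document order. Returns lists of sentence indexes."""
    candidates = []
    for i, peak in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("-inf")
        right = scores[i + 1] if i + 1 < len(scores) else float("-inf")
        if peak > left and peak >= right:   # local maximum; a plateau counts once
            lo = hi = i
            # Grow the span while adjacent scores exceed the fraction of the peak.
            while lo > 0 and scores[lo - 1] > ratio * peak:
                lo -= 1
            while hi + 1 < len(scores) and scores[hi + 1] > ratio * peak:
                hi += 1
            candidates.append(list(range(lo, hi + 1)))
    return candidates


spans = extract_candidates([0.1, 0.9, 1.0, 0.3, 0.2, 0.6, 0.1])
```

`spans` is `[[1, 2], [5]]`: the peak at index 2 pulls in its high-scoring predecessor, while the peak at index 5 stands alone. Letting the span grow with the scores is what allows answers of varying length instead of fixed-size passages.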

The program according to the present invention is intended for a computer that serves as a question answering system, which receives a question sentence, extracts sentences suitable as answers to it from a document group to be searched, and outputs them as response sentences, and which has:
a reference response example storage unit that stores a plurality of reference response examples, each a pair of a reference question and a reference answer;
a response example form storage unit for storing a plurality of pairs of a formalized reference question word string and a formalized reference answer word string, each pair associated with a reference response ID;
a similar-form question set storage unit for storing the group of reference response IDs of similar-form questions as the similar-form question set;
a response form element correlation table for storing the question form correlation calculated for each response form element;
a related word table for storing the relevance calculated for each related word;
a related document storage unit for storing related documents with a sentence number associated with each related sentence they contain; and
a sentence score table for storing the sentence score of each related sentence in association with its sentence number.
The program is characterized by causing the computer to execute the following procedures.
(1) A question input procedure that receives the question sentence.
(2) A question formalization procedure that converts the input question sentence into a formalized question word string by a formalization process. The formalization process divides a sentence into words and analyzes the part-of-speech type and surface expression of each word; judges a word to be a word related to question-answer form if it is a function word, an interrogative, or a predetermined word that tends to be the focus of a question, and otherwise judges it to be a word related to question-answer content; converts words related to question-answer form to part-of-speech types and words related to question-answer content to surface expressions; and produces a formalized word string whose units are the converted part-of-speech types or surface expressions.
(3) A question form main part extraction procedure that extracts a question form main part from the formalized question word string by a main part extraction process, which takes out a formalized word string of a predetermined number of words containing the interrogative at a predetermined position.
(4) A reference response example formalization procedure that, for each reference response example in the reference response example storage unit, converts the reference question into a formalized reference question word string and the reference answer into a formalized reference answer word string by the formalization process.
(5) A similar-form question extraction procedure that, for each formalized reference question word string in the response example form storage unit, extracts a reference question form main part by the main part extraction process and compares it with the question form main part; when the two are identical or similar, it identifies the reference response ID of that formalized reference question word string as the reference response ID of a similar-form question, that is, a question whose form resembles that of the input question.
(6) A response form element correlation calculation procedure that obtains from the response example form storage unit the formalized reference answer word string corresponding to the reference response ID of each similar-form question; extracts from it, in order, response form elements, which are formalized word strings of a predetermined number of words smaller than the number of words in the formalized word string; for each response form element, searches whether it is contained in each formalized reference answer word string in the response example form storage unit and stores the group of reference response IDs of the strings that contain it as the set of answers containing the response form element; and, using at least the number of reference response IDs found in both the similar-form question set and the set of answers containing the response form element, the number in the similar-form question set, and the number in the set of answers containing the response form element, calculates a question form correlation that indicates how strongly the response form element is related in form to the similar-form questions, based on the probability that a formalized reference answer word string containing the element is paired with a similar-form question.
(7) A related word generation procedure that extracts keywords, which are content words, from the question sentence, searches the document group for documents using the keywords as conditions, and, based on the frequency of the words in the retrieved documents, extracts related words that are related in content to the question sentence and calculates the relevance of each related word.
(8) A related document search procedure that extracts keywords, which are content words, from the question sentence and searches the document group for related documents using the keywords as conditions.
(9) A sentence score calculation procedure that, for each related sentence, converts the sentence into a formalized related sentence word string by the formalization process, extracts the response form elements from it in order, obtains the question form correlation of each response form element from the response form element correlation table, obtains the relevance of each word of the sentence as a related word from the related word table, and, based on these question form correlations and relevances, calculates a sentence score indicating the suitability of the sentence as an answer to the question.
(10) An answer candidate extraction procedure that extracts, as answer candidates, the sentence numbers of related sentences whose sentence scores indicate high suitability.
(11) A response output procedure that outputs the related sentences identified by the sentence numbers of the answer candidates as response sentences.

The question answering system according to the present invention comprises:
a reference sentence formatting unit that takes a pair of a reference question sentence and a reference answer sentence as a reference sentence and performs, on at least the reference question sentence of the reference sentence, a formalization process that generalizes the description style;
a reference sentence storage unit that stores the formalized reference sentence produced by the reference sentence formatting unit;
an input question sentence formatting unit that performs a formalization process that generalizes the description style of an input question sentence;
a reference answer sentence extraction unit that searches for a formalized reference sentence whose format is similar to the formalized input question sentence produced by the input question sentence formatting unit, and extracts the reference answer sentence included in that formalized reference sentence from the reference sentence storage unit;
a description style evaluation unit that evaluates the compatibility of description style between the reference answer sentence and a search Web document, that is, a Web document obtained by searching for the input question sentence with a Web search engine;
a relevance evaluation unit that evaluates the relevance of content between the search Web document and the input question sentence;
a score processing unit that scores each search Web document that the description style evaluation unit has evaluated as compatible in description style with the reference answer sentence and that the relevance evaluation unit has evaluated as related to the content of the input question sentence; and
an answer sentence output unit that outputs an answer sentence to the input question sentence on the basis of the score.

The program according to the present invention causes a computer serving as a question answering system to execute:
a reference sentence formatting procedure that takes a pair of a reference question sentence and a reference answer sentence as a reference sentence and performs, on at least the reference question sentence of the reference sentence, a formalization process that generalizes the description style;
a reference sentence storage procedure that stores the formalized reference sentence produced in the reference sentence formatting procedure;
an input question sentence formatting procedure that performs a formalization process that generalizes the description style of an input question sentence;
a reference answer sentence extraction procedure that searches for a formalized reference sentence whose format is similar to the formalized input question sentence produced in the input question sentence formatting procedure, and extracts the reference answer sentence included in that formalized reference sentence;
a description style evaluation procedure that evaluates the compatibility of description style between the reference answer sentence and a search Web document, that is, a Web document obtained by searching for the input question sentence with a Web search engine;
a relevance evaluation procedure that evaluates the relevance of content between the search Web document and the input question sentence;
a score processing procedure that scores each search Web document that the description style evaluation procedure has evaluated as compatible in description style with the reference answer sentence and that the relevance evaluation procedure has evaluated as related to the content of the input question sentence; and
an answer sentence output procedure that outputs an answer sentence to the input question sentence on the basis of the score.

According to the present invention, the large collection of question/answer examples available on question answering community sites is used only to judge the appropriateness of the description style (viewpoint 2), and a separately prepared measure for viewpoint 1 is combined with it, so that non-factoid questions of any type can be answered. Moreover, the appropriateness of the description style under viewpoint 2 is not judged by a learning-based method. Instead, the question side of the question/answer example collection is searched with the question given by the user on the basis of description style similarity, and information about the description style of the answer is acquired dynamically from the corresponding answer examples. The measures for viewpoint 1 and viewpoint 2 can therefore be defined independently and still be integrated into a single evaluation measure that considers both at the same time. When more question/answer examples become available, no retraining is needed; the new examples are simply registered.

Embodiment 1.
First, the operation of preparing reference response examples is described. Reference response examples are learning data for question answering by the question answering system; for example, an existing question/answer example set that collects questions and their corresponding answers is used.

FIG. 1 shows the configuration for the reference response example preparation process. The question answering system has a reference response example generation unit 101 that generates reference response examples as learning data from a question/answer example set; a reference response example storage unit 102 that stores each generated reference response example (a pair of a reference question sentence and a reference answer sentence) in association with a reference response ID; a reference response example formatting unit 103 that formalizes the reference question sentence and the reference answer sentence of each reference response example according to a predetermined procedure; and a reference response example format storage unit 104 that stores each pair of a reference question sentence formalized word string and a reference answer sentence formalized word string in association with the reference response ID.

FIG. 2 shows the reference response example preparation process flow. This example uses a set of question/answer examples exchanged between users of a Web community service. Accordingly, in the reference response example generation process (S201) performed by the reference response example generation unit 101, when one question has multiple answers, the answer that the questioner selected as the best answer is used as the reference answer sentence. In addition, only question/answer pairs in which the question consists of a single sentence and the answer contains no URL are selected, and each is stored in the reference response example storage unit 102 as a reference response example, that is, as a reference question sentence and a reference answer sentence associated with a reference response ID.
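The selection rules of S201 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the input layout (`threads` with `question`, `answers`, `text`, `is_best`), the punctuation-based sentence counter, the URL pattern, and the rule of accepting a sole answer when no best answer is flagged are all hypothetical choices made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class ReferenceExample:
    ref_id: int      # reference response ID
    question: str    # reference question sentence
    answer: str      # reference answer sentence


# Assumption: a URL is recognised by a scheme prefix or a leading "www.".
URL_PATTERN = re.compile(r"https?://|www\.", re.IGNORECASE)
# Assumption: sentences end with Japanese or ASCII terminal punctuation.
SENTENCE_END = re.compile(r"[。．!?！？]+")


def count_sentences(text: str) -> int:
    """Count non-empty segments between sentence-final punctuation marks."""
    return sum(1 for part in SENTENCE_END.split(text) if part.strip())


def build_reference_examples(threads):
    """Keep single-sentence questions whose chosen answer has no URL."""
    examples = []
    for thread in threads:
        answers = thread["answers"]
        best = [a["text"] for a in answers if a.get("is_best")]
        if best:
            answer = best[0]              # answer chosen by the questioner
        elif len(answers) == 1:
            answer = answers[0]["text"]   # assumption: a sole answer is accepted
        else:
            continue
        if count_sentences(thread["question"]) != 1:
            continue                      # question must be one sentence
        if URL_PATTERN.search(answer):
            continue                      # answer must not contain a URL
        examples.append(ReferenceExample(len(examples) + 1, thread["question"], answer))
    return examples
```

Each returned `ReferenceExample` corresponds to one record of the kind held in the reference response example storage unit 102.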

FIG. 3 shows a configuration example of the reference response example storage unit. A record is provided for each reference response example, and the items reference response ID 351, reference question sentence 352, and reference answer sentence 353 are stored in association with one another.

In the reference response example formatting process (S202) performed by the reference response example formatting unit 103, each question and answer is converted into information that discards its content-related meaning and retains only its formal significance as a question or an answer (called a formalized word string).

FIG. 4 shows the reference response example formatting process flow. The following processing is repeated for each reference response example in the reference response example storage unit 102 (S401). The reference question sentence is subjected to the question sentence formatting process (FIG. 6) to obtain a reference question sentence formalized word string (S402), and the reference answer sentence is subjected to the answer sentence formatting process (FIG. 6) to obtain a reference answer sentence formalized word string (S403). Both are stored in the reference response example format storage unit 104 in association with the reference response ID. This is done for all reference response examples (S404).

FIG. 5 shows a configuration example of the reference response example format storage unit. A record is provided for each reference response example, and the items reference response ID 551, reference question sentence formalized word string 552, and reference answer sentence formalized word string 553 are stored in association with one another. As shown there, a formalized word string is a sequence of words in which form-related words appear as their surface expression (reading) and content-related words appear as their part-of-speech type.

The question sentence formatting process and the answer sentence formatting process are now described concretely. FIG. 6 shows the question sentence formatting process / answer sentence formatting process flow. In this example, the two processes are identical. First, the target sentence (a reference question sentence or reference answer sentence, or a question sentence or related sentence described later) is morphologically analyzed to obtain the part-of-speech type and surface expression of each word (S601). Then, for each word (S602), the characteristic of the word is determined by the question response characteristic determination process (S603, FIG. 7). If the word is judged to be related to the question/answer format, its surface expression is appended to the target sentence formalized word string (a reference question sentence formalized word string or reference answer sentence formalized word string, or a question sentence formalized word string or related sentence formalized word string described later) (S604). If the word is judged to be related to the question/answer content, its part-of-speech type is appended to the target sentence formalized word string in the same way (S605), so that the result has the structure shown in FIG. 5. A delimiter symbol is inserted between words so that word boundaries can be identified. Symbols such as punctuation marks are also treated as words. These steps are performed for all words (S606).

Next, the question response characteristic determination process (S603) mentioned above is described. FIG. 7 shows the question response characteristic determination process flow. First, it is determined whether the part of speech of the word is a function word such as a particle or an auxiliary verb (S701); if it is a function word, the word is judged to be related to the question/answer format (S706). It is also determined whether the word is an interrogative (S702). Interrogative pronouns include, for example, ナニ (nani), ドコ (doko), ダレ (dare), ナン (nan), ドチラ (dochira), ドレ (dore), ドッチ (dotchi), イツ (itsu), ドナタ (donata), イクツ (ikutsu), ドッカ (dokka), イズレ (izure), and ナアニ (naani). Interrogative adnominals include ドノ (dono), ドンナ (donna), ドウイウ (douiu), and イカナル (ikanaru). Interrogative adverbs include ドウ (dou), ナゼ (naze), ドウシテ (doushite), イクラ (ikura), and イツノマニ (itsunomani). Others include ッテナ (ttena) and ナニモノ (nanimono). The determination is made from the combination of part of speech and surface expression. If the word is an interrogative, it is judged to be related to the question/answer format (S706). Even if the word is a content word, it is determined whether it is one of a predetermined set of words that tend to be the focus of a question (S703); if so, it is judged to be related to the question/answer format (S706). Examples of such words are 理由 (reason), 方法 (method), 意味 (meaning), and 違い (difference). In addition, verbs and adjectives that appear frequently in the reference response examples are identified in advance, and it is determined whether the word is one of these predetermined frequent words (S704); if so, it is judged to be related to the question/answer format (S706). All other content words are judged to be related to the question/answer content (S706). Step S704 may be omitted.
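The determination of FIG. 7 and the formalization of FIG. 6 can be sketched together in Python. This is an illustrative sketch only: the input is assumed to be already morphologically analyzed into `(surface, pos)` pairs, the English part-of-speech tags, the abbreviated interrogative list, and the `<pos>` placeholder notation are hypothetical, and the output follows the structure described for FIG. 5 (form-related words keep their surface expression, content-related words are replaced by their part-of-speech type).

```python
from typing import FrozenSet, List, Tuple

Token = Tuple[str, str]  # (surface expression as a katakana reading, part-of-speech tag)

# Hypothetical tag names; a real morphological analyzer has its own tag set.
FUNCTION_POS = {"particle", "auxiliary"}
# Abbreviated sample of the interrogatives listed in the text.
INTERROGATIVES = {
    ("ナニ", "pronoun"), ("ドコ", "pronoun"), ("ダレ", "pronoun"),
    ("ドノ", "adnominal"), ("ドンナ", "adnominal"),
    ("ドウ", "adverb"), ("ナゼ", "adverb"), ("ドウシテ", "adverb"),
}
# Readings of 理由, 方法, 意味, 違い (words that tend to be the focus of a question).
FOCUS_WORDS = {"リユウ", "ホウホウ", "イミ", "チガイ"}


def is_form_word(token: Token, frequent_predicates: FrozenSet[str] = frozenset()) -> bool:
    """Return True when the word is related to the question/answer format."""
    surface, pos = token
    if pos in FUNCTION_POS:                       # S701: function word
        return True
    if (surface, pos) in INTERROGATIVES:          # S702: interrogative
        return True
    if surface in FOCUS_WORDS:                    # S703: question-focus word
        return True
    if pos in {"verb", "adjective"} and surface in frequent_predicates:
        return True                               # S704: frequent predicate (optional)
    return False                                  # otherwise a content word


def formalize(tokens: List[Token],
              frequent_predicates: FrozenSet[str] = frozenset()) -> List[str]:
    """Keep surface forms of form words; replace content words by '<pos>'."""
    return [
        surface if is_form_word((surface, pos), frequent_predicates) else f"<{pos}>"
        for surface, pos in tokens
    ]
```

For a tokenized question such as ニホン/ノ/シュト/ハ/ドコ/デス/カ, the two nouns become `<noun>` while the particles, the auxiliary, and the interrogative ドコ are kept as they are.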

Through the operations described above, the question answering system prepares the reference response examples in advance. Next, the actual question answering operation is described.

FIG. 8 shows the question answering process flow. As shown in the figure, the following processes are performed in order: a question sentence input process (S801) that inputs a question sentence; a question sentence formatting process (S802) that formalizes the question sentence; a question sentence format main part extraction process (S803) that extracts, from the formalized question sentence, the main part of its form as a question; a similar format question sentence extraction process (S804) that extracts, from the reference response examples, reference question sentences whose question format main part is similar (identical or close); a response format element correlation degree calculation process (S805) that formalizes the reference answer sentences for the similar reference question sentences and, for each response format element consisting of a word string contained in a formalized reference answer sentence, calculates its degree of correlation with the similar reference question sentences; a related word generation process (S806) that generates related words related in content to the question sentence; a related document search process (S807) that retrieves related documents related in content to the question sentence; a sentence score calculation process (S808) that calculates, for each sentence contained in the related documents (called a related sentence), a sentence score indicating its degree of suitability as a response to the question sentence; a solution candidate extraction process (S809) that extracts, on the basis of the sentence scores, the candidate range that becomes the response solution; and a response sentence output process (S810) that outputs the extracted solution candidates as the response sentence.

Below, these operations are described in two parts: the first half, from the question sentence input process (S801) to the response format element correlation degree calculation process (S805), and the second half, from the related word generation process (S806) to the response sentence output process (S810).

FIG. 9 shows the configuration for the processing from question sentence input to response format element correlation degree calculation. In addition to the reference response example format storage unit 104 described above, the question answering system has a question sentence input unit 901 that performs the question sentence input process (S801); a question sentence storage unit 902 that stores the input question sentence; a question sentence formatting unit 903 that performs the question sentence formatting process (S802); a question sentence format storage unit 904 that stores the resulting question sentence formalized word string; a question sentence format main part extraction unit 905 that performs the question sentence format main part extraction process (S803); a question sentence format main part storage unit 906 that stores the extracted question sentence format main part; a similar format question sentence extraction unit 907 that performs the similar format question sentence extraction process (S804); a similar format question sentence set storage unit 908 that stores, as a set, the reference response IDs of the extracted reference question sentences of similar format; a response format element correlation degree calculation unit 909 that performs the response format element correlation degree calculation process (S805); a response format element-containing answer sentence set storage unit 910 that stores, as a set, the reference response IDs of the reference answer sentences that contain a response format element; and a response format element correlation table 911 that stores the question format correlation degree, which is the degree to which a response format element is formally related to the similar format question sentences.

Here, an overview is given of the procedure for extracting, from the reference question sentences in the reference response example storage unit 102, those whose format is similar to the question sentence. FIG. 10 shows an example of comparing a question sentence with a reference question sentence. As described above, each reference question sentence 1006 is converted in advance into a reference question sentence formalized word string 1005 by the question sentence formatting process. When a question sentence 1001 is input, it is formalized in the same way to obtain a question sentence formalized word string 1002. From the question sentence formalized word string 1002, the question sentence format main part 1003 is extracted; this is the main part of the form of the question sentence, centered on the interrogative (ナニ in this example). For each reference question sentence formalized word string 1005, the reference question sentence format main part 1004 is extracted in the same way, and the two main parts are compared. If the comparison yields a complete match, the question sentence formats are judged to be identical. In the case of a partial match in which the interrogatives match, the formats are judged to be similar according to the degree of matching.

In the question sentence input process (S801), the question sentence entered by an operator's action or the like is stored in the question sentence storage unit 902. In the question sentence formatting process (S802), the question sentence stored in the question sentence storage unit 902 is formalized by the question sentence formatting process described above (FIG. 6) to generate a question sentence formalized word string, which is stored in the question sentence format storage unit 904.

Next, the question sentence format main part extraction process (S803) is described in detail. FIG. 11 shows the question sentence format main part extraction process flow. First, the interrogative in the question sentence formalized word string is identified (S1101). As described above, an interrogative can be identified from a predetermined part-of-speech type and surface expression. Then a word string of seven words, consisting of the interrogative together with the three words before it and the three words after it, is extracted and used as the question sentence format main part (S1102). In this example, the predetermined number of words before the interrogative is 3 and the predetermined number of words after the interrogative is 3.
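The window extraction of S1101 and S1102 can be sketched as below. The sketch assumes the formalized word string is a Python list in which interrogatives appear as katakana surface strings and content words as `<pos>` placeholders. The source does not say what happens when fewer than three words precede or follow the interrogative, or when several interrogatives occur; padding with `None` and using the first interrogative are assumptions made here.

```python
from typing import List, Optional, Set


def extract_main_part(formalized: List[str],
                      interrogatives: Set[str],
                      before: int = 3,
                      after: int = 3) -> Optional[List[Optional[str]]]:
    """Return the window of before + 1 + after words centred on an interrogative.

    Returns None when the word string contains no interrogative.
    """
    for i, word in enumerate(formalized):
        if word in interrogatives:                      # S1101: locate the interrogative
            left = formalized[max(0, i - before):i]
            right = formalized[i + 1:i + 1 + after]
            # Assumption: pad with None so the interrogative stays at the centre.
            left = [None] * (before - len(left)) + left
            right = right + [None] * (after - len(right))
            return left + [word] + right                # S1102: 7-word main part
    return None
```

With the defaults, the result always has seven positions and the interrogative sits at index 3, which keeps later position-by-position comparison simple.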

Next, the similar format question sentence extraction process (S804) is described in detail. FIG. 12 shows the similar format question sentence extraction process flow. The following processing is repeated for each reference response example in the reference response example format storage unit 104 (S1201). First, the reference question sentence format main part is extracted from the reference question sentence formalized word string of the reference response example (S1202). The extraction procedure is the same as in the question sentence format main part extraction process described above (S803, FIG. 11). The extracted reference question sentence format main part is then compared with the question sentence format main part stored in the question sentence format main part storage unit 906. If the central interrogatives do not match (S1203), the similarity is set to 0 (S1205) and the pair is treated as not similar. If the central interrogatives match (S1203), the number of matching words over the positions is counted and used as the similarity (S1204). That is, each position at which both the position and the word match counts as 1, and the matches over the six positions other than the center are summed. A word is treated as matching when the surface expressions agree, if the word is a surface expression, or when the part-of-speech types agree, if the word is a part-of-speech type. When all reference response examples have been processed (S1206), similar format question sentences are selected in descending order of the counted similarity (S1207). Possible selection methods are to set a selection count (1, or 2 or more) in advance and select until that count is reached, or to set in advance a lower limit of similarity as the selection criterion (7, the total number of words, or a value of 6 or less) and select the similar format question sentences whose similarity is at or above that limit. The reference response IDs that identify the selected similar format question sentences are stored in the similar format question sentence set storage unit 908. In this way, the similar format question sentence set, whose elements are similar format question sentences, can be handled using reference response IDs as identifiers.
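The comparison and selection of S1203 to S1207 can be sketched as follows. The sketch assumes each main part is a seven-element list with the interrogative at the center and that `None` marks an empty position (a hypothetical padding convention that never counts as a match). Because a center mismatch and a center-only match both score 0, this sketch excludes every candidate with similarity 0; the parameter names `top_k` and `min_similarity` are also assumptions.

```python
from typing import Dict, List, Optional

MainPart = List[Optional[str]]


def similarity(query: MainPart, reference: MainPart) -> int:
    """Count position-wise matches outside the centre; 0 if the centres differ."""
    center = len(query) // 2
    if query[center] != reference[center]:        # S1203 -> S1205
        return 0
    return sum(                                   # S1204
        1
        for i, (q, r) in enumerate(zip(query, reference))
        if i != center and q is not None and q == r
    )


def select_similar(query: MainPart,
                   references: Dict[int, MainPart],
                   top_k: Optional[int] = None,
                   min_similarity: Optional[int] = None) -> List[int]:
    """Return reference response IDs in descending order of similarity (S1207)."""
    scored = [(similarity(query, part), ref_id) for ref_id, part in references.items()]
    scored = [(s, ref_id) for s, ref_id in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    if min_similarity is not None:                # lower-limit criterion
        scored = [(s, ref_id) for s, ref_id in scored if s >= min_similarity]
    if top_k is not None:                         # fixed selection count
        scored = scored[:top_k]
    return [ref_id for _, ref_id in scored]
```

The returned list plays the role of the similar format question sentence set held in the storage unit 908.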

Next, the response format element correlation degree calculation process (S805) is described in detail. FIG. 13 shows the response format element correlation degree calculation process flow. The reference response IDs of the similar format question sentences stored in the similar format question sentence set storage unit 908 are read, and the reference answer sentence formalized word string corresponding to each reference response ID is read from the reference response example format storage unit 104. From that reference answer sentence formalized word string, each pair of two consecutive words is extracted in order and used as a response format element. This is done for every similar format question sentence stored in the similar format question sentence set storage unit 908 (S1301). Two-word pairs that duplicate an earlier pair may be omitted. A response format element is a formalized word string in a unit smaller than the answer sentence formalized word string; in this example it consists of two words. The following processing is then repeated for each extracted response format element (S1302).

First, the reference response example format storage unit 104 is searched for reference answer sentence formalized word strings that contain the response format element, and their reference response IDs are stored in the response format element-containing answer sentence set storage unit 910 as the response format element-containing answer sentence set (S1303).
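Steps S1301 and S1303 can be sketched in Python as follows. The sketch assumes formalized answer word strings are lists of strings keyed by reference response ID; the function names and the tuple representation of a response format element are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Element = Tuple[str, ...]  # a response format element (two consecutive words by default)


def response_format_elements(formalized_answer: List[str], n: int = 2) -> List[Element]:
    """Extract consecutive n-word elements in order, omitting duplicates (S1301)."""
    grams = (
        tuple(formalized_answer[i:i + n])
        for i in range(len(formalized_answer) - n + 1)
    )
    return list(dict.fromkeys(grams))  # dict preserves first-seen order


def contains(formalized_answer: List[str], element: Element) -> bool:
    """True when the element occurs as a contiguous run in the word string."""
    n = len(element)
    return any(
        tuple(formalized_answer[i:i + n]) == element
        for i in range(len(formalized_answer) - n + 1)
    )


def answers_containing(element: Element,
                       formalized_answers: Dict[int, List[str]]) -> Set[int]:
    """Reference response IDs whose answer word string contains the element (S1303)."""
    return {
        ref_id
        for ref_id, words in formalized_answers.items()
        if contains(words, element)
    }
```

The set returned by `answers_containing` corresponds to the response format element-containing answer sentence set stored in unit 910 for one element.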

Next, the degree of correlation between the response format element and the similar format question sentences is obtained by a chi-square test. The formula for the chi-square value in this case is as follows.

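$$\chi^2 = \frac{n\left(|A \cap B|\,|\bar{A} \cap \bar{B}| - |\bar{A} \cap B|\,|A \cap \bar{B}|\right)^2}{|A|\,|\bar{A}|\,|B|\,|\bar{B}|}$$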

Here n is the total number of reference response examples, A is the set of reference response examples whose question is a similar format question sentence, and B is the set of reference response examples whose answer is a response format element-containing answer sentence. The correlation between the two is obtained from the frequency with which a similar format question sentence and a response format element-containing answer sentence co-occur.

In the processing, the number of elements of each of the sets in the formula is calculated first (S1304). The number of reference response IDs in the reference response example format storage unit 104 is counted as the total number of reference response examples n. For the terms of the denominator: the number of reference response IDs in the similar format question sentence set storage unit 908 is counted as the number of elements of set A and used as the first denominator term; the number of elements of set A is subtracted from n to obtain the number of elements of the complement of A, used as the second denominator term; the number of reference response IDs in the response format element-containing answer sentence set storage unit 910 is counted as the number of elements of set B and used as the third denominator term; and the number of elements of set B is subtracted from n to obtain the number of elements of the complement of B, used as the fourth denominator term. For the terms inside the parentheses of the numerator: the number of reference response IDs included in both the similar format question sentence set storage unit 908 and the response format element-containing answer sentence set storage unit 910 is counted as the number of elements of the intersection of A and B and used as the first term; the number of reference response IDs included in neither storage unit is counted as the number of elements of the intersection of the complement of A and the complement of B and used as the second term; the number of reference response IDs not included in storage unit 908 but included in storage unit 910 is counted as the number of elements of the intersection of the complement of A and B and used as the third term; and the number of reference response IDs included in storage unit 908 but not included in storage unit 910 is counted as the number of elements of the intersection of A and the complement of B and used as the fourth term.

Next, the chi-square value is calculated from the number of elements of each set and the total number of reference response examples (S1305). First, the first, second, third, and fourth denominator term values are multiplied together to obtain the denominator value. Next, the first and second in-parentheses term values of the numerator are multiplied to obtain the leading term, the third and fourth in-parentheses term values are multiplied to obtain the trailing term, the difference between the leading term and the trailing term is taken, and the square of that difference is multiplied by the total number n of reference response examples to obtain the numerator value. Finally, the numerator value is divided by the denominator value to give the chi-square value. The square root of the chi-square value is then calculated and stored, in association with the response format element, as the question format correlation degree (S1306). This processing is performed for every response format element (S1307).
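As an illustrative sketch only, the calculation in S1304 to S1306 can be written in Python as follows. The function name, the representation of the storage units as Python sets of reference response IDs, and the handling of a zero denominator are assumptions made for this illustration and are not part of the disclosure.

```python
import math


def question_format_correlation(all_ids, set_a, set_b):
    """Return (chi_square, question_format_correlation_degree).

    all_ids: every reference response ID (its size is n)
    set_a:   IDs whose question sentence is a similar-format question sentence
    set_b:   IDs whose answer sentence contains the response format element
    """
    universe = set(all_ids)
    a, b = set(set_a) & universe, set(set_b) & universe
    not_a, not_b = universe - a, universe - b
    n = len(universe)

    # Denominator: |A| * |complement of A| * |B| * |complement of B|
    denominator = len(a) * len(not_a) * len(b) * len(not_b)
    if denominator == 0:
        # Assumption: a degenerate table is given zero correlation.
        return 0.0, 0.0

    # Numerator: n * (|A and B| * |not A and not B| - |not A and B| * |A and not B|)^2
    leading = len(a & b) * len(not_a & not_b)
    trailing = len(not_a & b) * len(a & not_b)
    chi_square = n * (leading - trailing) ** 2 / denominator
    return chi_square, math.sqrt(chi_square)
```

Calling the function once per response format element, with set B rebuilt for each element, would correspond to the loop closed by S1307.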

The above processing generates the response format element correlation degree table 911. FIG. 14 is a diagram showing the response format element correlation degree table. One record is provided for each response format element, and the table stores the response format element 1451, the chi-square value 1452, and the question format correlation degree 1453 in association with one another. In this example the chi-square value is also stored, but it can be omitted.

The example in FIG. 14 shows that, for the question sentence format main part タ_リユウ_ハ_ナニ_デス_カ_<記号,句点,*,*> shown at 1003 in FIG. 10, response format elements 1451 such as タ_リユウ, タ_カラ, and リユウ_ハ have a high formal correlation.

Next, the latter half of the operation shown in FIG. 8, from the related word generation process (S806) to the response sentence output process (S810), is described.

FIG. 15 is a diagram showing the configuration involved in the processing from related word generation to response sentence output. In addition to the question sentence storage unit 902 and the response format element correlation degree table 911 described above, the question answering system has: a related word generation unit 1501 that performs the related word generation process (S806); a related word table 1502 that stores related words that are related in content to the question sentence, together with their content relevance; a related document search unit 1503 that performs the related document search process (S807); a related document storage unit 1504 that stores related documents that are related in content to the question sentence; a sentence score calculation unit 1505 that performs the sentence score calculation process (S808); a sentence score table 1506 that stores, for each related sentence contained in the related documents, the degree of its suitability as a response to the question sentence as a sentence score; a solution candidate extraction unit 1507 that performs the solution candidate extraction process (S809); a solution candidate storage unit 1508 that stores the candidate ranges judged to be response solutions on the basis of the sentence scores; and a response sentence output unit 1509 that performs the response sentence output process (S810).

Next, the related word generation process (S806) is described in detail. FIG. 16 is a diagram showing the related word generation processing flow. First, a plurality of keywords is extracted from the question sentence stored in the question sentence storage unit 902. In this example, verb and adjective keywords, including compound words, are extracted from the question sentence (S1601). Queries are then generated by combining the keywords in turn. In this example, each query is an AND condition over a combination of three keywords. The following processing is repeated for each query (S1602). The query is submitted to a Web search, and a snippet set, that is, the summaries of the search results, is obtained (S1603). The question answering system is connected to the Internet and is configured so that it can search Web sites, HTML documents, and other content through a Web search site. The words contained in the snippet set (content words only; hereinafter called related word candidates) are then extracted, the content relevance of each related word candidate is calculated, and the related words are identified. The formula used in this example to calculate the content relevance is shown below.
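The query generation of S1601 and S1602 can be sketched as follows, purely for illustration. The function name and the literal " AND " joining syntax are assumptions; an actual Web search site may expect a different query syntax.

```python
from itertools import combinations


def build_queries(keywords, size=3):
    """Return one AND query for every combination of `size` distinct keywords."""
    # Remove duplicate keywords while keeping their first-seen order.
    unique = list(dict.fromkeys(keywords))
    # Hypothetical query syntax: keywords joined by the literal operator AND.
    return [" AND ".join(combo) for combo in combinations(unique, size)]
```

With four extracted keywords this yields four three-keyword queries, each of which would then be submitted in S1603.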

[Equation image (Figure 2009113494): formula defining the content relevance T(w_j); not reproduced in this text.]

In the formula, w_j is each word (each related word candidate), q_i is each query, n_i is the number of snippets obtained for query q_i, freq(w_j, i) is the snippet frequency of word w_j in the snippet set obtained for query q_i, and T(w_j) is the content relevance of word w_j.
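The equation itself is an image in the source. Read against the symbol definitions above and the normalization and maximum-taking steps described for S1605 to S1608, it presumably has the following form. This is a reconstruction offered for readability, not the published figure.

```latex
T(w_j) = \max_{i} \frac{\mathrm{freq}(w_j, i)}{n_i}
```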

The following processing is repeated for each related word candidate (S1604). The frequency of the related word candidate in the snippet set is calculated, and this snippet frequency is divided by the number of snippets to give the normalized snippet frequency (S1605). The normalized snippet frequency is obtained for all related word candidates (S1606). Once this processing has been performed for all planned keyword combinations (S1607), the normalized snippet frequencies are compared for each related word candidate, and the largest is taken as the content relevance. If the content relevance is equal to or greater than a predetermined threshold, the related word candidate is judged to be a related word, and the related word and its relevance are stored in the related word table 1502 (S1608). A form in which all related word candidates are treated as related words, without the threshold judgment, is also effective.
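A minimal sketch of S1604 to S1608 follows. It assumes that each snippet has already been reduced to a list of content words and that the snippet frequency counts the snippets containing a word rather than the total occurrences of the word; the text does not settle this point. The function name and data layout are hypothetical.

```python
from collections import Counter


def related_words(snippets_per_query, threshold=0.0):
    """Map each related word to its content relevance.

    snippets_per_query: one entry per query; each entry is a list of
    snippets, and each snippet is a list of content words.
    """
    relevance = {}
    for snippets in snippets_per_query:
        if not snippets:
            continue  # a query with no snippets contributes nothing
        # Assumption: count each word at most once per snippet.
        frequency = Counter(word for snippet in snippets for word in set(snippet))
        for word, count in frequency.items():
            normalized = count / len(snippets)  # normalized snippet frequency
            # The content relevance is the maximum over all queries.
            relevance[word] = max(relevance.get(word, 0.0), normalized)
    # Keep only candidates at or above the threshold (S1608).
    return {word: t for word, t in relevance.items() if t >= threshold}
```

Leaving `threshold` at 0.0 corresponds to the variant in which every related word candidate is treated as a related word.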

FIG. 17 is a diagram showing a configuration example of the related word table. One record is provided for each related word, and the table stores the related word 1751 and the relevance 1752 in association with each other.

The example in FIG. 17 shows that, for the question sentence shown at 1001 in FIG. 10, "What is the reason that the 'Gusuku Sites and Related Properties of the Kingdom of Ryukyu' were registered as a World Heritage site?", related words 1751 such as 2000, Okinawa, and culture are highly related in content.

Next, the related document search process (S807) is described in detail. FIG. 18 is a diagram showing the related document search processing flow. First, a plurality of keywords is extracted from the question sentence stored in the question sentence storage unit 902. In this example, verb and adjective keywords, including compound words, are extracted from the question sentence (S1801). Queries are then generated by combining the keywords in turn. In this example, each query is an AND condition over a combination of a predetermined number of keywords. The following processing is repeated for each query (S1802). The query is submitted to a Web search, and the URL of each document in the search results is obtained (S1803). The HTML document is downloaded from each URL (S1804) and converted to plain text to form a related document (S1805). When the predetermined keyword combinations have been processed (S1806), each related sentence contained in the related documents converted to plain text is assigned a sentence number that runs through the whole of the related document storage unit 1504, and the sentences are stored in order (S1807).

Next, the sentence score calculation process (S808) is described. In this process, the suitability of each related sentence as a response is judged, taking into account both its content relevance and its formal correlation with respect to the question sentence.

FIG. 19 is a diagram showing the sentence score calculation processing flow. The related sentences contained in each related document stored in the related document storage unit 1504 are identified in order, and the following processing is repeated for each related sentence (S1901). The content evaluation term calculation process (S1902) performs an evaluation based on content relevance, and the format evaluation term calculation process (S1903) performs an evaluation based on formal correlation. The formula for the sentence score, which serves as the overall evaluation index, is shown here.

[Equation image (Figure 2009113494): formula defining the sentence score Score(S_i); not reproduced in this text.]

In the formula, S_i is each related sentence, w_ij is each word contained in related sentence S_i, n is the number of distinct words w_ij in S_i, b_ik is each response format element (in this example, a pair of formalized words) contained in S_i, and m is the number of distinct response format elements b_ik in S_i. As before, T is the relevance of a related word, the square root of the chi-square value is the question format correlation degree of a response format element, and Score(S_i) is the sentence score of related sentence S_i.

To obtain an accurate suitability, the evaluation value must be divided by the sentence length so that the density of words and response format elements in the related sentence is taken into account. The number of words could simply be used as the sentence length, but in this example the logarithm of the number of words is used instead: short sentences are often inappropriate as answers, and using the logarithm lowers their suitability.

Accordingly, the logarithm of the length of the related sentence (1 + the number of words in the related sentence) is calculated (S1904), the content evaluation term and the format evaluation term are multiplied together, and the product is divided by the logarithm of the length of the related sentence to obtain the sentence score (S1905).
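The sentence score equation is an image in the source. From the symbol definitions and the steps S1902 to S1905, it presumably has the following form, where α is the evaluation weight between 0 and 1 used in the content and format evaluation term calculations. The symbol |S_i| for the number of words in the related sentence is introduced here for illustration only, and the whole expression is a reconstruction rather than the published figure.

```latex
\mathrm{Score}(S_i) = \frac{1}{\log\left(1 + |S_i|\right)}
  \left( \sum_{j=1}^{n} T(w_{ij}) \right)^{\alpha}
  \left( \sum_{k=1}^{m} \sqrt{\chi^{2}(b_{ik})} \right)^{1-\alpha}
```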

The content evaluation term calculation process (S1902) and the format evaluation term calculation process (S1903) mentioned above are now described in detail.

FIG. 20 is a diagram showing the content evaluation term calculation processing flow. First, the words (content words only) contained in the related sentence are identified (S2001), and the relevance of each word as a related word is obtained from the related word table 1502. The relevances of all the words are then added to obtain the relevance sum (S2002). The relevance sum is further raised to the power α to give the content evaluation term value (S2003). The constant α used in this exponentiation takes a value from 0 to 1 and represents the weight of the content evaluation within the overall evaluation; the larger the value, the greater the weight given to content. For example, a value of 0 yields a sentence score based on the format evaluation alone, and a value of 1 yields a sentence score based on the content evaluation alone. The evaluation weight α may be set in advance in the question answering system and read from a stored value, or it may be set by the operator when a response sentence is generated for a question sentence. In either case, the system has an evaluation weight input unit for inputting the evaluation weight value and an evaluation weight storage unit for storing it, and this processing reads the evaluation weight value and uses it as α in the exponentiation.

FIG. 21 is a diagram showing the format evaluation term calculation processing flow. The related sentence is subjected to the answer sentence formalization process (FIG. 6) in the same manner as described above to obtain a related sentence formalized word string (S2101). Consecutive pairs of words are then extracted in order from the related sentence formalized word string and used as response format elements (S2102). The question format correlation degree of each response format element is obtained from the response format element correlation degree table 911, the question format correlation degrees of all the response format elements are added, and the question format correlation degree sum is obtained (S2103). This sum is then raised to the power 1 - α (the format weight) to give the format evaluation term value (S2104).
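The content evaluation term, the format evaluation term, and the length normalization of S1902 to S1905 can be combined in a short sketch. The function name, the dictionary representation of the related word table 1502 and of the response format element correlation degree table 911, the joining of word pairs with an underscore, and the zero-length guard are all assumptions made for illustration.

```python
import math


def sentence_score(content_words, formalized_words, word_count,
                   relevance, correlation, alpha=0.5):
    """Score one related sentence.

    content_words:    content words of the sentence
    formalized_words: the related sentence formalized word string, as a list
    word_count:       number of words in the related sentence
    relevance:        dict, related word -> content relevance T
    correlation:      dict, response format element -> question format
                      correlation degree (square root of chi-square)
    """
    # Content evaluation term: sum of T over distinct words, raised to alpha.
    content_sum = sum(relevance.get(w, 0.0) for w in set(content_words))

    # Response format elements: distinct consecutive word pairs.
    # Assumption: a pair is keyed as "first_second".
    elements = {f"{a}_{b}" for a, b in zip(formalized_words, formalized_words[1:])}
    format_sum = sum(correlation.get(e, 0.0) for e in elements)

    # Sentence length is log(1 + number of words).
    length = math.log(1 + word_count)
    if length == 0:
        return 0.0  # assumption: an empty sentence scores zero
    return (content_sum ** alpha) * (format_sum ** (1 - alpha)) / length
```

Setting `alpha` to 0 or 1 reproduces the format-only and content-only scores described for the evaluation weight, because Python evaluates a float raised to the power 0 as 1.0.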

As shown in FIG. 19, the sentence score thus obtained is stored in the sentence score table 1506 in association with the sentence number (S1906), and the process ends once all related sentences have been processed (S1907).

A sentence score is thereby obtained for each related sentence and stored in the sentence score table 1506. FIG. 22 is a diagram showing an example of the distribution of sentence scores. As the figure shows, there are regions in which the sentence scores are high. The solution candidate extraction process (S809) is then performed to extract these regions as solution candidates 2201, 2202, and 2203.

FIG. 23 is a diagram showing the solution candidate extraction processing flow. The following processing is repeated for each related sentence, following the sentence numbers contained in the sentence score table 1506 (S2301). It is determined whether the sentence score is a local maximum (S2302); a score is treated as a local maximum when it is larger than the sentence scores of the preceding and following sentences. If it is a local maximum, the sentence number of the related sentence is stored, in association with the local maximum sentence score, as both the first sentence number and the last sentence number of a solution candidate in the solution candidate storage unit 1508 (S2303). The solution candidate storage unit 1508 is configured to store, for each solution candidate, the first sentence number and the last sentence number that delimit the candidate range, together with the local maximum sentence score. Next, moving backward one sentence at a time, it is determined whether the sentence score of the preceding related sentence is at least 1/2 of the local maximum; if so, the first sentence number of the solution candidate is updated to the sentence number of that preceding sentence (S2304), and if it is smaller than 1/2, this step ends at that point. Likewise, moving forward one sentence at a time, it is determined whether the sentence score of the following related sentence is at least 1/2 of the local maximum; if so, the last sentence number of the solution candidate is updated to the sentence number of that following sentence (S2305), and if it is smaller than 1/2, this step ends at that point. The above processing is performed for all sentences (S2306). The value 1/2 is an example of a predetermined ratio.
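The peak detection and range expansion of S2301 to S2306 can be sketched as follows. The function name, the use of list indices as sentence numbers, and the treatment of the first and last sentences (a missing neighbor counts as lower) are assumptions for illustration. The final sort by peak score anticipates the order in which candidates are selected for output.

```python
def extract_candidates(scores, ratio=0.5):
    """Return (first, last, peak_score) tuples, highest peak first.

    scores: sentence scores indexed by sentence number.
    ratio:  the predetermined ratio (1/2 in the example above).
    """
    candidates = []
    for i, peak in enumerate(scores):
        # Assumption: a missing neighbor at either end counts as lower.
        left = scores[i - 1] if i > 0 else float("-inf")
        right = scores[i + 1] if i + 1 < len(scores) else float("-inf")
        if not (peak > left and peak > right):
            continue  # not a local maximum
        first = last = i
        # Extend backward while the preceding score is at least ratio * peak.
        while first > 0 and scores[first - 1] >= ratio * peak:
            first -= 1
        # Extend forward while the following score is at least ratio * peak.
        while last + 1 < len(scores) and scores[last + 1] >= ratio * peak:
            last += 1
        candidates.append((first, last, peak))
    return sorted(candidates, key=lambda c: c[2], reverse=True)
```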

Finally, the response sentence output process (S810) is described in detail. FIG. 24 is a diagram showing the response sentence output processing flow. Solution candidates are selected in descending order of their local maximum sentence score (S2401), and the related sentences from the first sentence number to the last sentence number of the solution candidate are obtained from the related document storage unit 1504 (S2402). This group of related sentences is then output as a response sentence (S2403). If the number of response sentences to be output has been specified, the processing is repeated for that number of response sentences (S2404). If the amount of response text to be output has been specified, the processing is repeated until that amount is reached. Alternatively, if response sentences are to be switched at the operator's instruction, the above processing is performed in accordance with the instruction.

FIG. 25 is a diagram showing examples of correct and incorrect answers. Correct answer 2501 and correct answer 2502 are examples of response sentences obtained for the question sentence 1001 in FIG. 10. Incorrect answer 2503 is an example that did not become a response sentence because its format evaluation was low.

As FIG. 25 shows, taking formal suitability into account also improves suitability in terms of content.

Embodiment 2.
In the form described above, seven words, consisting of the interrogative and the three words before and after it, were extracted as the question sentence format main part, but another number of words may be used.

The three preceding words are an example in which the predetermined number of words before the interrogative is 3; it may instead be 1, 2, 4, 5, or 6 or more. Similarly, the three following words are an example in which the predetermined number of words after the interrogative is 3; it may instead be 1, 2, 4, 5, or 6 or more. The predetermined numbers of preceding and following words need not be equal and may differ.

Embodiment 3.
In the form described above, two words were extracted as a response format element, but another number of words may be used, such as three, four, or five or more consecutive words.

A response format element is effective as long as its number of words is smaller than the number of words in the question sentence format main part.

Embodiment 4.
In the example described above, a Web search was performed using the hypertext system provided on the Internet as the information source, and snippets and related documents were obtained as search results. In other words, the World Wide Web (WWW) was the search target.

However, the present invention is also effective when the search targets another database. An effective response is obtained by substituting the summaries and documents obtained from the other database for the snippets and related documents described above.

Embodiment 5.
In the similar-format question sentence extraction process described above (FIG. 12), the number of matching words at each position was used as the similarity, but the similarity between question sentence format main parts may be calculated by another criterion.

For example, it is also effective to calculate the edit distance between the question sentence format main parts treated as word strings and to use that edit distance as the similarity. It is likewise effective to calculate the edit distance between the word strings preceding the interrogatives, calculate the edit distance between the word strings following the interrogatives, and use the sum of the two edit distances as the similarity.

The edit distance is a value indicating how much two character strings differ; it is calculated as the minimum number of operations (deletions, insertions, and substitutions of characters) required to transform one string into the other. In this example, the edit distance can be calculated by handling the formalized word strings described above in place of character strings.
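For illustration, the word-level edit distance and the split variant described above can be sketched with the standard dynamic-programming recurrence. The function names and the index arguments marking the position of the interrogative are hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of deletions, insertions, and substitutions turning a into b.

    a and b are sequences of formalized words (plain strings also work).
    """
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def split_edit_distance(a, b, wh_a, wh_b):
    """Sum of the edit distances before and after the interrogatives.

    wh_a and wh_b are the (hypothetical) indices of the interrogative in a and b.
    """
    return (edit_distance(a[:wh_a], b[:wh_b])
            + edit_distance(a[wh_a + 1:], b[wh_b + 1:]))
```

A smaller distance indicates closer word strings, so any threshold comparison built on this value would presumably run in the opposite direction from one built on the match count.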

Embodiment 6.
In the similar-format question sentence extraction process described above (FIG. 12), the number of matching words at each position was used as the similarity, but it is also effective to weight the matches by position.

For example, a similarity that emphasizes the vicinity of the interrogative can be obtained by adding a large value for a match at a position close to the interrogative and a small value for a match at a position far from it.

Conversely, a similarity that de-emphasizes the vicinity of the interrogative can be obtained by adding a small value for a match at a position close to the interrogative and a large value for a match at a position far from it.
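The two weighting schemes can be sketched with a single function that takes the weight as a function of distance from the interrogative. The function names and the particular weight formulas are assumptions chosen only to illustrate the idea; the text does not specify concrete weights or whether the interrogative position itself is counted.

```python
def weighted_similarity(a, b, wh_index, weight):
    """Sum weight(distance) over positions where the two main parts match.

    a, b:     equal-length word lists aligned so that the interrogative
              sits at wh_index in both
    weight:   function mapping distance from the interrogative to a value
    """
    return sum(
        weight(abs(i - wh_index))
        for i, (x, y) in enumerate(zip(a, b))
        if x == y
    )


def near_heavy(distance):
    """Hypothetical weight that emphasizes positions near the interrogative."""
    return 1.0 / (1 + distance)


def far_heavy(distance):
    """Hypothetical weight that emphasizes positions far from the interrogative."""
    return 1.0 + distance
```

Passing `lambda distance: 1` as the weight reduces the function to the plain match count.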

Embodiment 7.
The similarity may also be weighted according to the type of word. For example, it is effective to set a weight for each word type determined in S701 to S704 of FIG. 7 and, when words match, to add the corresponding weight to obtain the similarity. One possibility is to set a large weight for words that are likely to be the focus of the question.

Embodiment 8.
In the embodiment described above, the response format element correlation degree calculation process (FIG. 13) used the chi-square test to calculate the question format correlation degree of a response format element with respect to the question sentence, but the question format correlation degree can also be calculated by another criterion.

With the reference response examples as the universal set, the number of elements can be counted both for the group of reference response IDs contained in the reference response example format storage unit 104, taken as set A, and for the group of reference response IDs contained in the response-format-element-containing answer sentence set storage unit 910, taken as set B. A Dice coefficient, for example, can therefore be used as the question format correlation degree.

In that case, the number of reference response IDs contained in the similar-format question sentence set storage unit 908 is counted as the number of elements of set A and taken as the first denominator term value, the number of reference response IDs contained in the response-format-element-containing answer sentence set storage unit 910 is counted as the number of elements of set B and taken as the second denominator term value, and the two denominator term values are added to obtain the denominator value. The number of reference response IDs contained in both the similar-format question sentence set storage unit 908 and the reference response example format storage unit 104 is counted as the number of elements of the intersection of set A and set B, and this count is multiplied by 2 to give the numerator value. The Dice coefficient is then calculated by dividing the numerator value by the denominator value and is used as the question format correlation degree.
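As a brief illustration, the Dice coefficient described above can be computed from two sets of reference response IDs. The function name and the value returned when both sets are empty are assumptions.

```python
def dice_coefficient(set_a, set_b):
    """Return 2 * |A and B| / (|A| + |B|) for two sets of reference response IDs."""
    a, b = set(set_a), set(set_b)
    denominator = len(a) + len(b)
    if denominator == 0:
        return 0.0  # assumption: two empty sets are given zero correlation
    return 2 * len(a & b) / denominator
```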

Embodiment 9.
Mutual information can also be used as the question format correlation degree.

In that case, the number of reference response IDs contained in the similar-format question sentence set storage unit 908 is counted as the number of elements of set A and taken as the first denominator term value, the number of reference response IDs contained in the response-format-element-containing answer sentence set storage unit 910 is counted as the number of elements of set B and taken as the second denominator term value, and the two denominator term values are multiplied to obtain the denominator value. The number of reference response IDs contained in both the similar-format question sentence set storage unit 908 and the reference response example format storage unit 104 is counted as the number of elements of the intersection of set A and set B, and this count is multiplied by the total number of reference response examples to give the numerator value. The numerator value is divided by the denominator value, and the base-2 logarithm of the quotient is calculated to obtain the mutual information, which is used as the question format correlation degree.
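The mutual information calculation described above can be sketched as follows. The function name and the value returned when the intersection is empty (where the logarithm is undefined) are assumptions made for this illustration.

```python
import math


def mutual_information(n, set_a, set_b):
    """Return log2(n * |A and B| / (|A| * |B|)).

    n: total number of reference response examples.
    """
    a, b = set(set_a), set(set_b)
    joint = len(a & b)
    if joint == 0:
        # Assumption: no co-occurrence is reported as negative infinity.
        return float("-inf")
    return math.log2(n * joint / (len(a) * len(b)))
```

A value of 0 means the two sets co-occur exactly as often as independence would predict, and positive values mean they co-occur more often than that.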

Each of these methods of calculating the question format correlation degree calculates the degree to which a response format element is related, as a format, to the similar-format question sentences, based on the probability that a reference answer sentence formalized word string containing the response format element is paired with a similar-format question sentence.

Embodiment 10.
A question answering system can be realized using a Web search engine. FIG. 26 shows the configuration of a question answering system that uses a Web search engine. The question answering system includes a reference sentence formatting unit 2601, a reference sentence storage unit 2602, an input question sentence formatting unit 2603, a reference answer sentence extraction unit 2604, a Web search request unit 2605, a description style evaluation unit 2606, a relevance evaluation unit 2607, a score processing unit 2608, and an answer sentence output unit 2609.

FIG. 27 shows the processing flow of the question answering system that uses a Web search engine. In the reference sentence formalization process (S2701), the reference sentence formatting unit 2601 takes a pair of a reference question sentence and a reference answer sentence as a reference sentence and performs a formalization process that generalizes the description style of at least the reference question sentence. The reference sentence storage unit 2602 stores the formalized reference sentence produced by the reference sentence formatting unit 2601. In the input question sentence formalization process (S2702), the input question sentence formatting unit 2603 performs a formalization process that generalizes the description style of the input question sentence. In the reference answer sentence extraction process (S2703), the reference answer sentence extraction unit 2604 searches for a formalized reference sentence whose format is similar to that of the formalized input question sentence produced by the input question sentence formatting unit 2603, and extracts the reference answer sentence contained in that formalized reference sentence from the reference sentence storage unit 2602.

In the Web search request process (S2704), the Web search request unit 2605 requests the Web search engine to search for documents on the Web using the input question sentence as the condition, and obtains retrieved Web documents as the result. In the description style evaluation process (S2705), the description style evaluation unit 2606 evaluates the conformity of description style between the reference answer sentence and each retrieved Web document. In the relevance evaluation process (S2706), the relevance evaluation unit 2607 evaluates the relevance in content between each retrieved Web document and the input question sentence.

In the score processing (S2707), the score processing unit 2608 scores those retrieved Web documents that the description style evaluation unit has evaluated as conforming in description style to the reference answer sentence and that the relevance evaluation unit has evaluated as related to the content of the input question sentence. Then, in the answer sentence output process (S2708), the answer sentence output unit 2609 obtains the answer sentence for the input question sentence from the score processing unit 2608 and outputs it.
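The flow from S2704 to S2708 can be sketched as below. This is only an illustration under stated assumptions: the search and the two evaluators are passed in as hypothetical callables, a document "passes" an evaluation when its value is positive, and the two values are combined by multiplication. The text specifies none of these details.

```python
from typing import Callable, Iterable, Optional


def select_answer(
    question: str,
    reference_answer: str,
    search_web: Callable[[str], Iterable[str]],
    style_fit: Callable[[str, str], float],
    relevance: Callable[[str, str], float],
) -> Optional[str]:
    """Return the best-scoring retrieved document, or None if none qualifies."""
    best_doc: Optional[str] = None
    best_score = 0.0
    for doc in search_web(question):               # S2704: Web search request
        style = style_fit(reference_answer, doc)   # S2705: description style
        related = relevance(doc, question)         # S2706: content relevance
        if style <= 0.0 or related <= 0.0:
            continue  # only documents passing both evaluations are scored
        score = style * related                    # S2707: assumed combination
        if score > best_score:
            best_doc, best_score = doc, score
    return best_doc                                # S2708: output the answer


# Toy stand-ins for the search engine and evaluators, purely illustrative.
def toy_search(question: str) -> list[str]:
    return ["buy umbrellas now", "rain falls because clouds cool"]


def toy_style(reference_answer: str, doc: str) -> float:
    return 1.0 if "because" in reference_answer and "because" in doc else 0.0


def toy_relevance(doc: str, question: str) -> float:
    return float(len(set(doc.split()) & set(question.split())))


answer = select_answer("why does rain fall", "it happens because of X",
                       toy_search, toy_style, toy_relevance)
```

Passing the evaluators in as parameters keeps the sketch independent of any particular search engine or scoring method.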

The question answering system is a computer, and each element can execute its processing by means of a program. The program can also be stored on a storage medium so that the computer can read it from the storage medium.

The hardware configuration of the question answering system is described next. FIG. 28 shows the hardware configuration of the question answering system. An arithmetic device 2801, a data storage device 2802, a memory 2803, a communication interface 2804, a data input device 2805, and a data output device 2806 are connected to a bus. The data storage device 2802 is, for example, a ROM (Read Only Memory) or a hard disk. The memory 2803 is typically a RAM (Random Access Memory). The program is normally stored in the data storage device 2802 and, once loaded into the memory 2803, is read sequentially into the arithmetic device 2801 and executed. The communication interface 2804 is used for communication over a network. The data input device 2805 is used for data input. The data output device 2806 is used for data output.

A diagram showing the configuration for the reference response example preparation process.
A diagram showing the reference response example preparation process flow.
A diagram showing a configuration example of the reference response example storage unit.
A diagram showing the reference response example formalization process flow.
A diagram showing a configuration example of the reference response example format storage unit.
A diagram showing the question sentence formalization process / answer sentence formalization process flow.
A diagram showing the question response characteristic determination process flow.
A diagram showing the question response process flow.
A diagram showing the configuration for the processing from question sentence input to response format element correlation degree calculation.
A diagram showing an example comparison of a question sentence and a reference question sentence.
A diagram showing the question sentence format main part extraction process flow.
A diagram showing the similar-format question sentence extraction process flow.
A diagram showing the response format element correlation degree calculation process flow.
A diagram showing the response format element correlation degree table.
A diagram showing the configuration for the processing from related word generation to response sentence output.
A diagram showing the related word generation process flow.
A diagram showing a configuration example of the related word table.
A diagram showing the related document search process flow.
A diagram showing the sentence score calculation process flow.
A diagram showing the content evaluation term calculation process flow.
A diagram showing the format evaluation term calculation process flow.
A diagram showing an example distribution of sentence scores.
A diagram showing the solution candidate extraction process flow.
A diagram showing the response sentence output process flow.
A diagram showing examples of correct and incorrect answers.
A diagram showing the configuration of a question answering system that uses a Web search engine.
A diagram showing the processing flow of a question answering system that uses a Web search engine.
A diagram showing the hardware configuration of the question answering system.

Explanation of Symbols

101 reference response example generation unit, 102 reference response example storage unit, 103 reference response example formatting unit, 104 reference response example format storage unit, 901 question sentence input unit, 902 question sentence storage unit, 903 question sentence formatting unit, 904 question sentence format storage unit, 905 question sentence format main part extraction unit, 906 question sentence format main part storage unit, 907 similar-format question sentence extraction unit, 908 similar-format question sentence set storage unit, 909 response format element correlation degree calculation unit, 910 response-format-element-containing answer sentence set storage unit, 911 response format element correlation degree table, 1501 related word generation unit, 1502 related word table, 1503 related document search unit, 1504 related document storage unit, 1505 sentence score calculation unit, 1506 sentence score table, 1507 solution candidate extraction unit, 1508 solution candidate storage unit, 1509 response sentence output unit, 2601 reference sentence formatting unit, 2602 reference sentence storage unit, 2603 input question sentence formatting unit, 2604 reference answer sentence extraction unit, 2605 Web search request unit, 2606 description style evaluation unit, 2607 relevance evaluation unit, 2608 score processing unit, 2609 answer sentence output unit.

Claims (13)

A question answering system that receives a question sentence, extracts a sentence suitable as an answer to the question sentence from a group of documents to be searched, and outputs it as a response sentence, the question answering system comprising the following elements:
(1) a question sentence input unit that receives the question sentence;
(2) a question sentence formatting unit that converts the input question sentence into a question sentence formalized word string by a formalization process that divides a sentence into words, analyzes the part-of-speech type and surface expression of each word, determines that a word relates to the question-response format when the word is a function word, an interrogative, or a predetermined word likely to be the focus of a question, and otherwise determines that the word relates to the question-response content, converts words relating to the question-response format into their part-of-speech types and words relating to the question-response content into their surface expressions, and forms a formalized word string whose units are the converted part-of-speech types or surface expressions;
(3) a question sentence format main part extraction unit that extracts a question sentence format main part from the question sentence formalized word string by a question sentence format main part extraction process that extracts a formalized word string of a predetermined number of words containing the interrogative at a predetermined position;
(4) a reference response example storage unit that stores a plurality of reference response examples, each consisting of a pair of a reference question sentence and a reference answer sentence;
(5) a reference response example formatting unit that, for each reference response example in the reference response example storage unit, converts the reference question sentence into a reference question sentence formalized word string by the formalization process and converts the reference answer sentence into a reference answer sentence formalized word string by the formalization process;
(6) a response example format storage unit that stores a plurality of pairs of converted reference question sentence formalized word strings and reference answer sentence formalized word strings, each in association with a reference response ID;
(7) a similar-format question sentence extraction unit that, for each reference question sentence formalized word string in the response example format storage unit, extracts a reference question sentence format main part by the question sentence format main part extraction process, compares it with the question sentence format main part, and, when the comparison result is identical or similar, identifies the reference response ID of the reference question sentence formalized word string from which that reference question sentence format main part was extracted as a reference response ID of a similar-format question sentence whose format is similar to that of the input question sentence;
(8) a similar-format question sentence set storage unit that stores the group of reference response IDs of the identified similar-format question sentences as a similar-format question sentence set;
(9) a response format element correlation degree calculation unit that acquires, from the response example format storage unit, the reference answer sentence formalized word string corresponding to the reference response ID of each similar-format question sentence; sequentially extracts, from the acquired reference answer sentence formalized word strings, response format elements, each being a formalized word string of a predetermined number of words smaller than the number of words in the formalized word string; for each response format element, searches whether the response format element is contained in each reference answer sentence formalized word string in the response example format storage unit, and stores the group of reference response IDs of the reference answer sentence formalized word strings containing the response format element as a response-format-element-containing answer sentence set; and, using at least the number of reference response IDs contained in both the similar-format question sentence set and the response-format-element-containing answer sentence set, the number of reference response IDs contained in the similar-format question sentence set, and the number of reference response IDs contained in the response-format-element-containing answer sentence set, calculates a question format correlation degree indicating the degree to which the response format element is related, as a format, to the similar-format question sentences, based on the probability that a reference answer sentence formalized word string containing the response format element is paired with a similar-format question sentence;
(10) a response format element correlation degree table that stores the question format correlation degree calculated for each response format element;
(11) a related word generation unit that extracts keywords, which are content words, from the question sentence, searches the group of documents to be searched using the keywords as the condition, and, based on the frequency of occurrence of words contained in the retrieved documents, extracts related words that are related in content to the question sentence and calculates a degree of relevance for each related word;
(12) a related word table that stores the degree of relevance calculated for each related word;
(13) a related document search unit that extracts keywords, which are content words, from the question sentence and searches the group of documents to be searched for related documents using the keywords as the condition;
(14) a related document storage unit that stores the retrieved related documents with a sentence number associated with each related sentence contained in the related documents;
(15) a sentence score calculation unit that, for each related sentence, converts the related sentence into a related sentence formalized word string by the formalization process, sequentially extracts the response format elements from the related sentence formalized word string, acquires the question format correlation degree of each response format element from the response format element correlation degree table, further acquires from the related word table the degree of relevance, as a related word, of each word contained in the related sentence, and, based on the acquired question format correlation degrees of the response format elements and degrees of relevance of the words, calculates a sentence score indicating suitability as an answer to the question sentence;
(16) a sentence score table that stores the sentence score of each related sentence in association with its sentence number;
(17) a solution candidate extraction unit that extracts, as solution candidates, the sentence numbers of related sentences whose sentence scores indicate high suitability; and
(18) a response sentence output unit that outputs, as the response sentence, the related sentence identified by the sentence number of a solution candidate.
The question answering system according to claim 1, wherein at least one of "理由" (reason), "方法" (method), "意味" (meaning), and "違い" (difference) is used as the predetermined word likely to be the focus of a question.
The question answering system according to claim 1, wherein the formalization process also determines that a word relates to the question-response format when the word is a predetermined verb or adjective that appears frequently in the reference response examples.
The question answering system according to claim 1, wherein the formalized word string extracted by the question sentence format main part extraction process is a formalized word string of seven words in total, consisting of the interrogative at the center and the three words before and the three words after it.
The question answering system according to claim 1, wherein the similar-format question sentence extraction unit determines similarity only when the interrogative contained in the reference question sentence format main part matches the interrogative contained in the question sentence format main part.
The question answering system according to claim 1, wherein the response format element correlation degree calculation unit uses the square root of the chi-square value as the question format correlation degree indicating the degree to which the response format element is related, as a format, to the similar-format question sentences.
The question answering system according to claim 1, wherein the response format element correlation degree calculation unit uses the Dice coefficient as the question format correlation degree indicating the degree to which the response format element is related, as a format, to the similar-format question sentences.
The question answering system according to claim 1, wherein the response format element correlation degree calculation unit uses mutual information as the question format correlation degree indicating the degree to which the response format element is related, as a format, to the similar-format question sentences.
The question answering system according to claim 1, wherein, for the sentence scores taken consecutively in the order of the related sentences contained in a related document, the solution candidate extraction unit takes as a solution candidate the sentence number of a related sentence whose sentence score is a local maximum.
The question answering system according to claim 9, wherein the solution candidate extraction unit also includes among the solution candidates the sentence numbers of the preceding and following related sentences whose sentence scores exceed a predetermined proportion of the local maximum.
A program for causing a computer to execute the procedures below, the computer serving as a question answering system that receives a question sentence, extracts a sentence suitable as an answer to the question sentence from a group of documents to be searched, and outputs it as a response sentence, and that has:
a reference response example storage unit that stores a plurality of reference response examples, each consisting of a pair of a reference question sentence and a reference answer sentence;
a response example format storage unit for storing a plurality of pairs of reference question sentence formalized word strings and reference answer sentence formalized word strings, each in association with a reference response ID;
a similar-format question sentence set storage unit for storing a group of reference response IDs of similar-format question sentences as a similar-format question sentence set;
a response format element correlation degree table for storing the question format correlation degree calculated for each response format element;
a related word table for storing the degree of relevance calculated for each related word;
a related document storage unit for storing related documents with a sentence number associated with each related sentence contained in the related documents; and
a sentence score table for storing the sentence score of each related sentence in association with its sentence number:
(1) a question sentence input procedure of receiving the question sentence;
(2) a question sentence formalization procedure of converting the input question sentence into a question sentence formalized word string by a formalization process that divides a sentence into words, analyzes the part-of-speech type and surface expression of each word, determines that a word relates to the question-response format when the word is a function word, an interrogative, or a predetermined word likely to be the focus of a question, and otherwise determines that the word relates to the question-response content, converts words relating to the question-response format into their part-of-speech types and words relating to the question-response content into their surface expressions, and forms a formalized word string whose units are the converted part-of-speech types or surface expressions;
(3) a question sentence format main part extraction procedure of extracting a question sentence format main part from the question sentence formalized word string by a question sentence format main part extraction process that extracts a formalized word string of a predetermined number of words containing the interrogative at a predetermined position;
(4) a reference response example formalization procedure of, for each reference response example in the reference response example storage unit, converting the reference question sentence into a reference question sentence formalized word string by the formalization process and converting the reference answer sentence into a reference answer sentence formalized word string by the formalization process;
(5) a similar-format question sentence extraction procedure of, for each reference question sentence formalized word string in the response example format storage unit, extracting a reference question sentence format main part by the question sentence format main part extraction process, comparing it with the question sentence format main part, and, when the comparison result is identical or similar, identifying the reference response ID of the reference question sentence formalized word string from which that reference question sentence format main part was extracted as a reference response ID of a similar-format question sentence whose format is similar to that of the input question sentence;
(6) a response format element correlation degree calculation procedure of acquiring, from the response example format storage unit, the reference answer sentence formalized word string corresponding to the reference response ID of each similar-format question sentence; sequentially extracting, from the acquired reference answer sentence formalized word strings, response format elements, each being a formalized word string of a predetermined number of words smaller than the number of words in the formalized word string; for each response format element, searching whether the response format element is contained in each reference answer sentence formalized word string in the response example format storage unit, and storing the group of reference response IDs of the reference answer sentence formalized word strings containing the response format element as a response-format-element-containing answer sentence set; and, using at least the number of reference response IDs contained in both the similar-format question sentence set and the response-format-element-containing answer sentence set, the number of reference response IDs contained in the similar-format question sentence set, and the number of reference response IDs contained in the response-format-element-containing answer sentence set, calculating a question format correlation degree indicating the degree to which the response format element is related, as a format, to the similar-format question sentences, based on the probability that a reference answer sentence formalized word string containing the response format element is paired with a similar-format question sentence;
(7) a related word generation procedure of extracting keywords, which are content words, from the question sentence, searching the group of documents to be searched using the keywords as the condition, and, based on the frequency of occurrence of words contained in the retrieved documents, extracting related words that are related in content to the question sentence and calculating a degree of relevance for each related word;
(8) a related document search procedure of extracting keywords, which are content words, from the question sentence and searching the group of documents to be searched for related documents using the keywords as the condition;
(9) a sentence score calculation procedure of, for each related sentence, converting the related sentence into a related sentence formalized word string by the formalization process, sequentially extracting the response format elements from the related sentence formalized word string, acquiring the question format correlation degree of each response format element from the response format element correlation degree table, further acquiring from the related word table the degree of relevance, as a related word, of each word contained in the related sentence, and, based on the acquired question format correlation degrees of the response format elements and degrees of relevance of the words, calculating a sentence score indicating suitability as an answer to the question sentence;
(10) a solution candidate extraction procedure of extracting, as solution candidates, the sentence numbers of related sentences whose sentence scores indicate high suitability; and
(11) a response sentence output procedure of outputting, as the response sentence, the related sentence identified by the sentence number of a solution candidate.
A question answering system comprising:
a reference sentence formalization unit that takes a pair of a reference question sentence and a reference answer sentence as a reference sentence, and performs formalization processing that generalizes the description style of at least the reference question sentence of the reference sentence;
a reference sentence storage unit that stores the formalized reference sentence produced by the reference sentence formalization unit;
an input question sentence formalization unit that performs formalization processing that generalizes the description style of an input question sentence;
a reference answer sentence extraction unit that searches for a formalized reference sentence whose format is similar to the formalized input question sentence produced by the input question sentence formalization unit, and extracts the reference answer sentence included in that formalized reference sentence from the reference sentence storage unit;
a description style evaluation unit that evaluates the compatibility of description style between the reference answer sentence and a retrieved Web document, which is a Web document obtained by searching for the input question sentence with a Web search engine;
a relevance evaluation unit that evaluates the relevance of content between the retrieved Web document and the input question sentence;
a score processing unit that scores each retrieved Web document that the description style evaluation unit has evaluated as compatible in description style with the reference answer sentence and that the relevance evaluation unit has evaluated as related to the content of the input question sentence; and
an answer sentence output unit that outputs an answer sentence to the input question sentence based on the score.
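The flow of units in this claim can be sketched in Python as follows. This is only an illustration under stated assumptions: the claim does not say how formalization, format similarity, style compatibility, or relevance are computed, so the function-word list, the `<X>` placeholder, the bigram-overlap style measure, the keyword-overlap relevance measure, the product score, and the 0.8 threshold are all hypothetical choices. English sentences and a fixed list stand in for Japanese text and for live Web search results.

```python
import re
from difflib import SequenceMatcher
from typing import List, Optional, Set, Tuple

# Hypothetical list of words treated as carriers of description style.
FUNCTION_WORDS = {"what", "why", "how", "is", "are", "the", "a", "of",
                  "because", "to", "it"}


def tokenize(sentence: str) -> List[str]:
    return re.findall(r"[a-z']+", sentence.lower())


def formalize(sentence: str) -> List[str]:
    """Generalize style: keep function words, replace the rest with <X>."""
    return [w if w in FUNCTION_WORDS else "<X>" for w in tokenize(sentence)]


def content_words(sentence: str) -> Set[str]:
    return {w for w in tokenize(sentence) if w not in FUNCTION_WORDS}


def bigrams(words: List[str]) -> Set[Tuple[str, str]]:
    return set(zip(words, words[1:]))


def extract_reference_answers(question: str,
                              reference_pairs: List[Tuple[str, str]],
                              threshold: float = 0.8) -> List[str]:
    """Answers whose formalized question resembles the formalized input."""
    fq = formalize(question)
    return [a for q, a in reference_pairs
            if SequenceMatcher(None, fq, formalize(q)).ratio() >= threshold]


def style_fit(candidate: str, reference_answers: List[str]) -> float:
    """Share of the candidate's formalized bigrams seen in reference answers."""
    cand = bigrams(formalize(candidate))
    ref = set().union(*[bigrams(formalize(a)) for a in reference_answers])
    return len(cand & ref) / len(cand) if cand else 0.0


def relevance(candidate: str, question: str) -> float:
    """Share of the question's content words that the candidate mentions."""
    q = content_words(question)
    return len(q & content_words(candidate)) / len(q) if q else 0.0


def answer(question: str, reference_pairs: List[Tuple[str, str]],
           web_sentences: List[str]) -> Optional[str]:
    refs = extract_reference_answers(question, reference_pairs)
    scored = [(style_fit(s, refs) * relevance(s, question), s)
              for s in web_sentences]
    # A zero product means the candidate failed the style or relevance check.
    scored = [(score, s) for score, s in scored if score > 0]
    return max(scored)[1] if scored else None


reference_pairs = [
    ("Why is the ocean salty?", "It is because rivers carry minerals to the sea."),
    ("What is a neutron?", "A neutron is a particle."),
]
web_sentences = [
    "The sky is blue because air scatters blue light.",
    "Blue is a popular color of the sky.",
    "It is because molecules scatter sunlight to the ground.",
]
print(answer("Why is the sky blue?", reference_pairs, web_sentences))
```

Multiplying the two scores is one simple way to require both conditions at once: the third sample sentence matches the reference answer's style exactly but shares no content word with the question, so it is dropped.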
A program for causing a computer serving as a question answering system to execute:
a reference sentence formalization procedure that takes a pair of a reference question sentence and a reference answer sentence as a reference sentence, and performs formalization processing that generalizes the description style of at least the reference question sentence of the reference sentence;
a reference sentence storage procedure that stores the formalized reference sentence produced in the reference sentence formalization procedure;
an input question sentence formalization procedure that performs formalization processing that generalizes the description style of an input question sentence;
a reference answer sentence extraction procedure that searches for a formalized reference sentence whose format is similar to the formalized input question sentence produced in the input question sentence formalization procedure, and extracts the reference answer sentence included in that formalized reference sentence;
a description style evaluation procedure that evaluates the compatibility of description style between the reference answer sentence and a retrieved Web document, which is a Web document obtained by searching for the input question sentence with a Web search engine;
a relevance evaluation procedure that evaluates the relevance of content between the retrieved Web document and the input question sentence;
a score processing procedure that scores each retrieved Web document that the description style evaluation procedure has evaluated as compatible in description style with the reference answer sentence and that the relevance evaluation procedure has evaluated as related to the content of the input question sentence; and
an answer sentence output procedure that outputs an answer sentence to the input question sentence based on the score.
JP2010502807A 2008-03-10 2009-03-09 Question answering system capable of descriptive answers using WWW as information source Expired - Fee Related JP5461388B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2010502807A JP5461388B2 (en) 2008-03-10 2009-03-09 Question answering system capable of descriptive answers using WWW as information source

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
JP2008060292 2008-03-10
JP2008060292 2008-03-10
PCT/JP2009/054425 WO2009113494A1 (en) 2008-03-10 2009-03-09 Question and answer system which can provide descriptive answer using www as source of information
JP2010502807A JP5461388B2 (en) 2008-03-10 2009-03-09 Question answering system capable of descriptive answers using WWW as information source

Publications (2)

Publication Number Publication Date
JPWO2009113494A1 (en) 2011-07-21
JP5461388B2 (en) 2014-04-02

Family

ID=41065161

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2010502807A Expired - Fee Related JP5461388B2 (en) 2008-03-10 2009-03-09 Question answering system capable of descriptive answers using WWW as information source

Country Status (2)

Country Link
JP (1) JP5461388B2 (en)
WO (1) WO2009113494A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7081454B2 (en) * 2018-11-15 2022-06-07 日本電信電話株式会社 Processing equipment, processing method, and processing program
CN109858626B (en) * 2019-01-23 2021-08-03 腾讯科技(深圳)有限公司 Knowledge base construction method and device
CN110727765B (en) * 2019-10-10 2021-12-07 合肥工业大学 Problem classification method and system based on multi-attention machine mechanism and storage medium
CN110851579B (en) * 2019-11-06 2023-03-10 杨鑫蛟 User intention identification method, system, mobile terminal and storage medium
CN111144098B (en) * 2019-12-26 2023-05-30 支付宝(杭州)信息技术有限公司 Recall method and device for extended question

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002132811A (en) * 2000-10-19 2002-05-10 Nippon Telegr & Teleph Corp <Ntt> Method and system for answering question and recording medium with recorded question answering program
JP4162223B2 (en) * 2003-05-30 2008-10-08 日本電信電話株式会社 Natural sentence search device, method and program thereof
JP4116599B2 (en) * 2004-07-26 2008-07-09 日本電信電話株式会社 Question answering system, method and program

Also Published As

Publication number Publication date
WO2009113494A1 (en) 2009-09-17
JP5461388B2 (en) 2014-04-02

Legal Events

Date Code Title Description
A621 Written request for application examination (Free format text: JAPANESE INTERMEDIATE CODE: A621; Effective date: 20120308)
A131 Notification of reasons for refusal (Free format text: JAPANESE INTERMEDIATE CODE: A131; Effective date: 20130903)
A521 Request for written amendment filed (Free format text: JAPANESE INTERMEDIATE CODE: A523; Effective date: 20131101)
TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model) (Free format text: JAPANESE INTERMEDIATE CODE: A01; Effective date: 20131217)
A61 First payment of annual fees (during grant procedure) (Free format text: JAPANESE INTERMEDIATE CODE: A61; Effective date: 20140115)
R150 Certificate of patent or registration of utility model (Free format text: JAPANESE INTERMEDIATE CODE: R150)
LAPS Cancellation because of no payment of annual fees