JP6259377B2

JP6259377B2 - Dialog system evaluation method, dialog system evaluation apparatus, and program

Info

Publication number: JP6259377B2
Application number: JP2014170516A
Authority: JP
Inventors: 杉山 弘晃; 目黒 豊美; 東中 竜一郎
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-08-25
Filing date: 2014-08-25
Publication date: 2018-01-10
Anticipated expiration: 2034-08-25
Also published as: JP2016045769A

Description

The present invention relates to a technique for automatically evaluating the utterances generated by a system that converses with a user in natural language (hereinafter referred to as a dialog system).

In recent years, demand has grown for chat dialog systems, which carry on open-domain small talk without a specific task. One of the challenges in improving a chat dialog system is evaluating the system once it is built. A task-oriented dialog system has clear evaluation metrics, such as the task completion rate and the time taken to complete the task, so evaluating it is relatively easy. In a chat dialog system, however, the correct output is not necessarily obvious. For this reason, the mainstream approach has been to have people rate the system's output sentences on an ordinal scale such as a Likert scale and to take the average of the ratings.

However, values assigned on an ordinal scale are relative: the ordering is consistent, but the average may differ from one evaluation run to the next. Consequently, to compare a conventional system with a proposed system by their average ratings, the conventional system must be re-implemented and tested in the same experiment as the proposed system. Because comparison with existing research is therefore not easy, a mechanism that assigns evaluation values automatically and reproducibly is needed.

One attempt to evaluate task-oriented dialog systems automatically is PARADISE, proposed in Non-Patent Document 1. It evaluates the quality of a dialog that has already taken place, based on features obtained from the dialog such as utterance length and the number of utterances. As a framework for automatically evaluating the sentences a system outputs, there is also the technique described in Non-Patent Document 2.

Non-Patent Document 1: Marilyn Walker, Candace Kamm, Diane Litman, “Towards developing general models of usability with PARADISE”, Natural Language Engineering, vol. 6, no. 3-4, pp. 363-377, 2000.
Non-Patent Document 2: Alan Ritter, Colin Cherry, Bill Dolan, “Data-Driven Response Generation in Social Media”, In proceedings of EMNLP, 2011.

However, the method described in Non-Patent Document 1 requires a dialog to be conducted each time a dialog system is evaluated, which in turn requires a partner to converse with the system. If the partner is a human, the evaluation values vary from experiment to experiment, just as with manual rating. If the partner is another dialog system, the quality of the dialog may be degraded by that partner system, because at present no dialog system can respond as a human does. From this standpoint, an evaluation method that does not involve an actual dialog is desirable.

The automatic evaluation of system output sentences has been studied actively in the field of machine translation. For example, automatic metrics have been developed in which the system outputs a sentence for an input sentence, and the distance between that output and a single reference sentence is computed with a dedicated function such as the BLEU score or the ROUGE score and reported as the evaluation value. When such a metric is applied to chat dialog, a single reference sentence cannot cover the range of acceptable responses, which is far wider than in machine translation. Automatic evaluation based on the distance to a reference sentence is therefore difficult for chat dialog.

In view of this technical background, an object of the present invention is to provide a dialog system evaluation technique that automatically evaluates a chat dialog system having no task, without human intervention.

To solve the above problem, the dialog system evaluation method of the present invention includes an output sentence acquisition step, in which an output sentence acquisition unit inputs an input sentence to a dialog system having no specific task and obtains a system output sentence from the dialog system, and an evaluation value calculation step, in which an evaluation value calculation unit calculates a system evaluation value that rates the system output sentence based on reference sentences predetermined for the input sentence.

According to the dialog system evaluation technique of the present invention, a chat dialog system having no task can be evaluated automatically, without human intervention. Because the dialog system can thus be evaluated quickly and at low cost, it can be improved efficiently.

FIG. 1 is a diagram illustrating the functional configuration of the dialog system evaluation apparatus according to the first embodiment.
FIG. 2 is a diagram illustrating the processing flow of the dialog system evaluation method according to the first embodiment.
FIG. 3 is a diagram illustrating the functional configuration of the dialog system evaluation apparatus according to the second and third embodiments.
FIG. 4 is a diagram illustrating the processing flow of the dialog system evaluation method according to the second and third embodiments.
FIG. 5 is a diagram illustrating the functional configuration of the dialog system evaluation apparatus according to the fourth embodiment.
FIG. 6 is a diagram illustrating the processing flow of the dialog system evaluation method according to the fourth embodiment.
FIG. 7 is a diagram illustrating the functional configuration of the dialog system evaluation apparatus according to the fifth embodiment.
FIG. 8 is a diagram illustrating the processing flow of the dialog system evaluation method according to the fifth embodiment.

The present invention is a dialog system evaluation apparatus and method that input a sentence to a dialog system and calculate an evaluation value by comparing the sentence output by the dialog system with reference sentences. The invention solves the problem of the prior art by scaling the reference sentences up to roughly several tens to several hundred sentences so that they cover the range of acceptable utterances. In addition, evaluation values may be assigned to the reference sentences in advance and used to estimate the evaluation value of a system output sentence (hereinafter referred to as the system evaluation value). The evaluation value of a reference sentence can be obtained, for example, by having a person assign a value directly, or by comparing the reference sentences pairwise to judge which is better and using each sentence's win rate as its evaluation value.

In the following description, the reference sentences are assumed to include positive examples, such as sentences written by hand to be correct responses, and negative examples, such as sentences written to be incorrect responses or sentences extracted automatically from a corpus. The dialog system evaluation apparatus and method of the present invention nevertheless also operate when either kind is omitted, that is, with positive examples only or with negative examples only.

The description below is divided into five embodiments according to the type of data used.

The first embodiment is the simplest form and uses only the reference sentences. A measure of inter-sentence similarity, such as the BLEU score or ROUGE score used in the automatic evaluation of machine translation, the tf-idf weighted cosine distance, or the word error rate (WER), is calculated between the output sentence of the dialog system and each reference sentence, and the average of the top N values (N being a natural number of about 1 to 7) is used as the system evaluation value. For details of the BLEU score, see Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu, “BLEU: a method for Automatic Evaluation of Machine Translation”, ACL '02, pp. 311-318, 2002 (Reference 1). For details of the ROUGE score, see Lin, Chin-Yew, and Eduard Hovy, “Automatic evaluation of summaries using n-gram co-occurrence statistics”, NAACL '03, vol. 1, pp. 71-78, 2003 (Reference 2).

In the method of the first embodiment, if the reference sentences include poorly rated ones, a system output sentence that ought to receive a low system evaluation value may receive a high one. When taking the average, therefore, a reference sentence whose evaluation value falls below a threshold may be excluded from the calculation of the system evaluation value. The system evaluation value obtained by this method is not on the same scale as the evaluation values assigned to the reference sentences, so the two cannot be compared.

The second to fourth embodiments use, in addition to the reference sentences, an evaluation value assigned to each reference sentence. The evaluation value may be assigned directly by a person, or people may judge which sentence of each pair of reference sentences is more appropriate and the resulting win rate may be assigned as the evaluation value. The following three methods of calculating the system evaluation value from this type of data are conceivable.

The second embodiment calculates a measure of inter-sentence similarity as in the first embodiment, but restricts the average of the similarities to reference sentences whose evaluation value is at or above a threshold. In the first embodiment, if the reference sentences include poorly rated ones, a system output sentence that ought to receive a low system evaluation value may receive a high one. By excluding such reference sentences, this method is expected to calculate the system evaluation value more appropriately.

The third embodiment weights the evaluation values by the obtained similarities and sums them. Instead of summing over all reference sentences, only the top N (N being a natural number of about 1 to 7) may be summed. Because the evaluation values are used more directly than in the second embodiment, this method is expected to assign an appropriately low system evaluation value, particularly when the system output is highly similar to a poorly rated reference sentence. Moreover, the system evaluation value obtained for a system output sentence is on the same scale as the evaluation values assigned to the reference sentences, so the two can be compared.

The fourth embodiment estimates the system evaluation value directly using a regression model such as Support Vector Regression (SVR). For details of SVR, see Smola, Alex J., and Bernhard Scholkopf, “A tutorial on support vector regression”, Statistics and computing, vol. 14(3), pp. 199-222, 2004 (Reference 3). A regression model is given pairs of input features and output values (here, system evaluation values) as correct answers in advance, stores the correspondence as parameters, and estimates the output value when previously unseen features are input. Possible features for this regression model include the words contained in the reference sentences and the system output sentence, and similarities such as the BLEU score with respect to each reference sentence.

The fifth embodiment uses only the win/loss outcome of each pair of reference sentences as the evaluation value. The win/loss outcomes assigned to the pairs of reference sentences are estimated with a classification model such as a Support Vector Machine (SVM), and the win rate against the reference sentences is then recalculated and used as the system evaluation value. For details of SVM, see Cortes, Corinna, and Vladimir Vapnik, “Support-vector networks”, Machine learning, vol. 20(3), pp. 273-297, 1995 (Reference 4).
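
As a rough illustration of the win-rate idea, the sketch below counts how many reference sentences a system output is predicted to beat. The names `estimated_win_rate` and `predict_win` are hypothetical; `predict_win` stands in for whatever trained pairwise classifier (for example an SVM) is available, and its interface is an assumption rather than something the description specifies.

```python
from typing import Callable, Sequence


def estimated_win_rate(
    system_output: str,
    references: Sequence[str],
    predict_win: Callable[[str, str], bool],
) -> float:
    """Fraction of reference sentences that the system output is predicted to beat.

    predict_win(a, b) is a hypothetical pairwise classifier that returns True
    when sentence a is judged better than sentence b.
    """
    if not references:
        raise ValueError("at least one reference sentence is required")
    wins = sum(1 for ref in references if predict_win(system_output, ref))
    return wins / len(references)
```

Any callable with the same signature can be plugged in, so the classifier can be swapped without touching the win-rate calculation.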

Embodiments of the present invention are described in detail below. In the drawings, components having the same function are given the same reference numerals, and duplicate description is omitted.

[First embodiment]
As shown in FIG. 1, the dialog system evaluation apparatus 1 of the first embodiment includes, for example, a reference sentence database 10, an output sentence acquisition unit 11, an inter-sentence similarity calculation unit 12, and an evaluation value calculation unit 13.

The dialog system evaluation apparatus is a special apparatus configured, for example, by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (RAM: random access memory), and the like. The apparatus executes each process under the control of the central processing unit, for example. Data input to the apparatus and data obtained in each process are stored, for example, in the main storage device, and are read out as needed for use in other processes. At least some of the processing units of the apparatus may be implemented in hardware such as an integrated circuit.

Each storage unit of the dialog system evaluation apparatus can be implemented, for example, by a main storage device such as RAM (random access memory), by an auxiliary storage device composed of a hard disk, an optical disc, or a semiconductor memory element such as flash memory, or by middleware such as a relational database or a key-value store. The storage units need only be logically separate and may reside on a single physical storage device.

The dialog system evaluation apparatus 1 is configured to be able to access an external dialog system 9. The dialog system 9 is a chat dialog system having no specific task. Although FIG. 1 shows the dialog system evaluation apparatus and the dialog system as separate devices, they may be implemented as a single device that combines the functions of both.

The reference sentence database 10 stores input sentences and, for each input sentence, a plurality of corresponding reference sentences.

The method of creating the reference sentence database is described below.

First, arbitrary input sentences are prepared. An input sentence may be written by hand, transcribed from a dialog that actually took place, or extracted from posts published on web services such as Twitter (registered trademark) or blogs.

Next, reference sentences are created for each input sentence. They may be written by hand. A reference sentence is a sentence written to be a correct response to the input sentence (hereinafter referred to as a positive example), but to widen the coverage of the reference sentences, negative examples may be added, such as sentences written to be incorrect responses (for example, by hiding part of the input sentence) or sentences extracted automatically from a corpus. The dialog system evaluation apparatus 1 operates even if either kind is omitted. At this stage, the reference sentences need not be labeled as positive or negative examples. Each collected input sentence is stored in the reference sentence database 10 together with its corresponding set of reference sentences.

The dialog system evaluation method of the first embodiment is described with reference to FIG. 2.

In step S11, the output sentence acquisition unit 11 inputs an input sentence obtained from the reference sentence database 10 to the dialog system 9 and obtains a system output sentence from the dialog system 9. The system output sentence is sent to the inter-sentence similarity calculation unit 12 together with the input sentence and its reference sentence set obtained from the reference sentence database 10.

In step S12, the inter-sentence similarity calculation unit 12 calculates the similarity between the system output sentence and each reference sentence in the reference sentence set corresponding to the input sentence. The similarity may be a general measure such as tf-idf weighted cosine similarity or the word error rate (WER), or a measure that takes word combinations into account, such as the BLEU score or the ROUGE score. The words in each sentence may be used as they are, or the word concepts may be abstracted with a dictionary such as “日本語語彙大系” (Nihongo Goi-Taikei), supervised by NTT Communication Science Laboratories, edited by Ikehara et al., Iwanami Shoten (Reference 5), before the similarity is calculated. The obtained similarities are sent to the evaluation value calculation unit 13 together with the system output sentence and the pair of the input sentence and the reference sentence set.
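
Two of the similarity measures named above can be sketched as follows. This is a minimal illustration under stated assumptions: sentences are already tokenized into word lists (Japanese text would first need morphological analysis), the `idf` table is supplied by the caller, and converting WER into a similarity as `1 - WER` clipped at zero is a choice made here, not something the description prescribes. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[str], b: Sequence[str],
                      idf: Optional[Dict[str, float]] = None) -> float:
    """Cosine similarity of term-count vectors, optionally weighted by idf."""
    idf = idf or {}
    va = {t: c * idf.get(t, 1.0) for t, c in Counter(a).items()}
    vb = {t: c * idf.get(t, 1.0) for t, c in Counter(b).items()}
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    norm = (math.sqrt(sum(w * w for w in va.values()))
            * math.sqrt(sum(w * w for w in vb.values())))
    return dot / norm if norm else 0.0


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Word-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


def wer_similarity(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Assumed conversion of WER (lower is better) into a similarity in [0, 1]."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))
```

Because WER is an error rate, it has to be inverted in some way before it can be ranked alongside measures where larger means more similar.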

In step S13, the evaluation value calculation unit 13 calculates the system evaluation value based on the system output sentence, the pair of the input sentence and the reference sentence set, and all or some of the reference sentence similarities. Specifically, for each system output sentence, the N reference sentences with the highest similarity are selected (N being a natural number of about 1 to 7), and the average of the N similarities is used as the system evaluation value.
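
Steps S11 to S13 can be sketched end to end as below. The dialog system is modeled as a plain callable and `token_overlap` is a toy similarity used only to keep the example self-contained; both, like the function names, are assumptions made for illustration.

```python
from typing import Callable, Sequence


def top_n_mean(similarities: Sequence[float], n: int = 3) -> float:
    """Mean of the n largest similarities (all of them if fewer than n exist)."""
    if not similarities:
        raise ValueError("at least one similarity is required")
    top = sorted(similarities, reverse=True)[:n]
    return sum(top) / len(top)


def evaluate_output(dialog_system: Callable[[str], str],
                    input_sentence: str,
                    references: Sequence[str],
                    similarity: Callable[[str, str], float],
                    n: int = 3) -> float:
    system_output = dialog_system(input_sentence)                  # step S11
    sims = [similarity(ref, system_output) for ref in references]  # step S12
    return top_n_mean(sims, n)                                     # step S13


def token_overlap(a: str, b: str) -> float:
    """Toy similarity: Jaccard overlap of whitespace-separated tokens."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0
```

Averaging only the top N similarities means the score reflects how close the output is to its nearest acceptable responses, rather than penalizing it for differing from the many other valid references.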

[Second embodiment]
As shown in FIG. 3, the dialog system evaluation apparatus 2 of the second embodiment includes the output sentence acquisition unit 11 and the inter-sentence similarity calculation unit 12 as in the first embodiment, and further includes a reference sentence database 20 and an evaluation value calculation unit 23.

As in the first embodiment, the dialog system evaluation apparatus 2 is configured to be able to access an external dialog system 9.

The reference sentence database 20 stores input sentences, a plurality of reference sentences corresponding to each input sentence, and an evaluation value corresponding to each reference sentence. That is, it differs from the reference sentence database 10 in that an evaluation value is assigned to each reference sentence.
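
One possible in-memory shape for an entry of this database is sketched below. The class and field names are hypothetical, and the description leaves the actual storage format open (main memory, a relational database, a key-value store, and so on).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScoredReference:
    text: str          # reference sentence
    evaluation: float  # evaluation value assigned to this reference sentence


@dataclass
class ReferenceEntry:
    input_sentence: str
    references: List[ScoredReference] = field(default_factory=list)


# Illustrative entry; the sentences and scores are made-up sample data.
entry = ReferenceEntry(
    input_sentence="What did you eat today?",
    references=[
        ScoredReference("I had ramen for lunch.", 4.5),
        ScoredReference("The weather is nice.", 1.0),
    ],
)
```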

The method of creating the reference sentence database of the second embodiment is described below. The input sentences and reference sentences are created in the same way as in the first embodiment, so that description is omitted here.

The evaluation value can be assigned, for example, by having a person assign a value directly, or by comparing the reference sentences pairwise to judge which is better and using each sentence's win rate as its evaluation value. In the latter case, the win/loss outcome of each individual pair need not be stored. The obtained evaluation values are stored in the reference sentence database 20 together with the input sentence and the reference sentence set.
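
The win-rate variant can be sketched as follows, assuming each human judgment is recorded as a `(winner, loser)` pair of reference sentences. The function name and the input format are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def win_rates(comparisons: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Win rate of each sentence over the pairwise comparisons it took part in.

    comparisons: (winner, loser) pairs, one per human judgment.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {sentence: wins[sentence] / games[sentence] for sentence in games}
```

Only the aggregated rate per sentence is returned, which matches the note above that the individual pair outcomes need not be kept.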

The dialog system evaluation method of the second embodiment is described with reference to FIG. 4. The description below focuses on the differences from the first embodiment.

In step S23, the evaluation value calculation unit 23 calculates the system evaluation value based on the system output sentence, the set consisting of the input sentence, the reference sentence set, and the evaluation value set, and all or some of the reference sentence similarities. Specifically, reference sentences whose evaluation value is at or below a predetermined threshold are excluded; then, for each system output sentence, the N reference sentences with the highest similarity are selected (N being a natural number of about 1 to 7), and the average of the N similarities or of the N evaluation values is used as the system evaluation value.
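
A minimal sketch of step S23, using the variant that averages similarities, is shown below. It assumes each reference sentence has already been reduced to a `(similarity, evaluation)` pair; the function name and the choice to return `None` when every reference is filtered out are assumptions.

```python
from typing import Optional, Sequence, Tuple


def thresholded_top_n_mean(scored: Sequence[Tuple[float, float]],
                           threshold: float,
                           n: int = 3) -> Optional[float]:
    """Mean of the top-n similarities among sufficiently well-rated references.

    scored: (similarity, evaluation) pairs, one per reference sentence.
    References whose evaluation is at or below the threshold are excluded.
    """
    kept = [sim for sim, evaluation in scored if evaluation > threshold]
    if not kept:
        return None  # no reference survived the filter
    top = sorted(kept, reverse=True)[:n]
    return sum(top) / len(top)
```

In the test data used with this sketch, the reference with similarity 0.9 has a low evaluation, so it no longer inflates the score.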

[Third embodiment]
As shown in FIG. 3, the dialog system evaluation apparatus 3 of the third embodiment includes the reference sentence database 20, the output sentence acquisition unit 11, and the inter-sentence similarity calculation unit 12 as in the second embodiment, and further includes an evaluation value calculation unit 33.

As in the embodiments described above, the dialog system evaluation apparatus 3 is configured to be able to access an external dialog system 9.

The dialog system evaluation method of the third embodiment is described with reference to FIG. 4. The description below focuses on the differences from the second embodiment.

In step S33, the evaluation value calculation unit 33 calculates the system evaluation value based on the system output sentence, the set consisting of the input sentence, the reference sentence set, and the evaluation value set, and all or some of the reference sentence similarities. Specifically, reference sentences whose evaluation value is at or below a predetermined threshold are excluded; then, for each system output sentence, the N reference sentences with the highest similarity are selected (N being a natural number of about 1 to 7), and the average of their evaluation values weighted by similarity is used as the system evaluation value.
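
The similarity-weighted average of step S33 can be sketched as below, again assuming `(similarity, evaluation)` pairs per reference sentence. Making the threshold optional (so that poorly rated references can still pull the score down) and returning `None` when the weights sum to zero are choices made for this illustration.

```python
from typing import Optional, Sequence, Tuple


def similarity_weighted_evaluation(scored: Sequence[Tuple[float, float]],
                                   threshold: Optional[float] = None,
                                   n: int = 3) -> Optional[float]:
    """Average of reference evaluation values, weighted by similarity.

    scored: (similarity, evaluation) pairs, one per reference sentence.
    threshold: if given, references rated at or below it are excluded first.
    """
    kept = [(s, e) for s, e in scored if threshold is None or e > threshold]
    top = sorted(kept, key=lambda pair: pair[0], reverse=True)[:n]
    total_weight = sum(s for s, _ in top)
    if total_weight == 0:
        return None  # nothing comparable to the system output
    return sum(s * e for s, e in top) / total_weight
```

Because the result is a weighted mean of the reference evaluation values, it stays on the same scale as those values.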

[Fourth embodiment]
As shown in FIG. 5, the dialog system evaluation apparatus 4 of the fourth embodiment includes the reference sentence database 20, the output sentence acquisition unit 11, and the inter-sentence similarity calculation unit 12 as in the third embodiment, and further includes a learning data storage unit 40, a regression model learning unit 41, a regression model parameter storage unit 42, a feature extraction unit 43, and an evaluation value calculation unit 44.

As in the embodiments described above, the dialog system evaluation apparatus 4 is configured to be able to access an external dialog system 9.

The learning data storage unit 40 stores the features of each reference sentence held in the reference sentence database 20 in association with the evaluation value assigned to that reference sentence. The features may be, for example, the words contained in the reference sentences and system output sentences, or similarities such as the BLEU score with respect to each reference sentence.

The regression model parameter storage unit 42 stores the parameters of a regression model. The parameters are tuned by inputting the set of feature and evaluation value pairs stored in the learning data storage unit 40 to the regression model learning unit 41, so that when a given feature is input, the corresponding evaluation value is output. For example, the SVR described above can be used as the regression model.

The dialog system evaluation method of the fourth embodiment is described with reference to FIG. 6. The description below focuses on the differences from the third embodiment described above.

In step S43, the feature extraction unit 43 extracts features from the system output sentence of the dialog system 9. The features extracted are of the same kind as the reference sentence features stored in the learning data storage unit 40. The extracted features are sent to the evaluation value calculation unit 44.

In step S44, the evaluation value calculation unit 44 uses the regression model parameters obtained from the regression model parameter storage unit 42 to predict an evaluation value from the features of the system output sentence, and takes the prediction as the system evaluation value.
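
The flow of the fourth embodiment (learn from reference sentences and their evaluation values, then predict a value for a system output sentence) can be sketched as below. To stay dependency-free, the sketch substitutes a plain least-squares linear model trained by stochastic gradient descent for the SVR named in the text, and uses a hypothetical binary bag-of-words feature extractor; the function names, learning rate, and epoch count are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def extract_features(sentence: str) -> Dict[str, float]:
    """Hypothetical feature extractor: binary bag-of-words plus a bias term."""
    feats = {word: 1.0 for word in sentence.lower().split()}
    feats["<bias>"] = 1.0
    return feats


def predict(weights: Dict[str, float], feats: Dict[str, float]) -> float:
    """Linear prediction; features unseen during training contribute nothing."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


def train(
    data: List[Tuple[str, float]], epochs: int = 500, lr: float = 0.1
) -> Dict[str, float]:
    """Fit weights on (reference sentence, evaluation value) pairs.

    Stand-in for the regression model learning unit: squared-error SGD,
    not the SVR mentioned in the text.
    """
    weights: Dict[str, float] = {}
    for _ in range(epochs):
        for sentence, target in data:
            feats = extract_features(sentence)
            error = predict(weights, feats) - target
            # Gradient step on the squared error for this single example.
            for name, value in feats.items():
                weights[name] = weights.get(name, 0.0) - lr * error * value
    return weights
```

The same `extract_features` function is applied to reference sentences during training and to system output sentences at prediction time, mirroring the requirement that both use the same kind of features.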

[Fifth embodiment]
As shown in FIG. 7, the dialog system evaluation device 5 of the fifth embodiment includes the output sentence acquisition unit 11, the inter-sentence similarity calculation unit 12, the learning data storage unit 30, and the feature extraction unit 43, as in the fourth embodiment, and further includes a reference sentence database 50, a classification model learning unit 51, a classification model parameter storage unit 52, and an evaluation value calculation unit 54.

Like the embodiments described above, the dialog system evaluation device 5 is configured so that it can access an external dialog system 9.

The reference sentence database 50 stores input sentences, a plurality of reference sentences corresponding to each input sentence, and an evaluation value corresponding to each reference sentence. However, the evaluation values of the fifth embodiment are obtained differently: they are limited to win/loss labels assigned by hand to pairs of reference sentences, according to a human judgment of which sentence in the pair is the more appropriate response.

The classification model parameter storage unit 52 stores the parameters of a classification model. The parameters are tuned by inputting the features and the evaluation value (for example, win: 1, loss: 0) of each reference sentence pair to the classification model learning unit 51, so that when two sets of features are input, the model outputs the evaluation value indicating the corresponding win or loss. For example, the SVM described above can be used as the classification model.
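
A minimal sketch of learning from win/loss labels on sentence pairs is shown below. It makes two assumptions that are not in the text: a pair is encoded as the feature-wise difference of its two sentences, and a simple perceptron stands in for the SVM. The feature names in the usage note are invented for illustration.

```python
from typing import Dict, List, Tuple

Features = Dict[str, float]


def difference(a: Features, b: Features) -> Features:
    """Feature-wise a - b (assumed encoding of a sentence pair)."""
    return {key: a.get(key, 0.0) - b.get(key, 0.0) for key in set(a) | set(b)}


def score(weights: Features, a: Features, b: Features) -> float:
    """Positive when the model prefers the first sentence of the pair."""
    return sum(weights.get(k, 0.0) * v for k, v in difference(a, b).items())


def train_pairwise(
    pairs: List[Tuple[Features, Features, int]], epochs: int = 20
) -> Features:
    """pairs holds (first, second, label); label is 1 if first wins, 0 if it loses."""
    weights: Features = {}
    for _ in range(epochs):
        for a, b, label in pairs:
            sign = 1.0 if label == 1 else -1.0
            if sign * score(weights, a, b) <= 0:
                # Misclassified or undecided: move the weights toward the correct side.
                for key, value in difference(a, b).items():
                    weights[key] = weights.get(key, 0.0) + sign * value
    return weights


def predict_win(weights: Features, a: Features, b: Features) -> int:
    """Return 1 (win) if the first sentence is predicted to beat the second, else 0."""
    return 1 if score(weights, a, b) > 0 else 0
```

For example, training on pairs such as `({"on_topic": 1.0}, {"generic": 1.0}, 1)` teaches the model to prefer sentences carrying the `on_topic` feature over those carrying `generic`.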

The dialog system evaluation method of the fifth embodiment is described with reference to FIG. 8. The description below focuses on the differences from the fourth embodiment described above.

In step S54, the evaluation value calculation unit 54 uses the classification model parameters obtained from the classification model parameter storage unit 52 to predict a win/loss evaluation value from the features of the system output sentence and the features of each reference sentence, and calculates the win rate over the predicted wins and losses as the system evaluation value.
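
The win-rate calculation of step S54 can be sketched as follows. The classifier is passed in as a callable so that any trained pairwise model can be plugged in; `toy_wins` is a deliberately trivial placeholder rather than a trained model, and counting a tie as a loss and rejecting an empty reference list are assumptions of this sketch.

```python
from typing import Callable, Dict, List

Features = Dict[str, float]


def win_rate(
    system_feats: Features,
    reference_feats: List[Features],
    wins: Callable[[Features, Features], bool],
) -> float:
    """Fraction of reference sentences that the system output sentence beats."""
    if not reference_feats:
        raise ValueError("at least one reference sentence is required")
    # Compare the system output sentence against every reference sentence.
    n_wins = sum(1 for ref in reference_feats if wins(system_feats, ref))
    return n_wins / len(reference_feats)


def toy_wins(a: Features, b: Features) -> bool:
    """Placeholder classifier (assumption): the larger total feature mass wins."""
    return sum(a.values()) > sum(b.values())
```

A system output sentence that beats every reference sentence receives a system evaluation value of 1.0, and one that loses every comparison receives 0.0.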

As described above, the dialog system evaluation device and method of the present invention prepare reference sentences on a large scale and make appropriate use of them, for example by combining them with evaluation values, so that system evaluation values can be assigned appropriately even to chat dialog systems that have no specific task and cover a wide range of topics. Because dialog systems can be evaluated quickly and inexpensively, they can be improved efficiently.

The present invention is not limited to the embodiments described above, and it goes without saying that modifications may be made as appropriate without departing from the spirit of the invention. The various processes described in the above embodiments need not be executed in time series in the order described; they may also be executed in parallel or individually, according to the processing capability of the device executing them or as needed.

[Program, recording medium]
When the various processing functions of each device described in the above embodiments are implemented by a computer, the processing of the functions each device should have is described by a program. Executing this program on a computer implements the processing functions of each device on that computer.

The program describing this processing can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the program it has read. As other ways of executing the program, the computer may read the program directly from the portable recording medium and execute processing according to it, or, each time the program is transferred from the server computer to the computer, the computer may execute processing according to the received program in turn. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to the computer and the processing functions are implemented solely through execution instructions and result retrieval. The program in this embodiment includes information that is provided for processing by a computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the device is configured by executing a predetermined program on a computer, but at least part of the processing may instead be implemented in hardware.

1, 2, 3, 4, 5 Dialog system evaluation device
9 Dialog system
10, 20, 50 Reference sentence database
11 Output sentence acquisition unit
12 Inter-sentence similarity calculation unit
13, 23, 33, 44, 54 Evaluation value calculation unit
40 Learning data storage unit
41 Regression model learning unit
42 Regression model parameter storage unit
43 Feature extraction unit
51 Classification model learning unit
52 Classification model parameter storage unit

Claims

A dialog system evaluation method in which a reference sentence database stores input sentences, a plurality of reference sentences predetermined for each of the input sentences, and an evaluation value assigned to each of the reference sentences, the method including:
an output sentence acquisition step in which an output sentence acquisition unit inputs the input sentence to a dialog system having no specific task and obtains a system output sentence from the dialog system;
an inter-sentence similarity calculation step in which an inter-sentence similarity calculation unit calculates an inter-sentence similarity between the system output sentence and each of the reference sentences; and
an evaluation value calculation step in which an evaluation value calculation unit selects reference sentences based on the evaluation values and the similarities, and calculates a system evaluation value for evaluating the system output sentence using at least one of the similarity or the evaluation value of the selected reference sentences.

A dialog system evaluation method in which a reference sentence database stores input sentences, a plurality of reference sentences predetermined for each of the input sentences, and an evaluation value assigned to each of the reference sentences, and a regression model parameter storage unit stores parameters of a regression model that is trained using features extracted from the reference sentences and the evaluation values and that, when a feature is input, outputs the evaluation value corresponding to that feature, the method including:
an output sentence acquisition step in which an output sentence acquisition unit inputs the input sentence to a dialog system having no specific task and obtains a system output sentence from the dialog system; and
an evaluation value calculation step in which an evaluation value calculation unit inputs features extracted from the system output sentence to the regression model and takes the evaluation value output by the regression model as a system evaluation value for evaluating the system output sentence.

A dialog system evaluation method in which a reference sentence database stores input sentences, a plurality of reference sentences predetermined for each of the input sentences, and, for each pair of the reference sentences, an evaluation value representing a win or loss that indicates which of the pair is more appropriate, and a classification model parameter storage unit stores parameters of a classification model that is trained using features extracted from the reference sentences and the evaluation values and that, when two features are input, outputs the win or loss, the method including:
an output sentence acquisition step in which an output sentence acquisition unit inputs the input sentence to a dialog system having no specific task and obtains a system output sentence from the dialog system; and
an evaluation value calculation step in which an evaluation value calculation unit inputs features extracted from the system output sentence and features extracted from the reference sentences to the classification model, and takes a win rate calculated from the wins and losses output by the classification model as a system evaluation value for evaluating the system output sentence.

A dialog system evaluation device including:
a reference sentence database that stores input sentences, a plurality of reference sentences predetermined for each of the input sentences, and an evaluation value assigned to each of the reference sentences;
an output sentence acquisition unit that inputs the input sentence to a dialog system having no specific task and obtains a system output sentence from the dialog system;
an inter-sentence similarity calculation unit that calculates an inter-sentence similarity between the system output sentence and each of the reference sentences; and
an evaluation value calculation unit that selects reference sentences based on the evaluation values and the similarities, and calculates a system evaluation value for evaluating the system output sentence using at least one of the similarity or the evaluation value of the selected reference sentences.

A dialog system evaluation device including:
a reference sentence database that stores input sentences, a plurality of reference sentences predetermined for each of the input sentences, and an evaluation value assigned to each of the reference sentences;
a regression model parameter storage unit that stores parameters of a regression model that is trained using features extracted from the reference sentences and the evaluation values and that, when a feature is input, outputs the evaluation value corresponding to that feature;
an output sentence acquisition unit that inputs the input sentence to a dialog system having no specific task and obtains a system output sentence from the dialog system; and
an evaluation value calculation unit that inputs features extracted from the system output sentence to the regression model and takes the evaluation value output by the regression model as a system evaluation value for evaluating the system output sentence.

A dialog system evaluation device including:
a reference sentence database that stores input sentences, a plurality of reference sentences predetermined for each of the input sentences, and, for each pair of the reference sentences, an evaluation value representing a win or loss that indicates which of the pair is more appropriate;
a classification model parameter storage unit that stores parameters of a classification model that is trained using features extracted from the reference sentences and the evaluation values and that, when two features are input, outputs the win or loss;
an output sentence acquisition unit that inputs the input sentence to a dialog system having no specific task and obtains a system output sentence from the dialog system; and
an evaluation value calculation unit that inputs features extracted from the system output sentence and features extracted from the reference sentences to the classification model, and takes a win rate calculated from the wins and losses output by the classification model as a system evaluation value for evaluating the system output sentence.

A program for causing a computer to execute each step of the dialog system evaluation method according to any one of claims 1 to 3.