JP6617053B2

JP6617053B2 - Utterance semantic analysis program, apparatus and method for improving understanding of context meaning by emotion classification

Info

Publication number: JP6617053B2
Application number: JP2016037563A
Authority: JP
Inventors: 剣明呉; 聿津湯; 智基矢崎
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2016-02-29
Filing date: 2016-02-29
Publication date: 2019-12-04
Anticipated expiration: 2036-02-29
Also published as: JP2017156854A

Description

The present invention relates to a technique for understanding the contextual meaning of text obtained from a user's utterance. In particular, it relates to a technique applied to a dialogue system that converses with a user by following a dialogue scenario.

Dialogue systems that converse naturally with humans are becoming widespread, particularly on smartphones and tablet terminals. Many of them advance the dialogue with the user according to a dialogue scenario. For example, dialogue systems such as "Siri (registered trademark)" and "Shabette Concier (registered trademark)" convert the user's speech into text and estimate the meaning of the utterance from its contextual structure. The system then responds to the user based on the dialogue scenario corresponding to that meaning.

There is also a technique that estimates the user's emotion from the uttered text and responds to the user with a dialogue scenario suited to that emotion (see, for example, Patent Document 1). This technique classifies emotions as "positive / negative / neutral". Specifically, using an emotion word dictionary that associates emotion words with emotion classifications, it extracts emotion words from the uttered text and advances the dialogue scenario based on the emotion classification corresponding to each extracted word. As a result, the user feels that the dialogue system empathizes with them and can continue the dialogue with a sense of trust.

JP 2006-178063 A
JP 2014-219937 A
Japanese Patent No. 4722573

"Basic knowledge, installation, and basic usage for getting started with image analysis using Caffe", [online], [retrieved February 22, 2016], Internet <URL: http://www.atmarkit.co.jp/ait/articles/1511/09/news008.html>

In realistic dialogue between humans, the characteristics of the Japanese language often make it difficult to understand the user's true intention from the words alone (that is, from meaning understanding based on text-based context). For example, with predicates (phrases) such as the following, it is difficult to judge whether the user's intention is Yes or No.
「結構です」, 「いいです」, 「大丈夫です」 (each can mean either "that would be fine" or "no, thank you")
「〜できないわけではない」 ("it is not that ... cannot be done"), 「〜できるわけではない」 ("it is not that ... can be done")

For example, if the user says 「バーなんて結構ですね」 ("A bar sounds nice"), 「結構です」 should be judged as Yes.
On the other hand, if the user says 「残業があるので、呑み会なんて結構です」 ("I have overtime work, so I will pass on the drinking party") and 「結構です」 is judged as Yes, the dialogue itself breaks down completely.
Thus, in a spoken dialogue system for a user, understanding affirmative versus negative meaning in particular greatly affects the progress of the dialogue scenario.

In view of this, the inventors of the present application considered that taking the user's biometric information into account (for example, facial expression, pulse, body temperature, blood pressure, degree of perspiration, and prosodic features of the voice) might be effective in understanding such contextual meaning. That is, they considered that in realistic dialogue between humans, the speaker's facial expression and demeanor may take priority over context-based meaning understanding in conveying the speaker's feelings to the other party.

Accordingly, an object of the present invention is to provide an utterance meaning analysis program, apparatus, and method that, when understanding contextual meaning from a user's utterance, can improve the accuracy of that understanding by using the user's emotion classification. In particular, in a dialogue system for a user, the progress of the dialogue scenario can be kept from breaking down because of errors in understanding affirmative or negative meaning.

According to the present invention, an utterance meaning analysis program that causes a computer to function so as to understand contextual meaning from a user's utterance causes the computer to function as:
context meaning analysis means for outputting a context confidence for each of an affirmative meaning and a negative meaning for the text of the user's utterance;
emotion confidence estimation means for estimating an emotion confidence for each of a positive emotion and a negative emotion from the user's biometric information acquired by a device during the user's utterance; and
utterance meaning complementing means which, when the highest emotion confidence belongs to a negative emotion, compares the confidence obtained by multiplying the context confidence of the affirmative meaning by "1 - the highest emotion confidence of the negative emotion" with the confidence obtained by multiplying the context confidence of the negative meaning by "the highest emotion confidence of the negative emotion",
or, when the highest emotion confidence belongs to a positive emotion, compares the confidence obtained by multiplying the context confidence of the affirmative meaning by "the highest emotion confidence of the positive emotion" with the confidence obtained by multiplying the context confidence of the negative meaning by "1 - the highest emotion confidence of the positive emotion",
and selects the affirmative meaning or the negative meaning, whichever has the higher confidence.
Alternatively, according to the present invention, an utterance meaning analysis program that causes a computer to function so as to understand contextual meaning from a user's utterance causes the computer to function as:
context meaning analysis means for outputting a context confidence for each of an affirmative meaning and a negative meaning for the text of the user's utterance;
emotion confidence estimation means for estimating an emotion confidence for each of a positive emotion and a negative emotion from the user's biometric information acquired by a device during the user's utterance; and
utterance meaning complementing means which compares the confidence obtained by multiplying the context confidence of the affirmative meaning by the sum of "the emotion confidence of the positive emotion" and "1 - the emotion confidence of the negative emotion" with the confidence obtained by multiplying the context confidence of the negative meaning by the sum of "the emotion confidence of the negative emotion" and "1 - the emotion confidence of the positive emotion", and selects the affirmative meaning or the negative meaning, whichever has the higher confidence.

According to another embodiment of the utterance meaning analysis program of the present invention, it is also preferable that the emotion confidence estimation means
is based on the Paul Ekman emotion classification model,
further divides positive emotion into "happiness" and "surprise", and
divides negative emotion into "anger", "disgust", "fear", and "sadness".
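
The grouping of the six emotions into positive and negative classes can be sketched as a simple lookup table. This is only an illustration: the English label strings, the names `EKMAN_POLARITY`, `polarity_of`, and `strongest_emotion`, and the input format (a mapping from label to confidence) are assumptions that do not appear in the publication.

```python
# Hypothetical label strings; the positive/negative grouping follows the
# embodiment described above.
EKMAN_POLARITY = {
    "happiness": "positive",
    "surprise": "positive",
    "anger": "negative",
    "disgust": "negative",
    "fear": "negative",
    "sadness": "negative",
}


def polarity_of(emotion: str) -> str:
    """Return 'positive' or 'negative' for one of the six labels."""
    return EKMAN_POLARITY[emotion]


def strongest_emotion(confidences: dict) -> tuple:
    """Return (label, polarity, confidence) for the highest-confidence emotion.

    `confidences` maps each label to a confidence in [0, 1] (assumed format).
    """
    label = max(confidences, key=confidences.get)
    return label, EKMAN_POLARITY[label], confidences[label]
```

With such a table, the polarity of the highest-confidence emotion decides which branch of the confidence comparison applies.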

According to another embodiment of the utterance meaning analysis program of the present invention, it is also preferable that the utterance meaning complementing means
multiplies the context confidence of the affirmative meaning by the sum of "the emotion confidence of the positive emotion" and "1 - the emotion confidence of the negative emotion", and
multiplies the context confidence of the negative meaning by the sum of "the emotion confidence of the negative emotion" and "1 - the emotion confidence of the positive emotion".
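
As a minimal sketch of this sum-based weighting, assuming exactly one positive and one negative emotion class and confidences expressed as fractions in [0, 1], the two weighted confidences can be computed as follows. The function name `fuse_by_sum`, the return format, and the handling of ties (resolved to "no" here) are assumptions; the publication does not specify them.

```python
def fuse_by_sum(ctx_yes, ctx_no, emo_pos, emo_neg):
    """Weight context confidences by the summed emotion terms.

    ctx_yes, ctx_no: context confidence of the affirmative / negative meaning.
    emo_pos, emo_neg: emotion confidence of the positive / negative emotion.
    Returns (selected meaning, weighted yes score, weighted no score).
    """
    score_yes = ctx_yes * (emo_pos + (1.0 - emo_neg))
    score_no = ctx_no * (emo_neg + (1.0 - emo_pos))
    # Tie handling is an assumption: equal scores fall through to "no".
    meaning = "yes" if score_yes > score_no else "no"
    return meaning, score_yes, score_no


# Illustrative values: text slightly favors Yes, emotion strongly favors No.
meaning, score_yes, score_no = fuse_by_sum(0.52, 0.48, 0.35, 0.80)
print(meaning, round(score_yes, 3), round(score_no, 3))
```

Because both emotion confidences enter each product, a weak positive emotion lowers the affirmative score even when the negative emotion is not dominant.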

According to another embodiment of the utterance meaning analysis program of the present invention, it is also preferable that
the computer is further caused to function as emotion classification analysis means for analyzing the emotion classification from the user's biometric information at every predetermined timing, and
the emotion confidence estimation means receives, from the emotion classification analysis means, a plurality of time-series emotion classifications obtained during the user's utterance and estimates a statistical value for each of the positive emotion and the negative emotion as the emotion confidence.

According to another embodiment of the utterance meaning analysis program of the present invention, it is also preferable that, in the emotion confidence estimation means, the statistical value is the "mean", the "median", the "mode", the "maximum", or an "identification result obtained by machine learning".
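
The following sketch shows how a series of per-timing values for one emotion class might be reduced to a single emotion confidence using the listed statistics. It assumes each sample is a confidence in [0, 1] for that class; the publication does not fix the sample format, and the function name `emotion_confidence` is hypothetical. The machine-learning option is omitted because it depends on a trained model.

```python
from statistics import mean, median, mode


def emotion_confidence(samples, statistic="mean"):
    """Reduce time-series samples for one emotion class to one confidence.

    samples: confidences observed at successive timings during the utterance
             (assumed format).
    statistic: one of "mean", "median", "mode", "maximum".
    """
    reducers = {
        "mean": mean,
        "median": median,
        "mode": mode,      # most frequent sample value
        "maximum": max,
    }
    return reducers[statistic](samples)


negative_samples = [0.7, 0.8, 0.8, 0.9]
for name in ("mean", "median", "mode", "maximum"):
    print(name, emotion_confidence(negative_samples, name))
```

The maximum reacts to a brief strong expression, whereas the mean and median smooth over the whole utterance; which is preferable depends on the application.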

According to another embodiment of the utterance meaning analysis program of the present invention, it is also preferable that
the device is a camera capable of photographing the user's face, and
the emotion classification analysis means analyzes the emotion classification at every predetermined timing from a face image acquired by the camera as the user's biometric information.

According to another embodiment of the utterance meaning analysis program of the present invention, it is also preferable that the emotion classification analysis means analyzes the emotion classification regardless of the predetermined timing when a change of a predetermined amount or more occurs in the features of the user's face image.
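
One way to combine the periodic timing with the change-driven trigger is sketched below. The publication does not say how the change in face-image features is measured; the Euclidean distance between feature vectors, the function name `should_analyze`, and all parameter names are assumptions made for illustration.

```python
def should_analyze(now, last_time, interval, features, last_features, threshold):
    """Decide whether to run emotion classification at this moment.

    Fires when the periodic interval has elapsed, or when the face-image
    features have changed by at least `threshold` since the last analysis.
    The change is measured as a Euclidean distance (assumed metric).
    """
    timer_due = (now - last_time) >= interval
    change = sum((a - b) ** 2 for a, b in zip(features, last_features)) ** 0.5
    return timer_due or change >= threshold


# Timer not yet due, but the features jumped by a distance of 5.0.
print(should_analyze(0.1, 0.0, 0.5, [3.0, 4.0], [0.0, 0.0], 5.0))
```

The change trigger lets a sudden expression be captured between scheduled analyses.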

According to another embodiment of the utterance meaning analysis program of the present invention, it is also preferable that
the emotion classification analysis means analyzes the emotion classification from the user's face image at every predetermined timing by means of a convolutional neural network (CNN / ConvNet), and
the convolutional neural network has learned parameters constructed in advance from training data in which emotion classifications are assigned to a large number of face images.

According to another embodiment of the utterance meaning analysis program of the present invention, it is also preferable that
the device is a sensor capable of measuring the user's pulse, body temperature, blood pressure, or degree of perspiration, or a microphone capable of measuring prosodic features (pitch, intensity, rhythm and tempo) from the user's voice, and
the emotion classification analysis means analyzes the emotion classification at every predetermined timing from the user's biometric information acquired by the sensor.

According to the present invention, a computer is caused to function as the utterance meaning analysis program described above and as a dialogue program that interactively advances to the next scenario for the user in accordance with the positive emotion or negative emotion output from the utterance meaning complementing means.

According to the present invention, an apparatus for understanding contextual meaning from a user's utterance comprises:
context meaning analysis means for outputting a context confidence for each of an affirmative meaning and a negative meaning for the text of the user's utterance;
emotion confidence estimation means for estimating an emotion confidence for each of a positive emotion and a negative emotion from the user's biometric information acquired by a device during the user's utterance; and
utterance meaning complementing means which, when the highest emotion confidence belongs to a negative emotion, compares the confidence obtained by multiplying the context confidence of the affirmative meaning by "1 - the highest emotion confidence of the negative emotion" with the confidence obtained by multiplying the context confidence of the negative meaning by "the highest emotion confidence of the negative emotion",
or, when the highest emotion confidence belongs to a positive emotion, compares the confidence obtained by multiplying the context confidence of the affirmative meaning by "the highest emotion confidence of the positive emotion" with the confidence obtained by multiplying the context confidence of the negative meaning by "1 - the highest emotion confidence of the positive emotion",
and selects the affirmative meaning or the negative meaning, whichever has the higher confidence.
Alternatively, according to the present invention, an apparatus for understanding contextual meaning from a user's utterance comprises:
context meaning analysis means for outputting a context confidence for each of an affirmative meaning and a negative meaning for the text of the user's utterance;
emotion confidence estimation means for estimating an emotion confidence for each of a positive emotion and a negative emotion from the user's biometric information acquired by a device during the user's utterance; and
utterance meaning complementing means which compares the confidence obtained by multiplying the context confidence of the affirmative meaning by the sum of "the emotion confidence of the positive emotion" and "1 - the emotion confidence of the negative emotion" with the confidence obtained by multiplying the context confidence of the negative meaning by the sum of "the emotion confidence of the negative emotion" and "1 - the emotion confidence of the positive emotion", and selects the affirmative meaning or the negative meaning, whichever has the higher confidence.

According to the present invention, in an utterance meaning analysis method of an apparatus that understands contextual meaning from a user's utterance, the apparatus executes:
a first step of outputting a context confidence for each of an affirmative meaning and a negative meaning for the text of the user's utterance;
a second step of estimating an emotion confidence for each of a positive emotion and a negative emotion from the user's biometric information acquired by a device during the user's utterance; and
a third step of, when the highest emotion confidence belongs to a negative emotion, comparing the confidence obtained by multiplying the context confidence of the affirmative meaning by "1 - the highest emotion confidence of the negative emotion" with the confidence obtained by multiplying the context confidence of the negative meaning by "the highest emotion confidence of the negative emotion",
or, when the highest emotion confidence belongs to a positive emotion, comparing the confidence obtained by multiplying the context confidence of the affirmative meaning by "the highest emotion confidence of the positive emotion" with the confidence obtained by multiplying the context confidence of the negative meaning by "1 - the highest emotion confidence of the positive emotion",
and selecting the affirmative meaning or the negative meaning, whichever has the higher confidence.
Alternatively, according to the present invention, in an utterance meaning analysis method of an apparatus that understands contextual meaning from a user's utterance, the apparatus executes:
a first step of outputting a context confidence for each of an affirmative meaning and a negative meaning for the text of the user's utterance;
a second step of estimating an emotion confidence for each of a positive emotion and a negative emotion from the user's biometric information acquired by a device during the user's utterance; and
a third step of comparing the confidence obtained by multiplying the context confidence of the affirmative meaning by the sum of "the emotion confidence of the positive emotion" and "1 - the emotion confidence of the negative emotion" with the confidence obtained by multiplying the context confidence of the negative meaning by the sum of "the emotion confidence of the negative emotion" and "1 - the emotion confidence of the positive emotion", and selecting the affirmative meaning or the negative meaning, whichever has the higher confidence.

According to the utterance meaning analysis program, apparatus, and method of the present invention, when contextual meaning is understood from a user's utterance, the accuracy of that understanding can be improved by using the user's emotion classification. In particular, in a dialogue system for a user, the progress of the dialogue scenario can be kept from breaking down because of errors in understanding affirmative or negative meaning.

A configuration diagram of the utterance meaning analysis program according to the present invention.
An explanatory diagram of analyzing meaning using basic positive / negative emotions.
An explanatory diagram of analyzing meaning using three or more emotions.
An explanatory diagram of analyzing meaning from the emotion of a facial expression photographed by a camera.
An explanatory diagram showing a convolutional neural network.
An explanatory diagram of analyzing meaning from emotion acquired by a biometric sensor.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a configuration diagram of the utterance meaning analysis program according to the present invention.

The utterance meaning analysis program of the present invention is executed by a computer mounted on a terminal and thereby functions to understand contextual meaning from a user's utterance. The terminal is, for example, a smartphone or a tablet terminal, and the program is applied to a dialogue system.

According to FIG. 1, the terminal 1 is equipped with a microphone, a speaker, a camera, and a biometric sensor as input and output devices of the user interface.
The camera can photograph the user's facial expression and acquires the face image as biometric information.
The biometric sensor can measure pulse, body temperature, blood pressure, or degree of perspiration, and acquires the user's biometric information.
The microphone acquires the user's uttered voice. The microphone may also acquire prosodic features of the user's voice (pitch, intensity, rhythm and tempo) as biometric information.
The speaker outputs dialogue sentences from the dialogue program to the user as speech.

The utterance meaning analysis program of the present invention is preferably applied to a spoken dialogue system in which a "character agent" is displayed on the display of the terminal operated by the user, and the user and the agent converse by voice. Because the user speaks toward the character displayed on the terminal's display, the user's biometric information is easy to acquire. In particular, the user's facial expression is easily captured by the front-facing camera, and the user's voice is easily captured by the microphone.

In accordance with the user interface, the terminal 1 includes an input voice conversion unit 101, an output voice conversion unit 102, and a biometric information input processing unit 103.

The input voice conversion unit 101 receives the voice signal acquired by the microphone of the input device, converts the voice signal into text, and outputs the text to the utterance meaning analysis program. The input voice conversion unit 101 detects the start and end of an utterance and extracts text from the voice signal in that utterance section.
The output voice conversion unit 102 receives a dialogue sentence for the user from the dialogue program, converts the dialogue sentence into a voice signal, and outputs the voice signal to the speaker of the output device.
The biometric information input processing unit 103 receives the user's biometric information from the camera, the biometric sensor, or the microphone, converts it into data that the program can process, and outputs the data to the utterance meaning analysis program.

The utterance meaning analysis program of the present invention in FIG. 1 causes a computer to function as a context meaning analysis unit 11, an emotion confidence estimation unit 12, an utterance meaning complementing unit 13, and an emotion classification analysis unit 14. The utterance meaning analysis program may be installed in the terminal in advance, or may be installed on a server and used from the terminal via a network.
The terminal 1 further implements a dialogue program 15 and functions as a dialogue system for the user.
Of course, the functional components of the utterance meaning analysis program may be distributed across the terminal and the server. The present invention does not restrict which device executes each functional component.
The processing flow of these functional components can also be understood as an utterance meaning analysis method of the apparatus.

FIG. 2 is an explanatory diagram of analyzing meaning using basic positive / negative emotions.

[Context meaning analysis unit 11]
The context meaning analysis unit 11 outputs a context confidence for each predetermined meaning classification for the text of the user's utterance. The context meaning analysis unit 11 uses existing technology that selects, within the text-based context, a meaning classification with a high context confidence.

Natural language processing is a technique by which a computer understands the meaning of text-based natural language spoken or written by a user. It is generally realized by morphological analysis, syntactic analysis, and contextual semantic analysis.
The purpose of contextual semantic analysis is to resolve ambiguity in word sense. As described above for the prior art, with phrases such as 「結構です」, 「いいです」, and 「大丈夫です」 it is difficult to judge whether the user's intention is Yes or No. In current natural language processing, word-sense ambiguity is generally resolved by using word co-occurrence relationships.

In the simplest case, the "meaning classification" distinguishes at least between an affirmative meaning and a negative meaning. Improving the accuracy of understanding affirmative / negative meaning is the most important factor in advancing a dialogue scenario.
According to FIG. 2, for the user's utterance 「結構です」, the context meaning analysis unit 11 derives the following context confidences from the context.
「〜結構です」: affirmative meaning Yes (52%), negative meaning No (48%)

For the text of the user's utterance period, the context confidence of each meaning classification is output to the utterance meaning complementing unit 13.

[Emotion confidence estimation unit 12]
The emotion confidence estimation unit 12 estimates an emotion confidence for each predetermined emotion classification from the user's biometric information acquired by a device during the user's utterance.
The biometric information is measured during the user's utterance period. According to the present invention, the biometric information may be the user's face image, a biometric measurement such as heart rate, or prosodic features of the voice.
However, the essence of the present invention is not limited to any particular kind of biometric information; it is sufficient that an emotion confidence for each emotion classification can be obtained from the user's biometric information.

The emotion confidence estimation unit 12 may receive, from the emotion classification analysis unit 14 described later, a plurality of time-series emotion classifications obtained during the user's utterance. In that case, it estimates a statistical value for each emotion classification (positive emotion or negative emotion) as the emotion confidence.

According to FIG. 2, the emotion confidence estimation unit 12 estimates the emotion confidences of the positive emotion and the negative emotion as follows.
Positive emotion Yes (35%), negative emotion No (80%)

The emotion confidence of each emotion classification in the user's utterance period is output to the utterance meaning complementing unit 13.

[Utterance meaning complementing unit 13]
The utterance meaning complementing unit 13 weights the context confidence of each meaning classification by the emotion confidence of the emotion classification corresponding to that meaning classification, and selects the meaning classification with the highest resulting confidence.

According to FIG. 2, the meaning classifications are affirmative meaning / negative meaning, and the emotion classifications are positive emotion / negative emotion.
In this case, specifically, the utterance meaning complementing unit 13 compares the confidence obtained by weighting the context confidence of the affirmative meaning by the emotion confidence of the positive emotion with the confidence obtained by weighting the context confidence of the negative meaning by the emotion confidence of the negative emotion, and selects the affirmative meaning or the negative meaning, whichever has the higher confidence.
There are, for example, the following two embodiments.
<First embodiment: complementing the utterance meaning based on the emotion classification with the highest emotion confidence>
<Second embodiment: complementing the utterance meaning based on the emotion confidences of all emotion classifications>

<First embodiment: complementing the utterance meaning based on the emotion classification with the highest emotion reliability>
(Case 1) When the emotion classification with the higher emotion reliability is the negative emotion:
the context reliability of the positive meaning is multiplied by “1 - emotion reliability of the negative emotion”, and
the context reliability of the negative meaning is multiplied by “emotion reliability of the negative emotion”.
(Case 2) When the emotion classification with the higher emotion reliability is the positive emotion:
the context reliability of the positive meaning is multiplied by “emotion reliability of the positive emotion”, and
the context reliability of the negative meaning is multiplied by “1 - emotion reliability of the positive emotion”.

According to FIG. 2, the emotion reliability of the negative emotion (80%) is higher than that of the positive emotion (35%). Case 1 therefore applies, and the calculation is as follows.
Context reliability of positive meaning Yes (52%) × (1 - emotion reliability of negative emotion (80%))
= reliability of positive meaning (10.4%)
Context reliability of negative meaning No (48%) × emotion reliability of negative emotion (80%)
= reliability of negative meaning (38.4%)
Reliability of positive meaning (10.4%) < reliability of negative meaning (38.4%)
In this case, the utterance “〜結構です” (“... is fine”) is complemented as having a negative meaning.
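As a minimal sketch of the Case 1 / Case 2 weighting described above, the following Python function applies it to the FIG. 2 figures. The function name `complement_first`, the dictionary keys, and the tie-breaking rule (a tie is treated as Case 1) are assumptions made for illustration; the source does not specify them.

```python
def complement_first(context, emotion):
    """Weight each context reliability by the dominant emotion reliability."""
    if emotion["negative"] >= emotion["positive"]:
        # Case 1: negative emotion dominates (a tie lands here by assumption).
        weights = {"positive": 1.0 - emotion["negative"],
                   "negative": emotion["negative"]}
    else:
        # Case 2: positive emotion dominates.
        weights = {"positive": emotion["positive"],
                   "negative": 1.0 - emotion["positive"]}
    scores = {m: context[m] * weights[m] for m in ("positive", "negative")}
    # Select the meaning whose weighted reliability is higher.
    return max(scores, key=scores.get), scores


# FIG. 2 values: context Yes 52% / No 48%, emotion positive 35% / negative 80%.
label, scores = complement_first({"positive": 0.52, "negative": 0.48},
                                 {"positive": 0.35, "negative": 0.80})
print(label, scores)
```

With the FIG. 2 inputs the sketch selects the negative meaning, in line with the 10.4% versus 38.4% comparison.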

<Second embodiment: complementing the utterance meaning based on the emotion reliabilities of all emotion classifications>
The context reliability of the positive meaning is multiplied by the sum of “emotion reliability of the positive emotion” and “1 - emotion reliability of the negative emotion”.
Likewise, the context reliability of the negative meaning is multiplied by the sum of “emotion reliability of the negative emotion” and “1 - emotion reliability of the positive emotion”.

According to FIG. 2, the calculation is as follows.
Context reliability of positive meaning Yes (52%) ×
{emotion reliability of positive emotion (35%) + (1 - emotion reliability of negative emotion (80%))}
= reliability of positive meaning (28.6%)
Context reliability of negative meaning No (48%) ×
{emotion reliability of negative emotion (80%) + (1 - emotion reliability of positive emotion (35%))}
= reliability of negative meaning (69.6%)
Reliability of positive meaning (28.6%) < reliability of negative meaning (69.6%)
In this case, the utterance “〜結構です” (“... is fine”) is complemented as having a negative meaning.
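The second embodiment can be sketched in the same way. Here both emotion reliabilities contribute to each weight, so no case distinction is needed. The name `complement_second` and the dictionary keys are illustrative assumptions, not part of the source.

```python
def complement_second(context, emotion):
    """Weight each context reliability using both emotion reliabilities."""
    weights = {
        # Supporting emotion plus the complement of the opposing emotion.
        "positive": emotion["positive"] + (1.0 - emotion["negative"]),
        "negative": emotion["negative"] + (1.0 - emotion["positive"]),
    }
    scores = {m: context[m] * weights[m] for m in ("positive", "negative")}
    return max(scores, key=scores.get), scores


# FIG. 2 values: context Yes 52% / No 48%, emotion positive 35% / negative 80%.
label, scores = complement_second({"positive": 0.52, "negative": 0.48},
                                  {"positive": 0.35, "negative": 0.80})
print(label, scores)
```

The weights can exceed 1, so the resulting values are only meaningful relative to each other, which is all the comparison requires.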

FIG. 3 is an explanatory diagram of analyzing meaning using three or more emotions.

According to FIG. 3, the “emotion classifications” may be based on, for example, the Paul Ekman emotion classification model.
Specific emotion classifications are, for example, “Happy”, “Surprise”, “Angry”, “Disgust”, “Fear”, and “Sad”. “Neutral” may of course also be included. Emotions such as “amusement”, “contempt”, “contentment”, “embarrassment”, “excitement”, “guilt”, “pride in achievement”, “relief”, “satisfaction”, and “shame”, as well as combinations of these emotions (for example, anger + disgust, or happiness + contentment), may further be included.

The positive emotion and the negative emotion can also be mapped onto the Paul Ekman emotion classification model, for example as follows.
Positive emotion: “Happy”, “Surprise”
Negative emotion: “Angry”, “Disgust”, “Fear”, “Sad”

According to FIG. 3, the emotion reliability of each emotion classification (the emotion reliability vector (Confidence)) is estimated as follows.
[Emotion classification] [Emotion reliability]
(Positive emotion) Happy: 10%
(Positive emotion) Surprise: 5%
(Negative emotion) Angry: 70%
(Negative emotion) Disgust: 30%
(Negative emotion) Fear: 40%
(Negative emotion) Sad: 30%
Neutral: 5%
Note that the emotion reliability of each classification is a value calculated directly from the similarity to the teacher image data, but it may be adjusted by a probability distribution model so that the sum is 1.
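The source does not say which probability distribution model is used for this adjustment. One simple possibility, shown here purely as an assumption, is to divide each value by the total so that the vector sums to 1. The function name `normalize` and the dictionary layout are illustrative.

```python
def normalize(confidence):
    """Rescale raw emotion reliabilities so that they sum to 1."""
    total = sum(confidence.values())
    if total <= 0:
        raise ValueError("confidence values must sum to a positive number")
    return {label: value / total for label, value in confidence.items()}


# Raw FIG. 3 values (they sum to 190%, not 100%).
fig3 = {"Happy": 0.10, "Surprise": 0.05, "Angry": 0.70, "Disgust": 0.30,
        "Fear": 0.40, "Sad": 0.30, "Neutral": 0.05}
normalized = normalize(fig3)
print(normalized)
```

Dividing by the total preserves the ranking of the classes, so “Angry” remains the highest entry after the adjustment.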

<First embodiment: complementing the utterance meaning based on the emotion classification with the highest emotion reliability>
(Case 1) When the emotion classification with the highest emotion reliability is a negative emotion:
the context reliability of the positive meaning is multiplied by “1 - the highest emotion reliability among the negative emotions”, and
the context reliability of the negative meaning is multiplied by “the highest emotion reliability among the negative emotions”.
(Case 2) When the emotion classification with the highest emotion reliability is a positive emotion:
the context reliability of the positive meaning is multiplied by “the highest emotion reliability among the positive emotions”, and
the context reliability of the negative meaning is multiplied by “1 - the highest emotion reliability among the positive emotions”.

According to FIG. 3, the emotion classification “Angry” has the highest emotion reliability (70%). Because this classification is a negative emotion, Case 1 applies, and the calculation is as follows.
Context reliability of positive meaning Yes (52%) × (1 - emotion reliability of negative emotion “Angry” (70%))
= reliability of positive meaning (15.6%)
Context reliability of negative meaning No (48%) × emotion reliability of negative emotion “Angry” (70%)
= reliability of negative meaning (33.6%)
Reliability of positive meaning (15.6%) < reliability of negative meaning (33.6%)
In this case, the utterance “〜結構です” (“... is fine”) is complemented as having a negative meaning.
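With several emotion classes, the first embodiment first finds the single class with the highest emotion reliability and then uses its polarity to pick the case. The sketch below assumes the positive / negative grouping listed for the Paul Ekman model and leaves “Neutral” out of the search; the source does not say how a dominant “Neutral” would be handled. All names are illustrative.

```python
# Assumed polarity of each emotion class, following the grouping given above.
POLARITY = {"Happy": "positive", "Surprise": "positive",
            "Angry": "negative", "Disgust": "negative",
            "Fear": "negative", "Sad": "negative"}


def complement_by_top_emotion(context, confidence):
    """Weight context reliabilities by the single most reliable emotion class."""
    rated = {e: c for e, c in confidence.items() if e in POLARITY}
    top = max(rated, key=rated.get)      # e.g. "Angry" for the FIG. 3 values
    c = rated[top]
    if POLARITY[top] == "negative":      # Case 1
        weights = {"positive": 1.0 - c, "negative": c}
    else:                                # Case 2
        weights = {"positive": c, "negative": 1.0 - c}
    scores = {m: context[m] * weights[m] for m in ("positive", "negative")}
    return max(scores, key=scores.get), scores


fig3 = {"Happy": 0.10, "Surprise": 0.05, "Angry": 0.70, "Disgust": 0.30,
        "Fear": 0.40, "Sad": 0.30, "Neutral": 0.05}
label, scores = complement_by_top_emotion({"positive": 0.52, "negative": 0.48}, fig3)
print(label, scores)
```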

<Second embodiment: complementing the utterance meaning based on the emotion reliabilities of all emotion classifications>
The context reliability of the positive meaning is multiplied by the sum of “emotion reliability of the positive emotion” and “1 - emotion reliability of the negative emotion”.
The context reliability of the negative meaning is multiplied by the sum of “emotion reliability of the negative emotion” and “1 - emotion reliability of the positive emotion”.
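Written compactly, and extended to several emotion classes in the way the FIG. 3 calculation is carried out, the rule can be expressed as below. The symbols are introduced here for illustration and do not appear in the source: $C_{\mathrm{pos}}$ and $C_{\mathrm{neg}}$ are the context reliabilities, $c_e$ is the emotion reliability of class $e$, and $P$ and $N$ are the sets of positive and negative emotion classes.

```latex
\begin{aligned}
R_{\mathrm{pos}} &= C_{\mathrm{pos}} \left( \sum_{e \in P} c_e + \sum_{e \in N} (1 - c_e) \right) \\
R_{\mathrm{neg}} &= C_{\mathrm{neg}} \left( \sum_{e \in N} c_e + \sum_{e \in P} (1 - c_e) \right)
\end{aligned}
```

The meaning with the larger of $R_{\mathrm{pos}}$ and $R_{\mathrm{neg}}$ is selected.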

According to FIG. 3, the calculation is as follows.
Context reliability of positive meaning Yes (52%) ×
{emotion reliability of positive emotion “Happy” (10%)
+ emotion reliability of positive emotion “Surprise” (5%)
+ (1 - emotion reliability of negative emotion “Angry” (70%))
+ (1 - emotion reliability of negative emotion “Disgust” (30%))
+ (1 - emotion reliability of negative emotion “Fear” (40%))
+ (1 - emotion reliability of negative emotion “Sad” (30%))}
= reliability of positive meaning (127.4%)
Context reliability of negative meaning No (48%) ×
{emotion reliability of negative emotion “Angry” (70%)
+ emotion reliability of negative emotion “Disgust” (30%)
+ emotion reliability of negative emotion “Fear” (40%)
+ emotion reliability of negative emotion “Sad” (30%)
+ (1 - emotion reliability of positive emotion “Happy” (10%))
+ (1 - emotion reliability of positive emotion “Surprise” (5%))}
= reliability of negative meaning (170.4%)
Reliability of positive meaning (127.4%) < reliability of negative meaning (170.4%)
In this case, the utterance “〜結構です” (“... is fine”) is complemented as having a negative meaning.
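The multi-class form of the second embodiment can be sketched as follows. Each class adds its own reliability to the weight of the meaning it supports and the complement of its reliability to the opposite weight. The polarity table and all names are assumptions for illustration, and “Neutral” is skipped because the FIG. 3 calculation does not use it.

```python
# Assumed polarity of each emotion class, following the grouping given above.
POLARITY = {"Happy": "positive", "Surprise": "positive",
            "Angry": "negative", "Disgust": "negative",
            "Fear": "negative", "Sad": "negative"}


def complement_by_all_emotions(context, confidence):
    """Weight context reliabilities using every polarised emotion class."""
    weights = {"positive": 0.0, "negative": 0.0}
    for emotion, c in confidence.items():
        polarity = POLARITY.get(emotion)
        if polarity is None:
            continue  # e.g. "Neutral": not used in the FIG. 3 calculation
        opposite = "negative" if polarity == "positive" else "positive"
        weights[polarity] += c          # supports its own meaning
        weights[opposite] += 1.0 - c    # complement goes to the other meaning
    scores = {m: context[m] * weights[m] for m in ("positive", "negative")}
    return max(scores, key=scores.get), scores


fig3 = {"Happy": 0.10, "Surprise": 0.05, "Angry": 0.70, "Disgust": 0.30,
        "Fear": 0.40, "Sad": 0.30, "Neutral": 0.05}
label, scores = complement_by_all_emotions({"positive": 0.52, "negative": 0.48}, fig3)
print(label, scores)
```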

The utterance meaning complementation unit 13 then outputs the semantic classification with the higher reliability to the dialogue program 15. The dialogue program 15 can thereby use a highly accurate positive meaning / negative meaning to advance the dialogue scenario.

[Emotion classification analysis unit 14]
The emotion classification analysis unit 14 analyzes the emotion classification at every predetermined timing from the user's biometric information. The emotion classification analysis unit 14 is not an essential part of the present invention (the context semantic analysis unit 11, the emotion reliability estimation unit 12, and the utterance meaning complementation unit 13), but it is needed so that the emotion reliability estimation unit 12 can estimate the emotion reliability from the user's biometric information.

The emotion classification analysis unit 14 analyzes, at every predetermined timing, an emotion classification that the emotion reliability estimation unit 12 can process, broadly from the following two types of biometric information.
<Analysis of emotion classification from a face image captured by a camera>
<Analysis of emotion classification from biometric measurement values acquired by a biometric sensor>

<Analysis of emotion classification from a face image captured by a camera>
FIG. 4 is an explanatory diagram of analyzing meaning from the emotion of a facial expression captured by a camera.

According to FIG. 4, the input speech conversion unit 101 converts the user's utterance “残業があるので、呑み会なんて結構です” (“I have overtime, so as for the drinking party, I'm fine”) into text, and the text is input to the context semantic analysis unit 11.
Meanwhile, during the utterance period, the user's facial expression is photographed by a camera, and the image frames are input to the biometric information input processing unit 103. The biometric information input processing unit 103 extracts an image frame at every predetermined timing (for example, every 100 ms) and inputs the time-series image frames to the emotion classification analysis unit 14.
Although the biometric information input processing unit 103 extracts image frames at every predetermined timing, it may also output an image frame outside the predetermined timing when the feature amount of the image frame changes by a predetermined threshold or more. In that case, it can be judged that the user's facial expression has also changed.
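The two triggers described above, a fixed interval and a sufficiently large feature change, can be sketched as follows. Representing each frame as a `(timestamp_ms, feature)` pair with a scalar feature, the threshold value, and measuring the change against the most recently output frame are all assumptions for illustration; the source does not define the feature amount or the reference frame.

```python
def select_frames(frames, interval_ms=100, change_threshold=0.5):
    """Return timestamps of frames passed on for emotion classification."""
    selected = []
    next_due = None       # next scheduled extraction time
    last_feature = None   # feature value of the most recently output frame
    for timestamp, feature in frames:
        if next_due is None:
            next_due = timestamp  # the first frame starts the schedule
        due = timestamp >= next_due
        changed = (last_feature is not None
                   and abs(feature - last_feature) >= change_threshold)
        if due or changed:
            selected.append(timestamp)
            last_feature = feature
        while next_due <= timestamp:
            next_due += interval_ms  # keep the fixed schedule moving forward
    return selected


# Hypothetical frames: the expression changes sharply at 66 ms.
frames = [(0, 0.0), (33, 0.1), (66, 0.9), (100, 0.9), (133, 0.95), (200, 1.0)]
picked = select_frames(frames)
print(picked)
```

In this example the frame at 66 ms is output between the scheduled extractions at 0 ms and 100 ms because its feature value differs from the previous output by more than the threshold.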

The emotion classification analysis unit 14 analyzes the emotion classification at every predetermined timing from the user's face image by a convolutional neural network (CNN / ConvNet, hereinafter “CNN”). Specifically, the Caffe deep learning framework can be used as the CNN engine (see, for example, Non-Patent Document 1). For each captured image frame of the user's facial expression, the emotion classification analysis unit 14 outputs the emotion classification estimated by the CNN.

FIG. 5 is an explanatory diagram showing a convolutional neural network.

A CNN is a type of feedforward artificial neural network and is based on knowledge of the structure of the brain's visual cortex. Basically, it has a structure that alternates convolution layers, which extract local features of the image, with pooling layers (subsampling layers), which aggregate the features of each local region.
Each layer of a CNN contains a plurality of neurons, arranged so that individual neurons correspond to the visual cortex. The basic function of each neuron consists of the input and output of signals. However, when signals are transmitted between the neurons of different layers, an input signal is not output as is; instead, a connection weight is set for each input, and a signal is output to the neurons of the next layer when the sum of the weighted inputs exceeds the threshold set for each neuron. The connection weights between these neurons are calculated in advance from learning data. This makes it possible to estimate an output value by inputting real-time data.
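The neuron behavior described above, a weighted sum compared against a per-neuron threshold, can be illustrated with a few lines of Python. This is a deliberately simplified sketch: emitting 1.0 or 0.0 as the output signal is an assumption, and it is not a description of how Caffe or any particular CNN implements its layers.

```python
def neuron_output(inputs, weights, threshold):
    """Fire (1.0) when the weighted input sum exceeds the threshold, else 0.0."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1.0 if weighted_sum > threshold else 0.0


# Hypothetical values: weighted sum = 1.0 * 0.6 + 0.5 * 0.4 = 0.8
fired = neuron_output([1.0, 0.5], [0.6, 0.4], threshold=0.7)
silent = neuron_output([1.0, 0.5], [0.6, 0.4], threshold=0.9)
print(fired, silent)
```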

The CNN of the emotion classification analysis unit 14 has learning parameters built in advance from “teacher data”, in which emotion classifications (labels) are assigned to a large number of images showing facial expressions. Specifically, for example, about 30,000 images are prepared as teacher data. The images of each emotion classification are learned from the teacher data, and a learning model within the CNN (the connection weights of each neuron, network configuration parameters, and so on) is created.

As another embodiment, both for the teacher data used in learning and in the biometric information input processing unit 103 for face images, a person's face region may be tracked in the image frame and separated (cropped). To raise recognition accuracy across varying conditions of the usage environment such as illuminance, the image frame is normalized by resizing its resolution and by the arithmetic mean of its brightness. It is also preferable, for example, that the biometric information input processing unit 103 outputs the normalized face image as an image frame only at fixed intervals or when the change in the facial feature amount exceeds a threshold, and that the facial expression is analyzed for that image frame.

As yet another embodiment, it is also preferable that the CNN of the emotion classification analysis unit 14 has a plurality of sets of learning parameters built to match user characteristics such as gender, age, and nationality. The emotion classification can then be judged from a facial expression in accordance with the user's profile.

As a further embodiment, there is also a technique that automatically detects only a specific facial expression (for example, a “smile”) from a face image. This technique is well known as the automatic shutter function of cameras. One example is “Smile Scan (registered trademark)” (OMRON), which is based on face image analysis.

<Analysis of emotion classification from biometric measurement values acquired by a biometric sensor>
FIG. 6 is an explanatory diagram of analyzing meaning from emotion acquired by a biometric sensor.

According to FIG. 6, as in FIG. 4, the user utters “残業があるので、呑み会なんて結構です” (“I have overtime, so as for the drinking party, I'm fine”).
Meanwhile, during the utterance period, the biometric measurement values acquired by the biometric sensor are input to the biometric information input processing unit 103. The biometric information input processing unit 103 inputs the biometric information to the emotion classification analysis unit 14 at every predetermined timing (for example, every 100 ms).

The biometric measurement values acquired by a biometric sensor include pulse, body temperature, blood pressure, and degree of perspiration, as well as prosodic features of the voice (pitch, intensity, rhythm and tempo). The user's emotion classification can be estimated from fluctuations in such biometric measurement values.
As an existing technique, there is a technique for judging a user's emotion from biometric information (pulse, brain waves, pupils, line of sight) (see, for example, Patent Document 2). As a simple example, if the pulse is higher than a predetermined value, the negative emotion can be judged to be high.
There is also a face information detection device that uses skin potential sensors to extract eye movements and facial expressions from changes in the scalp potential at and around the auricle (see, for example, Patent Document 3). This technique can detect expressions of joy, anger, sorrow, and pleasure as well as eye movements.
Furthermore, systems for measuring the facial expression “laughter” include, for example, the “爆笑計” laughter meter (Osaka Electro-Communication University) and the “アッハ・メーター (registered trademark)” (Project aH), both of which measure at the throat (sound), and the “diaphragm-type laughter measurement system” (Kansai University, Project aH), which simultaneously measures three signals: the throat (sound), facial electromyography, and diaphragm electromyography.

According to FIGS. 4 and 6 described above, the emotion reliability estimation unit 12 receives an emotion classification from the emotion classification analysis unit 14 at every predetermined timing. Most simply, the emotion reliability estimation unit 12 may tally the emotion classifications over the user's utterance period and use their ratios as the statistical values. The statistical value may also be the “mean” for each emotion classification. In still other embodiments, it may be the “median”, “mode”, “maximum”, or an “identification result by machine learning”. The emotion reliability estimation unit 12 can thereby output the emotion reliabilities, as statistical values for each emotion classification, to the utterance meaning complementation unit 13 in the form of a vector.
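Two of the aggregation options can be sketched with the Python standard library. `reliability_by_ratio` tallies one label per timing and returns each label's share, while `reliability_by_statistic` assumes that a score per class is available at every timing and applies a chosen statistic such as the mean, median, or maximum. Both function names and the input layouts are assumptions for illustration.

```python
import statistics
from collections import Counter


def reliability_by_ratio(labels):
    """Share of each emotion label among the time-series classifications."""
    if not labels:
        raise ValueError("at least one classification is required")
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


def reliability_by_statistic(frame_scores, statistic=statistics.mean):
    """Apply a statistic (mean, median, max, ...) per class across timings."""
    if not frame_scores:
        raise ValueError("at least one set of scores is required")
    classes = frame_scores[0].keys()
    return {c: statistic([scores[c] for scores in frame_scores]) for c in classes}


# Hypothetical utterance sampled at five timings.
labels = ["negative", "negative", "positive", "negative", "negative"]
ratios = reliability_by_ratio(labels)

# Hypothetical per-timing scores, aggregated with the maximum.
frame_scores = [{"positive": 0.20, "negative": 0.70},
                {"positive": 0.35, "negative": 0.80}]
peaks = reliability_by_statistic(frame_scores, statistic=max)
print(ratios, peaks)
```

Passing `statistics.median` or `statistics.mode` as `statistic` covers the other simple options named in the text; a machine-learning classifier would replace these functions entirely.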

[Dialogue program 15]
The dialogue program 15 stores a plurality of dialogue scenarios, each associated with sentences with which to respond to the user, and selects the next dialogue scenario according to the user's answer. A “dialogue scenario” is a tree of dialogue nodes, each consisting of a question and answers.
According to the present invention, the next scenario can be advanced interactively with the user in accordance with the classification with the highest reliability output from the utterance meaning complementation unit 13.
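A dialogue scenario tree of this kind can be sketched as nodes whose branches are keyed by the positive or negative meaning selected for the user's answer. The `DialogueNode` class, the `advance` helper, and the example prompts are hypothetical and only illustrate the tree structure.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DialogueNode:
    """One node of a dialogue scenario: a prompt and its follow-up branches."""
    prompt: str
    branches: Dict[str, "DialogueNode"] = field(default_factory=dict)


def advance(node: DialogueNode, meaning: str) -> Optional[DialogueNode]:
    """Return the next node for the selected meaning, or None at a leaf."""
    return node.branches.get(meaning)


# Hypothetical scenario: an invitation with one follow-up per answer.
root = DialogueNode(
    prompt="Would you like to join the drinking party?",
    branches={
        "positive": DialogueNode("Great, I will add you to the list."),
        "negative": DialogueNode("Understood, maybe next time."),
    },
)

next_node = advance(root, "negative")
print(next_node.prompt)
```

If “〜結構です” were misread as a positive meaning, the scenario would follow the wrong branch, which is the failure the complementation step is meant to prevent.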

As described in detail above, according to the utterance meaning analysis program, apparatus, and method of the present invention, when the contextual meaning is understood from the user's utterance, the accuracy of that understanding can be raised by the user's emotion classification. In particular, in a dialogue system for users, the progress of a dialogue scenario can be kept from breaking down because of an error in understanding the positive meaning / negative meaning.

For the various embodiments of the present invention described above, various changes, modifications, and omissions within the scope of the technical idea and viewpoint of the present invention can be easily made by those skilled in the art. The above description is merely an example and is not intended to be restrictive in any way. The present invention is limited only as defined in the claims and their equivalents.

DESCRIPTION OF SYMBOLS
1 Terminal, device
101 Input speech conversion unit
102 Output signal conversion unit
103 Biometric information input processing unit
11 Context semantic analysis unit
12 Emotion reliability estimation unit
13 Utterance meaning complementation unit
14 Emotion classification analysis unit
15 Dialogue program

Claims

An utterance meaning analysis program that causes a computer to function so as to understand contextual meaning from a user's utterance, the program causing the computer to function as:
context semantic analysis means for outputting a context reliability for each of a positive meaning and a negative meaning of the text uttered by the user;
emotion reliability estimation means for estimating an emotion reliability for each of a positive emotion and a negative emotion from the user's biometric information acquired by a device during the user's utterance; and
utterance meaning complementing means for,
when the emotion with the highest emotion reliability is a negative emotion, comparing the reliability obtained by multiplying the context reliability of the positive meaning by “1 - the highest emotion reliability of the negative emotion” with the reliability obtained by multiplying the context reliability of the negative meaning by “the highest emotion reliability of the negative emotion”,
or, when the emotion with the highest emotion reliability is a positive emotion, comparing the reliability obtained by multiplying the context reliability of the positive meaning by “the highest emotion reliability of the positive emotion” with the reliability obtained by multiplying the context reliability of the negative meaning by “1 - the highest emotion reliability of the positive emotion”,
and selecting whichever of the positive meaning or the negative meaning has the higher reliability.

An utterance meaning analysis program that causes a computer to function so as to understand contextual meaning from a user's utterance, the program causing the computer to function as:
context semantic analysis means for outputting a context reliability for each of a positive meaning and a negative meaning of the text uttered by the user;
emotion reliability estimation means for estimating an emotion reliability for each of a positive emotion and a negative emotion from the user's biometric information acquired by a device during the user's utterance; and
utterance meaning complementing means for comparing the reliability obtained by multiplying the context reliability of the positive meaning by the sum of “emotion reliability of the positive emotion” and “1 - emotion reliability of the negative emotion” with the reliability obtained by multiplying the context reliability of the negative meaning by the sum of “emotion reliability of the negative emotion” and “1 - emotion reliability of the positive emotion”, and selecting whichever of the positive meaning or the negative meaning has the higher reliability.

The utterance meaning analysis program according to claim 1 or 2, wherein the emotion reliability estimation means is based on the Paul Ekman emotion classification model, and the computer is caused to function so that the positive emotion is further classified into “happiness” and “surprise”, and the negative emotion is further classified into “anger”, “disgust”, “fear”, and “sadness”.

The utterance meaning analysis program according to any one of claims 1 to 3, further causing the computer to function as emotion classification analysis means for analyzing an emotion classification at every predetermined timing from the user's biometric information,
wherein the emotion reliability estimation means receives, from the emotion classification analysis means, a plurality of time-series emotion classifications during the user's utterance, and estimates a statistical value for each of the positive emotion and the negative emotion as the emotion reliability.

The utterance meaning analysis program according to claim 4, wherein, in the emotion reliability estimation means, the statistical value is a “mean”, “median”, “mode”, “maximum”, or “identification result by machine learning”.

The utterance meaning analysis program according to claim 4 or 5, wherein the device is a camera capable of photographing the user's face,
and the emotion classification analysis means analyzes the emotion classification at every predetermined timing from a face image, serving as the user's biometric information, acquired by the camera.

The utterance meaning analysis program according to claim 6, wherein the emotion classification analysis means analyzes the emotion classification, regardless of the predetermined timing, when a change greater than or equal to a predetermined value occurs in the feature amount of the user's face image.

The utterance meaning analysis program according to claim 6 or 7, wherein the emotion classification analysis means analyzes the emotion classification at every predetermined timing from the user's face image by a convolutional neural network (CNN / ConvNet),
and the convolutional neural network has learning parameters constructed in advance from teacher data in which emotion classifications are assigned to a large number of face images.

The utterance meaning analysis program according to claim 6 or 7, wherein the device is a sensor capable of measuring the user's pulse, body temperature, blood pressure, or degree of perspiration, or a microphone capable of measuring prosodic features (voice pitch, intensity, rhythm and tempo) from the user's voice,
and the emotion classification analysis means analyzes the emotion classification at every predetermined timing from the user's biometric information acquired by the sensor.

A dialogue program comprising the utterance meaning analysis program according to any one of claims 1 to 9,
the dialogue program causing a computer to function so as to interactively advance the next scenario with the user in accordance with the positive or negative emotion output from the utterance meaning complementing means.

A device that understands contextual meaning from a user's utterance, the device comprising:
context semantic analysis means for outputting a context reliability for each of a positive meaning and a negative meaning of the text uttered by the user;
emotion reliability estimation means for estimating an emotion reliability for each of a positive emotion and a negative emotion from the user's biometric information acquired by a device during the user's utterance; and
utterance meaning complementing means for, when the emotion with the highest emotion reliability is a negative emotion, comparing the reliability obtained by multiplying the context reliability of the positive meaning by “1 - the highest emotion reliability of the negative emotion” with the reliability obtained by multiplying the context reliability of the negative meaning by “the highest emotion reliability of the negative emotion”,
or, when the emotion with the highest emotion reliability is a positive emotion, comparing the reliability obtained by multiplying the context reliability of the positive meaning by “the highest emotion reliability of the positive emotion” with the reliability obtained by multiplying the context reliability of the negative meaning by “1 - the highest emotion reliability of the positive emotion”,
and selecting whichever of the positive meaning or the negative meaning has the higher reliability.

A device that understands context meaning from a user's utterance, comprising:
context semantic analysis means for outputting a context reliability for each of a positive meaning and a negative meaning of the text uttered by the user;
emotion reliability estimation means for estimating, from the user's biometric information acquired by the device during the user's utterance, the emotion reliability of each positive emotion or negative emotion; and
utterance meaning complementing means for comparing the reliability obtained by multiplying the context reliability of the positive meaning by the sum of "the emotion reliability of the positive emotion" and "1 - the emotion reliability of the negative emotion" with the reliability obtained by multiplying the context reliability of the negative meaning by the sum of "the emotion reliability of the negative emotion" and "1 - the emotion reliability of the positive emotion", and selecting the positive meaning or the negative meaning having the higher reliability.
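
The sum-based comparison recited above can likewise be sketched in Python for illustration. The function and argument names and the tie-breaking toward the positive meaning are assumptions of this sketch; the claim does not specify how a tie is resolved.

```python
def select_meaning_by_emotion_sum(c_pos: float, c_neg: float,
                                  e_pos: float, e_neg: float) -> str:
    """Pick "positive" or "negative" using both emotion reliabilities.

    c_pos, c_neg: context reliabilities of the positive / negative meaning.
    e_pos, e_neg: emotion reliabilities of the positive / negative emotion
                  (assumed to lie in [0, 1]).
    """
    # Weight for the positive meaning: support from the positive emotion
    # plus the absence of the negative emotion.
    score_pos = c_pos * (e_pos + (1.0 - e_neg))
    # Weight for the negative meaning: support from the negative emotion
    # plus the absence of the positive emotion.
    score_neg = c_neg * (e_neg + (1.0 - e_pos))
    # Tie-breaking toward "positive" is an assumption of this sketch.
    return "positive" if score_pos >= score_neg else "negative"


if __name__ == "__main__":
    print(select_meaning_by_emotion_sum(0.6, 0.4, e_pos=0.3, e_neg=0.7))
```

With these inputs the weighted reliabilities are roughly 0.36 for the positive meaning and 0.56 for the negative meaning, so the negative meaning is selected. Unlike the rule based on the single highest emotion reliability, this variant needs no case split because both emotion reliabilities contribute to both weights.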

An utterance meaning analysis method for a device that understands context meaning from a user's utterance, wherein
the device executes:
a first step of outputting a context reliability for each of a positive meaning and a negative meaning of the text uttered by the user;
a second step of estimating, from the user's biometric information acquired by the device during the user's utterance, the emotion reliability of each positive emotion or negative emotion; and
a third step of, when the highest emotion reliability belongs to a negative emotion, comparing the reliability obtained by multiplying the context reliability of the positive meaning by "1 - the highest emotion reliability of the negative emotion" with the reliability obtained by multiplying the context reliability of the negative meaning by "the highest emotion reliability of the negative emotion",
or, when the highest emotion reliability belongs to a positive emotion, comparing the reliability obtained by multiplying the context reliability of the positive meaning by "the highest emotion reliability of the positive emotion" with the reliability obtained by multiplying the context reliability of the negative meaning by "1 - the highest emotion reliability of the positive emotion",
and selecting the positive meaning or the negative meaning having the higher reliability.

An utterance meaning analysis method for a device that understands context meaning from a user's utterance, wherein
the device executes:
a first step of outputting a context reliability for each of a positive meaning and a negative meaning of the text uttered by the user;
a second step of estimating, from the user's biometric information acquired by the device during the user's utterance, the emotion reliability of each positive emotion or negative emotion; and
a third step of comparing the reliability obtained by multiplying the context reliability of the positive meaning by the sum of "the emotion reliability of the positive emotion" and "1 - the emotion reliability of the negative emotion" with the reliability obtained by multiplying the context reliability of the negative meaning by the sum of "the emotion reliability of the negative emotion" and "1 - the emotion reliability of the positive emotion", and selecting the positive meaning or the negative meaning having the higher reliability.