JP2017156688A

JP2017156688A - Conversation evaluation device and program

Info

Publication number: JP2017156688A
Application number: JP2016042271A
Authority: JP
Inventors: Hideki Sakanashi (阪梨 英樹); Hiroshi Kayama (嘉山 啓)
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2016-03-04
Filing date: 2016-03-04
Publication date: 2017-09-07
Anticipated expiration: 2036-03-04
Also published as: JP6746963B2

Abstract

PROBLEM TO BE SOLVED: To provide a conversation evaluation device that objectively evaluates voice conversations. SOLUTION: A conversation evaluation device 100 includes: a sound collector 22 that generates a sound signal X1 representing a voice V1 uttered by a user U1; a sound collector 24 that generates a sound signal X2 representing a voice V2 uttered by a user U2; a feature acquisition unit 32 that acquires feature amounts of the voices forming a conversation; an information generation unit 34 that generates information R related to the conversation, other than the feature amounts; and a conversation evaluation unit 36 that evaluates the conversation based on the feature amounts and the related information R. SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique for evaluating voice conversations.

Techniques for evaluating various matters by analyzing uttered speech have been proposed in the past. For example, Patent Document 1 discloses a technique for estimating a speaker's psychological or physiological state from the intervals between fundamental tones in the pitch sequence of uttered speech.

Patent Document 1: Japanese Patent No. 4495907

However, the technique of Patent Document 1 only estimates the state of a specific speaker. It cannot objectively evaluate, for example, a voice conversation between multiple speakers (such as the impression made by the voice of a response to an utterance). In view of these circumstances, an object of the present invention is to objectively evaluate voice conversations.

To solve the above problem, a conversation evaluation device according to a first aspect of the present invention includes three units:

- a feature acquisition unit that acquires feature amounts of the voices constituting a conversation;
- an information generation unit that generates related information about the conversation, of a different kind from the feature amounts;
- a conversation evaluation unit that evaluates the conversation according to the feature amounts and the related information.

In this aspect, the conversation can be evaluated objectively according to the feature amounts of the voices constituting it. Moreover, related information of a different kind is taken into account together with the feature amounts. The conversation can therefore be evaluated more appropriately than in a configuration that reflects only the feature amounts in the evaluation.

In a preferred aspect of the present invention, the feature acquisition unit acquires the feature amounts under conditions that depend on the related information. This gives an advantage: the feature amounts can be acquired more appropriately than in a configuration that does not use the related information to acquire them.

A conversation evaluation device according to a second aspect of the present invention includes three units:

- a feature acquisition unit that acquires feature amounts of the voices constituting a conversation;
- an information generation unit that generates related information about the conversation, of a different kind from the feature amounts;
- a conversation evaluation unit that evaluates the conversation according to the feature amounts.

The feature acquisition unit acquires the feature amounts under conditions that depend on the related information. In this aspect, the conversation can be evaluated objectively according to the feature amounts of the voices constituting it. In addition, because the acquisition conditions depend on the related information, the feature amounts can be acquired more appropriately than in a configuration that does not use the related information to acquire them.

In a preferred example of the conversation evaluation device according to each of the above aspects, the feature acquisition unit acquires, as feature amounts, the pitch of each of a first voice and a second voice constituting the conversation. The conversation evaluation unit then evaluates the conversation according to the pitch difference between the first voice and the second voice. This makes it possible to judge objectively whether the response voice makes a good or bad impression, in terms of how its pitch relates to the pitch of the uttered voice.

In each of the above aspects, the related information indicates, for example, at least one of the following:

- the temporal circumstances of the conversation;
- the history of past conversations between its speakers;
- the relationship between its speakers;
- the attributes of each speaker.

FIG. 1 is a block diagram of the conversation evaluation device of the first embodiment.
FIG. 2 is a flowchart of the conversation evaluation process.
FIG. 3 is a block diagram of the conversation evaluation device of the second embodiment.
FIG. 4 is a block diagram of the conversation evaluation device of the third embodiment.
FIG. 5 is a block diagram of the conversation evaluation device of the fourth embodiment.

<First Embodiment>
FIG. 1 is a configuration diagram of a conversation evaluation device 100 according to the first embodiment of the present invention. The conversation evaluation device 100 of the first embodiment is an analysis device that evaluates a conversation between a user U1 and a user U2. It is suitable, for example, for training people to converse in a way that makes a good impression. The conversation consists of a voice V1 (an example of the first voice) uttered by user U1 and a voice V2 (an example of the second voice) uttered by user U2.

The first embodiment assumes the following roles:

- User U1 utters a voice V1, such as a question or a remark addressed to user U2.
- User U2 utters a voice V2 in response, such as an answer to user U1's question or a reply to the remark.

The voice V2 uttered by user U2 is, for example, an interjection. Examples of interjections include:

- back-channel responses such as "un" and "ee" ("uh-huh", "yeah");
- hesitations (a stalled response) such as "eto" and "ano" ("um", "er");
- answers affirming or denying a question, such as "hai" and "iie" ("yes", "no");
- words expressing the speaker's emotion, such as "aa" and "oo" ("ah", "oh");
- words asking the speaker to repeat the utterance, such as "e?" and "nani?" ("huh?", "what?").

As illustrated in FIG. 1, the conversation evaluation device 100 of the first embodiment is realized by a computer system. The system includes a control device 12, a storage device 14, a display device 16, an input device 18, a sound collection device 22, and a sound collection device 24. The conversation evaluation device 100 can be realized, for example, by a portable information processing device such as a mobile phone or smartphone, or by an information processing device such as a personal computer. It can also be realized by a plurality of separately configured devices.

The sound collection devices 22 and 24 are audio input devices that pick up ambient sound. The sound collection device 22 generates an audio signal X1 representing the voice V1 uttered by user U1. The sound collection device 24 generates an audio signal X2 representing the voice V2 uttered by user U2. For convenience, the A/D converters that convert the audio signals X1 and X2 from analog to digital are not shown.

The control device 12 includes a processing circuit such as a CPU (Central Processing Unit) and controls all elements of the conversation evaluation device 100. Specifically, the control device 12 evaluates the conversation between users U1 and U2 by analyzing the audio signal X1 generated by the sound collection device 22 and the audio signal X2 generated by the sound collection device 24. The control device 12 of the first embodiment calculates an index S (hereinafter, the "evaluation value"). S measures how good an impression user U2's response to user U1's utterance makes.

The display device 16 (for example, a liquid crystal display panel) displays various images under the control of the control device 12. For example, it displays the evaluation result (evaluation value S) of the conversation between users U1 and U2. The input device 18 receives instructions to the conversation evaluation device 100 from a user U (for example, user U1 or user U2). Suitable input devices 18 include a plurality of controls operated by the user U (U1, U2) and a touch panel that detects contact with the display surface of the display device 16.

The storage device 14 stores the program executed by the control device 12 and various data used by the control device 12. Any known recording medium may be employed as the storage device 14, such as a semiconductor or magnetic recording medium, or a combination of several recording media. The control device 12 of the first embodiment executes the program stored in the storage device 14. In doing so, it implements a plurality of functions for evaluating the conversation between users U1 and U2: a feature acquisition unit 32, an information generation unit 34, and a conversation evaluation unit 36. The functions of the control device 12 may also be distributed across multiple devices, or some or all of them may be implemented by dedicated electronic circuitry.

The feature acquisition unit 32 acquires a feature amount of user U1's voice V1 and a feature amount of user U2's voice V2. In the first embodiment, it extracts the feature amount of voice V1 by analyzing the audio signal X1 and the feature amount of voice V2 by analyzing the audio signal X2. Specifically, a prosodic feature amount is extracted for each of the voices V1 and V2. Prosody refers to linguistic and phonetic characteristics that a listener can perceive but that cannot be captured from the ordinary written form of the language alone.

The feature acquisition unit 32 of the first embodiment extracts two feature amounts: the pitch P1 of user U1's voice V1 and the pitch P2 of user U2's voice V2. For example, it extracts the average pitch P1 within an utterance section of the audio signal X1 and the average pitch P2 within an utterance section of the audio signal X2. An utterance section is a section during which an utterance continues, from the start point to the end point of a continuous utterance. Any known speech analysis technique may be used to extract pitches P1 and P2.
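The averaging described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the method of this specification. It assumes that some known pitch analysis has already produced a frame-wise pitch track (f0 in Hz, 0 for unvoiced frames) and frame energies. It locates the utterance section with a simple energy threshold. The function names and the threshold value are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def utterance_section(energies: Sequence[float],
                      threshold: float) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices spanning the first to the last frame
    whose energy exceeds `threshold` (end is exclusive), or None if none do.

    Hypothetical energy-threshold detector; a real system would more likely
    use a dedicated voice-activity detector.
    """
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0], active[-1] + 1


def mean_pitch(f0: Sequence[float], section: Tuple[int, int]) -> Optional[float]:
    """Average f0 (Hz) over the voiced frames (f0 > 0) inside `section`."""
    start, end = section
    voiced = [p for p in f0[start:end] if p > 0]
    if not voiced:
        return None
    return sum(voiced) / len(voiced)


# Example: a short utterance surrounded by silence.
energies = [0.01, 0.02, 0.5, 0.6, 0.55, 0.4, 0.02]
f0 = [0.0, 0.0, 200.0, 210.0, 0.0, 220.0, 0.0]
section = utterance_section(energies, threshold=0.1)  # frames 2 to 5 -> (2, 6)
if section is not None:
    p1 = mean_pitch(f0, section)                       # (200 + 210 + 220) / 3 = 210.0
```

Frames that are energetic but unvoiced (f0 = 0) stay inside the section and are simply skipped when averaging.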

Listeners show a tendency in how they react to pitch. When user U2 responds with a voice V2 whose pitch P2 has a specific relationship to the pitch P1 of user U1's voice V1, user U1 perceives the response as pleasant, reassuring, and making a good impression. Specifically, the response makes a good impression when user U2 utters a pitch P2 that is consonant with user U1's pitch P1. Moreover, not all of user U1's utterance section counts equally. The final portion, which is closest to the start of the utterance section of voice V2, most strongly affects the impression of user U2's response. The feature acquisition unit 32 of the first embodiment therefore determines the pitch P1 over a section of predetermined length (for example, 180 msec) at the end of the utterance section of user U1's voice V1.
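Restricting P1 to the end of the utterance section might look like the following sketch. It assumes, for illustration only, that the f0 track covers just the utterance section of voice V1 and uses a fixed frame spacing `hop_ms`. The 180 ms default is the example length given above; `tail_pitch` is a hypothetical name.

```python
import math
from typing import Optional, Sequence


def tail_pitch(f0: Sequence[float], hop_ms: float,
               tail_ms: float = 180.0) -> Optional[float]:
    """Average voiced f0 over the last `tail_ms` milliseconds of an utterance.

    `f0` holds one value per analysis frame for the utterance section only,
    with 0 marking unvoiced frames; `hop_ms` is the spacing between frames.
    Returns None if the tail contains no voiced frame.
    """
    n_frames = max(1, math.ceil(tail_ms / hop_ms))  # frames covering the tail
    voiced = [p for p in f0[-n_frames:] if p > 0]
    if not voiced:
        return None
    return sum(voiced) / len(voiced)
```

For instance, with a 10 ms hop the tail covers the last 18 frames. An utterance whose final 18 frames sit at 200 Hz yields P1 = 200 Hz, regardless of the pitch earlier in the utterance.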

The information generation unit 34 in FIG. 1 generates information R (hereinafter, "related information") about the conversation between users U1 and U2. This information is of a different kind from the feature amounts extracted by the feature acquisition unit 32. The first embodiment illustrates related information R that indicates the temporal circumstances of the conversation. Specifically, the related information R indicates two values:

- the conversation date and time (for example, the date and time of day);
- the conversation duration (for example, the elapsed time since the conversation started).

The information generation unit 34 determines both values by referring, for example, to the time kept by a clock circuit (not shown). The conversation date and time is, for example, the start date and time of the most recent utterance section in voice V1 or V2. The conversation duration is the time elapsed from the start of the earliest utterance section in voice V1 or V2 to the current time.
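One possible representation of the first embodiment's related information R is sketched below. The `RelatedInfo` type, the `make_related_info` function, and the list-of-sections input are hypothetical. The `now` argument stands in for the clock circuit. The logic follows the rule stated above: the latest section start gives the conversation date and time, and the time elapsed since the earliest section start gives the duration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class RelatedInfo:
    conversation_datetime: datetime  # start of the most recent utterance section
    duration: timedelta              # time since the earliest utterance section began


def make_related_info(sections: List[Tuple[datetime, datetime]],
                      now: datetime) -> RelatedInfo:
    """Build related information R from the utterance sections of V1 and V2.

    `sections` lists the (start, end) times of every utterance section
    detected so far in either voice; `now` stands in for the clock circuit.
    """
    if not sections:
        raise ValueError("no utterance sections detected yet")
    latest_start = max(start for start, _ in sections)
    earliest_start = min(start for start, _ in sections)
    return RelatedInfo(conversation_datetime=latest_start,
                       duration=now - earliest_start)
```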

The conversation evaluation unit 36 evaluates the conversation between users U1 and U2 using two inputs: the feature amounts (pitches P1 and P2) extracted by the feature acquisition unit 32 and the related information R generated by the information generation unit 34. That is, it calculates an evaluation value S according to pitches P1 and P2 and the related information R. The first embodiment thus takes into account not only the feature amounts of the voices (V1, V2) constituting the conversation but also related information R other than those feature amounts. The evaluation value S calculated by the conversation evaluation unit 36 is displayed on the display device 16.

As described above, user U1 tends to form a good impression when user U2 responds with a voice V2 whose pitch P2 is consonant with the pitch P1 of user U1's voice V1. Taking this tendency into account, the conversation evaluation unit 36 of the first embodiment calculates the evaluation value S from the pitch difference ΔP between pitch P1 and pitch P2 (ΔP = |P1 − P2|). Specifically, the closer ΔP is to a consonant relationship, the larger S becomes. In the first embodiment, consonant relationships are intervals whose frequency ratio is close to a simple integer ratio. Examples are a perfect unison, a perfect octave, a perfect fifth, and a perfect fourth.
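The pitch-difference criterion can be sketched as follows. The text defines ΔP = |P1 − P2| and describes consonance in terms of frequency ratios. This sketch therefore assumes that pitches are compared on a semitone (logarithmic) scale, where a frequency ratio becomes a difference. It uses equal-tempered interval sizes (0, 5, 7, and 12 semitones) as stand-ins for the unison, fourth, fifth, and octave. The octave folding, the 1-semitone tolerance, and the 0 to 100 score range are illustrative assumptions.

```python
import math

# Consonant intervals named in the text, in equal-tempered semitones:
# perfect unison (1:1), perfect fourth (~4:3), perfect fifth (~3:2), perfect octave (2:1).
CONSONANT_SEMITONES = (0.0, 5.0, 7.0, 12.0)


def pitch_difference_semitones(p1_hz: float, p2_hz: float) -> float:
    """ΔP = |P1 - P2| with both pitches expressed on a semitone scale."""
    return abs(12.0 * math.log2(p1_hz / p2_hz))


def distance_to_consonance(delta_p: float) -> float:
    """Semitone distance from ΔP to the nearest consonant interval.

    Intervals wider than an octave are folded back into a single octave
    (an assumption; the text does not discuss compound intervals).
    """
    folded = delta_p % 12.0
    return min(abs(folded - c) for c in CONSONANT_SEMITONES)


def pitch_score(p1_hz: float, p2_hz: float,
                tolerance: float = 1.0, max_score: float = 100.0) -> float:
    """Base part of S: larger the closer ΔP is to a consonant interval, 0 beyond `tolerance`."""
    d = distance_to_consonance(pitch_difference_semitones(p1_hz, p2_hz))
    return max_score * max(0.0, 1.0 - d / tolerance)
```

For example, 220 Hz against 330 Hz (a 3:2 ratio) is about 7.02 semitones, close to a fifth, and scores about 98. A tritone (6 semitones) lies a full semitone from both the fourth and the fifth and scores about 0.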

As illustrated above, the conversation evaluation unit 36 of the first embodiment reflects pitches P1 and P2 in the evaluation value S. It also takes into account the related information R about the conversation between users U1 and U2. Specific relationships between the related information R (conversation date and time, conversation duration) and the evaluation value S are illustrated below.

Conversations at night or on holidays are likely to be between close friends. In such conversations, user U1 is more likely to form a good impression of user U2 than in, for example, a weekday daytime conversation (typically a business conversation). Taking this tendency into account, the conversation evaluation unit 36 makes S larger when the conversation date and time specified by the related information R falls at night or on a holiday than when it falls in the daytime on a weekday. For example, the conversation evaluation unit 36 adds a predetermined value to S when the conversation date and time falls at night or on a holiday.
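The night/holiday rule might be implemented as below. The night hours (19:00 to 06:00), the 10-point bonus, and the caller-supplied holiday set are hypothetical choices; the text fixes none of them. Weekends are treated as days off, which is also an assumption.

```python
from datetime import date, datetime
from typing import AbstractSet


def time_of_day_bonus(when: datetime,
                      holidays: AbstractSet[date] = frozenset(),
                      night_start_hour: int = 19, night_end_hour: int = 6,
                      bonus: float = 10.0) -> float:
    """Points to add to S when the conversation falls at night or on a day off."""
    is_night = when.hour >= night_start_hour or when.hour < night_end_hour
    # Saturday/Sunday (weekday() 5 or 6) or a date listed as a holiday.
    is_day_off = when.weekday() >= 5 or when.date() in holidays
    return bonus if (is_night or is_day_off) else 0.0
```

For instance, a conversation at 22:00 on a Friday, or at 14:00 on a Saturday, earns the bonus. A conversation at 14:00 on a weekday that is not in `holidays` does not.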

A long-running conversation is likely a lively one in which users U1 and U2 each have a good impression of the other. Taking this tendency into account, the conversation evaluation unit 36 makes S larger the longer the conversation duration specified by the related information R. For example, it adds a predetermined value to S when the conversation duration exceeds a predetermined threshold.

On the other hand, when a conversation runs excessively long, fatigue may worsen the users' impressions of each other. In view of this, S may instead be reduced when the conversation duration exceeds a predetermined threshold.

A plurality of different thresholds may also be used. One conceivable configuration sets a first threshold and a second threshold (first threshold < second threshold):

- S is increased when the conversation duration lies between the two thresholds.
- S is reduced when the duration is below the first threshold or above the second.

The amount added to or subtracted from S may also be varied stepwise for each range bounded by the thresholds.
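The threshold rules above, including the stepwise variant, can be expressed as a lookup over ascending thresholds. This sketch uses Python's standard `bisect` module. The threshold and adjustment values in the usage note are hypothetical. The convention that a value equal to a threshold belongs to the range above it is an assumption.

```python
import bisect
from typing import Sequence


def stepwise_adjustment(value: float, thresholds: Sequence[float],
                        adjustments: Sequence[float]) -> float:
    """Return the amount to add to (or, if negative, subtract from) S.

    `thresholds` must be ascending. Range 0 is value < thresholds[0];
    range i is thresholds[i-1] <= value < thresholds[i]; the last range is
    value >= thresholds[-1]. `adjustments[i]` applies to range i.
    """
    if len(adjustments) != len(thresholds) + 1:
        raise ValueError("need exactly one adjustment per range")
    if any(a > b for a, b in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be ascending")
    return adjustments[bisect.bisect_right(thresholds, value)]
```

With `thresholds=[5, 60]` (durations in minutes) and `adjustments=[-5, 10, -5]`, this reproduces the two-threshold rule. A 30-minute conversation gains 10 points, while a 3-minute or 90-minute one loses 5. Longer threshold lists give the stepwise variant.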

FIG. 2 is a flowchart of the processing by which the control device 12 of the first embodiment evaluates the conversation between users U1 and U2 (hereinafter, the "conversation evaluation process"). The process starts in response to, for example, an instruction from a user U (U1, U2) via the input device 18, or the start of an utterance by a user U.

The conversation evaluation process of FIG. 2 proceeds in three steps:

- SA1: The feature acquisition unit 32 analyzes the audio signals X1 and X2 and successively extracts the pitch P1 of user U1's voice V1 and the pitch P2 of user U2's voice V2.
- SA2: The information generation unit 34 generates the related information R (in the first embodiment, the conversation date and time and the conversation duration), for example by referring to the time kept by the clock circuit.
- SA3: The conversation evaluation unit 36 calculates the evaluation value S from the feature amounts (pitches P1 and P2) and the related information R.

The order of SA1 and SA2 may be reversed.
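The three steps of FIG. 2 can be summarized as a small pipeline. The sketch below is hypothetical: each step is passed in as a function. Any pitch extractor, related-information generator, and scoring rule can therefore be plugged in. An example scoring rule is a pitch-based score plus the date/time and duration adjustments described above. SA1 and SA2 do not depend on each other, so their order could be swapped, as the text notes.

```python
from typing import Callable, Sequence, TypeVar

FeatT = TypeVar("FeatT")  # feature amounts, e.g. a (P1, P2) pair
InfoT = TypeVar("InfoT")  # related information R


def conversation_evaluation(
    x1: Sequence[float],
    x2: Sequence[float],
    acquire_features: Callable[[Sequence[float], Sequence[float]], FeatT],
    generate_related_info: Callable[[], InfoT],
    evaluate: Callable[[FeatT, InfoT], float],
) -> float:
    """Run steps SA1 to SA3 and return the evaluation value S."""
    features = acquire_features(x1, x2)   # SA1: pitches P1, P2 from signals X1, X2
    related = generate_related_info()     # SA2: related information R
    return evaluate(features, related)    # SA3: evaluation value S
```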

As illustrated above, the first embodiment can objectively evaluate the conversation between users U1 and U2 according to the feature amounts of the voices V1 and V2 that constitute it. Furthermore, it evaluates the conversation using its related information R in addition to those feature amounts. It can therefore evaluate the conversation more appropriately than a configuration that reflects only the feature amounts in the evaluation result. In particular, the first embodiment evaluates the conversation according to the pitch difference ΔP between pitch P1 of voice V1 and pitch P2 of voice V2. This lets it judge objectively whether user U2's response makes a good or bad impression, in terms of the interval (that is, the pitch difference) of user U2's voice V2 relative to user U1's voice V1.

<Second Embodiment>
A second embodiment of the present invention will now be described. In each of the embodiments illustrated below, elements whose operation or function is the same as in the first embodiment keep the reference signs used in the description of the first embodiment. Their detailed description is omitted as appropriate.

FIG. 3 is a configuration diagram of the conversation evaluation device 100 of the second embodiment. As illustrated in FIG. 3, the storage device 14 of the conversation evaluation device 100 of the second embodiment stores history information H for each combination of users U. The history information H relates to the history of past conversations between users U (the conversation history). Specifically, it specifies two values:

- the frequency of past conversations between users U (hereinafter, "conversation frequency");
- the time elapsed since the first conversation between those users (hereinafter, "relationship period").

The conversation frequency is the number of conversations within a period of predetermined length (for example, one month), and it is updated with each conversation between users U. The conversation frequency and the relationship period can also be regarded as indices of the intimacy between users U.

The information generation unit 34 of the second embodiment generates the related information R by referring to the history information H stored in the storage device 14. For example, users U1 and U2 each enter their own identification information into the conversation evaluation device 100 by operating the input device 18. The information generation unit 34 retrieves from the storage device 14 the history information H for the pair of users U1 and U2 indicated by this identification information. It then generates related information R containing the conversation frequency and relationship period specified by that history information H. The feature acquisition unit 32 extracts pitch P1 of voice V1 and pitch P2 of voice V2 in the same way as in the first embodiment.
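A minimal in-memory stand-in for the per-pair history information H, and for building R from it, is sketched below. Everything here is hypothetical. User IDs are plain strings standing in for the identification information, the pair key is order-independent, and "one month" is approximated as a 30-day window.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, FrozenSet, List


class HistoryStore:
    """Hypothetical in-memory stand-in for history information H in storage device 14."""

    def __init__(self) -> None:
        # Conversation start times, keyed by an unordered pair of user IDs.
        self._starts: Dict[FrozenSet[str], List[datetime]] = defaultdict(list)

    @staticmethod
    def _key(user_a: str, user_b: str) -> FrozenSet[str]:
        return frozenset((user_a, user_b))  # (U1, U2) and (U2, U1) share one entry

    def record(self, user_a: str, user_b: str, start: datetime) -> None:
        """Update H after each conversation between the two users."""
        self._starts[self._key(user_a, user_b)].append(start)

    def related_info(self, user_a: str, user_b: str, now: datetime,
                     window: timedelta = timedelta(days=30)) -> Dict[str, object]:
        """Build related information R: conversation frequency and relationship period."""
        starts = self._starts.get(self._key(user_a, user_b), [])  # .get avoids inserting a key
        if not starts:
            return {"frequency": 0, "relationship_period": timedelta(0)}
        frequency = sum(1 for s in starts if now - window <= s <= now)
        return {"frequency": frequency, "relationship_period": now - min(starts)}
```

For example, a pair who first talked 60 days ago and talked twice in the last 30 days gets a frequency of 2 and a relationship period of 60 days.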

When the conversation frequency is high, users U1 and U2 are likely to be on good terms and conversing while keeping good impressions of each other. Taking this tendency into account, the conversation evaluation unit 36 makes S larger the higher the conversation frequency specified by the related information R. For example, it adds a predetermined value to S when the conversation frequency exceeds a predetermined threshold.

A plurality of different thresholds may also be used. One conceivable configuration uses a first threshold and a second threshold (first threshold < second threshold):

- S is increased when the conversation frequency lies between the two thresholds.
- S is reduced when the frequency is below the first threshold or above the second.

The amount added to or subtracted from S may also be varied stepwise for each range bounded by the thresholds.
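Instead of a hard threshold, a smooth, bounded bonus can also satisfy the rule "the higher the frequency, the larger S". The saturating form below is not from the text and is offered only as an illustrative alternative. `max_bonus` and `half_point` are hypothetical parameters.

```python
def frequency_bonus(frequency: int, max_bonus: float = 20.0,
                    half_point: float = 8.0) -> float:
    """Bonus added to S that grows with conversation frequency.

    Monotonically increasing, bounded above by `max_bonus`, and equal to
    half of `max_bonus` when `frequency == half_point`.
    """
    if frequency < 0:
        raise ValueError("frequency must be non-negative")
    return max_bonus * frequency / (frequency + half_point)
```

A design note: each additional conversation adds a smaller increment than the one before. Very frequent contact therefore cannot dominate the pitch-based part of S.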

Similarly, when the relationship period since the first conversation is long, users U1 and U2 are likely to be on good terms and conversing while keeping good impressions of each other. Taking this tendency into account, the conversation evaluation unit 36 makes S larger the longer the relationship period specified by the related information R. For example, it adds a predetermined value to S when the relationship period exceeds a predetermined threshold.

A plurality of different thresholds may also be used. One conceivable configuration uses a first threshold and a second threshold (first threshold < second threshold):

- S is increased when the relationship period lies between the two thresholds.
- S is reduced when the period is below the first threshold or above the second.

The amount added to or subtracted from S may also be varied stepwise for each range bounded by the thresholds.
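A stepwise version of the relationship-period rule could look like the following. The bonus grows with each full year of acquaintance, up to a cap. The one-year step, the 2-point increment, and the 10-point cap are hypothetical values.

```python
from datetime import timedelta


def relationship_period_bonus(relationship_period: timedelta,
                              step: timedelta = timedelta(days=365),
                              per_step: float = 2.0, cap: float = 10.0) -> float:
    """Bonus that grows by `per_step` for every full `step` of the relationship, capped at `cap`."""
    if relationship_period < timedelta(0):
        raise ValueError("relationship_period must be non-negative")
    full_steps = relationship_period // step  # timedelta // timedelta -> int
    return min(cap, per_step * full_steps)
```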

As described above, the second embodiment also evaluates the conversation using the related information R of the conversation, in addition to the feature amounts of the voices V1 and V2 that make up the conversation. Therefore, as in the first embodiment, the conversation can be evaluated more appropriately than in a configuration that reflects only the feature amounts in the evaluation. In the second embodiment in particular, the conversation history between the users U (for example, the conversation frequency and the relationship period) is used as the related information R. This yields an evaluation that reflects the tendencies of past conversations between the users U.

In the above description, the identification information entered through the input device 18 is used, but any method may be used to identify the user U1 and the user U2. For example, the user U1 may be identified by speaker identification on the audio signal X1 and the user U2 by speaker identification on the audio signal X2. The history information H between the user U1 and the user U2 may then be retrieved. Any known recognition technique may be employed for speaker identification of the user U1 and the user U2.

<Third Embodiment>
FIG. 4 is a configuration diagram of the conversation evaluation device 100 according to the third embodiment. As illustrated in FIG. 4, the storage device 14 of this device stores speaker information Q for each combination of users U. The speaker information Q indicates the relationship between users U. Specifically, the speaker information Q of the third embodiment specifies two things: the mutual relationship between the users U (friends, family, acquaintances, colleagues, and so on) and their degree of intimacy. The mutual relationship and the intimacy are typically entered by the user U through the input device 18. They may also be reflected in the speaker information Q from information registered with an SNS (Social Networking Service), for example.

The information generation unit 34 of the third embodiment generates the related information R by referring to the speaker information Q stored in the storage device 14. As in the second embodiment, the user U1 and the user U2 are identified by entering identification information or by speaker identification. The information generation unit 34 retrieves the speaker information Q for that pair from the storage device 14. It then generates related information R containing the mutual relationship and the intimacy specified by that speaker information Q. The feature acquisition unit 32 extracts the pitch P1 of the voice V1 and the pitch P2 of the voice V2 in the same way as in the first embodiment.
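Speaker information Q is stored per combination of users. The same pair must therefore be found no matter which user is U1 and which is U2. A minimal sketch of such a store keys each entry by an unordered pair (a `frozenset`). The class names, field names, and the 0 to 1 intimacy scale are hypothetical choices for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class SpeakerInfo:
    relationship: str  # e.g. "friend", "family", "acquaintance", "colleague"
    intimacy: float    # hypothetical scale, e.g. 0.0 (distant) to 1.0 (close)


class SpeakerInfoStore:
    """Stores speaker information Q per unordered pair of user IDs."""

    def __init__(self) -> None:
        self._by_pair: Dict[FrozenSet[str], SpeakerInfo] = {}

    def put(self, user_a: str, user_b: str, info: SpeakerInfo) -> None:
        self._by_pair[frozenset((user_a, user_b))] = info

    def get(self, user_a: str, user_b: str) -> Optional[SpeakerInfo]:
        # frozenset makes (U1, U2) and (U2, U1) map to the same entry.
        return self._by_pair.get(frozenset((user_a, user_b)))


def generate_related_info(store: SpeakerInfoStore, u1: str, u2: str) -> Optional[dict]:
    """Related information R for the identified pair, or None if Q is unknown."""
    info = store.get(u1, u2)
    if info is None:
        return None
    return {"relationship": info.relationship, "intimacy": info.intimacy}


store = SpeakerInfoStore()
store.put("alice", "bob", SpeakerInfo("friend", 0.8))
R = generate_related_info(store, "bob", "alice")
```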

When the user U1 and the user U2 are friends, they are likely to be on good terms and to be talking while maintaining a favorable impression of each other. In view of this tendency, the conversation evaluation unit 36 calculates the evaluation value S so that S is larger when the related information R specifies friendship than when it specifies another relationship. Specifically, the conversation evaluation unit 36 adds a predetermined value to the evaluation value S when the user U1 and the user U2 are friends.

Similarly, when the intimacy is high, the user U1 and the user U2 are likely to be on good terms and to be talking while maintaining a favorable impression of each other. In view of this tendency, the conversation evaluation unit 36 calculates the evaluation value S so that the higher the intimacy specified by the related information R, the larger the evaluation value S. For example, the conversation evaluation unit 36 adds a predetermined value to the evaluation value S when the intimacy exceeds a predetermined threshold. A plurality of different thresholds may also be used. In one conceivable configuration, points are added to the evaluation value S when the intimacy lies between a first threshold and a second threshold (first threshold < second threshold). Points are deducted from the evaluation value S when the intimacy falls below the first threshold or exceeds the second threshold. The value added to or deducted from the evaluation value S may also be varied stepwise for each range bounded by the thresholds.
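The friend bonus and the stepwise, range-by-range variant described above could be combined roughly as follows. In this sketch `bisect_right` locates the range that contains the intimacy. Several details are hypothetical: the bonus of 5, the thresholds 0.3 and 0.7, the step values, and the rule that a value equal to a threshold belongs to the range above it.

```python
from bisect import bisect_right
from typing import Sequence


def relationship_bonus(relationship: str, friend_bonus: float) -> float:
    """Add a fixed bonus to S only when the users are friends."""
    return friend_bonus if relationship == "friend" else 0.0


def stepwise_adjustment(value: float, thresholds: Sequence[float],
                        steps: Sequence[float]) -> float:
    """Return steps[i] for the i-th range bounded by the ascending thresholds.

    A value equal to a threshold falls into the range above it (an assumption).
    """
    if len(steps) != len(thresholds) + 1:
        raise ValueError("need exactly one step value per range")
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be sorted in ascending order")
    return steps[bisect_right(thresholds, value)]


def related_info_adjustment(relationship: str, intimacy: float) -> float:
    # Hypothetical values: intimacy on a 0..1 scale split into three ranges.
    return (relationship_bonus(relationship, friend_bonus=5.0)
            + stepwise_adjustment(intimacy, [0.3, 0.7], [-2.0, 1.0, 4.0]))
```

Negative step values express the deduction case, so the add and deduct rules share one table.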

As described above, the third embodiment also evaluates the conversation using the related information R of the conversation, in addition to the feature amounts of the voices V1 and V2 that make up the conversation. Therefore, as in the first embodiment, the conversation can be evaluated more appropriately than in a configuration that reflects only the feature amounts in the evaluation. In the third embodiment in particular, the speaker information Q between the users U (for example, the mutual relationship and the intimacy) is used as the related information R. This yields an evaluation that reflects the actual relationship between the users U.

<Fourth Embodiment>
FIG. 5 is a configuration diagram of the conversation evaluation device 100 according to the fourth embodiment. As illustrated in FIG. 5, the storage device 14 of this device stores attribute information A for each user U. The attribute information A indicates attributes (characteristics or properties) of the user U. Information that depends on the voice the user U produces is particularly suitable as the attribute information A. The attribute information A of the fourth embodiment specifies the speaking frequency of the user U, that is, the average pitch of the user's voice.

As in the second embodiment, the user U1 and the user U2 are identified by entering identification information or by speaker identification. The information generation unit 34 of the fourth embodiment retrieves the attribute information A of each of them from the storage device 14. It then generates related information R containing the speaking frequency specified by each piece of attribute information A. The related information R of the fourth embodiment thus describes each user U taking part in the conversation being evaluated. Like the related information R of the first to third embodiments, it is an example of information of a different type from the feature amounts extracted by the feature acquisition unit 32.

In the first to third embodiments, the conversation evaluation unit 36 uses the related information R generated by the information generation unit 34 when evaluating the conversation. In the fourth embodiment, by contrast, the feature acquisition unit 32 uses the related information R when extracting feature amounts. That is, the feature acquisition unit 32 of the fourth embodiment extracts the feature amounts under conditions that depend on the related information R generated by the information generation unit 34.

Specifically, the feature acquisition unit 32 takes the audio signal X1 generated by the sound collection device 22 and extracts the acoustic components within a predetermined band containing the speaking frequency of the user U1, as specified by the related information R. It then identifies the pitch P1 from those components. In other words, the pitch P1 is identified only within the range in which the user U1 normally speaks. Similarly, the feature acquisition unit 32 identifies the pitch P2 from the acoustic components of the audio signal X2, generated by the sound collection device 24, that lie within a predetermined band containing the speaking frequency of the user U2 specified by the related information R. Attribute information A that specifies the vocal range of the user U may also be used as the related information R.
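The idea of searching for the pitch only near a user's stored speaking frequency can be sketched with plain autocorrelation. One substitution should be stated plainly. The document band-limits the signal itself, whereas this sketch leaves the signal unfiltered and restricts the autocorrelation lag search to periods whose frequencies fall in `[center_hz / ratio, center_hz * ratio]`. The function name and the default `ratio` are hypothetical. The method is a textbook estimator, not the patent's own.

```python
import math
from typing import Sequence


def estimate_pitch_in_band(signal: Sequence[float], sample_rate: float,
                           center_hz: float, ratio: float = 1.5) -> float:
    """Estimate pitch (Hz) by autocorrelation, searching only the band
    [center_hz / ratio, center_hz * ratio] around the user's speaking frequency.
    """
    if ratio <= 1.0:
        raise ValueError("ratio must be greater than 1")
    f_low, f_high = center_hz / ratio, center_hz * ratio
    # Candidate lags (in samples) whose frequencies lie inside the band.
    min_lag = max(1, math.floor(sample_rate / f_high))
    max_lag = min(len(signal) - 1, math.ceil(sample_rate / f_low))
    if min_lag > max_lag:
        raise ValueError("signal too short for the requested band")

    mean = sum(signal) / len(signal)
    x = [s - mean for s in signal]  # remove any DC offset

    best_lag, best_score = min_lag, -math.inf
    for lag in range(min_lag, max_lag + 1):
        n = len(x) - lag
        # Average product, so longer lags are not penalized for having fewer terms.
        score = sum(x[i] * x[i + lag] for i in range(n)) / n
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Usage: a 200 Hz tone, searched around a stored speaking frequency of 220 Hz.
fs = 8000.0
tone = [math.sin(2 * math.pi * 200.0 * n / fs) for n in range(800)]
p1 = estimate_pitch_in_band(tone, fs, center_hz=220.0)
```

With these settings, lags corresponding to roughly 147 to 330 Hz are searched. The subharmonic lag at 100 Hz is never considered, and ruling out such octave errors is the benefit of constraining the search.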

The conversation evaluation unit 36 evaluates the conversation between the user U1 and the user U2 according to the pitches P1 and P2 identified by the feature acquisition unit 32 using the related information R, as described above. Specifically, the conversation evaluation unit 36 calculates the evaluation value S according to the pitch difference ΔP between the pitch P1 and the pitch P2. In the fourth embodiment, the conversation evaluation unit 36 does not take the related information R into account. However, as in the first to third embodiments, the conversation evaluation unit 36 of the fourth embodiment could also take the related information R into account in its evaluation.

As described above, in the fourth embodiment, as in the first embodiment, the conversation between the user U1 and the user U2 can be evaluated objectively according to the feature amounts of the voices V1 and V2 that make up the conversation. In addition, the fourth embodiment extracts the feature amounts (pitch P1 and pitch P2) under conditions that depend on the related information R about the conversation. The feature amounts can therefore be extracted more appropriately than in a configuration that does not use the related information R for feature extraction. For example, restricting the analysis to the frequency band corresponding to the speaking frequency specified by the related information R allows the feature amounts to be extracted with high accuracy.

In the above example, the speaking frequency of the user U is used as the attribute information A, but the content of the attribute information A is not limited to this example. For example, attribute information A specifying the gender of the user U may be used. In that case, the feature acquisition unit 32 identifies the pitch P within the frequency band expected for the gender specified by the related information R. Suppose the related information R specifies that the user U1 is female. The feature acquisition unit 32 then extracts the pitch P1 from the components of the audio signal X1 in the higher range expected for female voices. If the user U1 is male, it extracts the pitch P1 from the components in the lower range expected for male voices.

Pitch correction is also preferable. When the pitch difference between the voice V1 of the user U1 and the voice V2 of the user U2 exceeds one octave, one of the pitches P1 and P2 is shifted toward the other by an integer number of octaves so that the difference falls within one octave (hereinafter "pitch correction"). When the user U1 and the user U2 are of different genders (that is, when the pitch difference is likely to be large), pitch correction is presumed to be needed more often. In view of this tendency, the following configuration is also preferable. The feature acquisition unit 32 performs pitch correction when the related information R specifies different genders for the user U1 and the user U2, and omits it when they share the same gender.
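Pitch correction with gender gating can be sketched as follows. Pitches are given in Hz, and one octave is a factor of 2. Two details are assumptions not found in the document. The first is expressing ΔP in semitones (12 × log2 of the frequency ratio). The second is truncating the octave shift toward zero, which keeps the corrected pitch on the same side (above or below) of the reference. The function names are hypothetical.

```python
import math
from typing import Optional


def octave_correct(p_ref: float, p: float) -> float:
    """Shift p by a whole number of octaves so it lies within one octave of p_ref.

    The shift is truncated toward zero, so p stays on the same side of p_ref.
    """
    if p_ref <= 0 or p <= 0:
        raise ValueError("pitches must be positive")
    octaves = math.log2(p / p_ref)
    if abs(octaves) <= 1.0:
        return p  # already within one octave: no correction needed
    return p / (2.0 ** math.trunc(octaves))


def pitch_difference_semitones(p1: float, p2: float,
                               gender1: Optional[str] = None,
                               gender2: Optional[str] = None) -> float:
    """ΔP in semitones, applying pitch correction only when the genders differ."""
    if gender1 is not None and gender2 is not None and gender1 != gender2:
        p2 = octave_correct(p1, p2)
    return 12.0 * math.log2(p2 / p1)
```

For example, P1 = 100 Hz and P2 = 300 Hz are about 19 semitones apart. With different genders, P2 is corrected to 150 Hz, which gives a difference of about 7 semitones.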

<Fifth Embodiment>
As in the third embodiment, the information generation unit 34 of the fifth embodiment refers to the speaker information Q stored in the storage device 14. From it, the unit generates related information R specifying the intimacy between the user U1 and the user U2. As in the fourth embodiment, the feature acquisition unit 32 extracts the feature amounts (pitch P1 and pitch P2) of the voice V1 of the user U1 and the voice V2 of the user U2 under conditions that depend on the related information R. Specifically, the feature acquisition unit 32 extracts the feature amounts at a frequency that depends on the intimacy specified by the related information R.

For example, when the intimacy is high, the user U1 and the user U2 are on good terms, so the evaluation value S is expected to be relatively large. When the intimacy is low, it is hard to predict whether the evaluation value S will be large or small. It can therefore be assumed that a conversation with high intimacy needs to be evaluated only occasionally, whereas one with low intimacy needs to be evaluated frequently. In view of this tendency, the feature acquisition unit 32 of the fifth embodiment extracts the feature amounts (pitch P1 and pitch P2) less often as the intimacy specified by the related information R increases.

Specifically, the feature acquisition unit 32 extracts the feature amounts less often when the intimacy exceeds a predetermined threshold than when it falls below the threshold. When the intimacy is below the threshold, the pitches P1 and P2 are extracted for every pair of consecutive utterance sections of the voices V1 and V2, that is, once for each utterance by the user U1 and response by the user U2. When the intimacy exceeds the threshold, the pitches P1 and P2 are extracted only once every several such pairs. The conversation evaluation unit 36 evaluates the conversation each time the feature acquisition unit 32 extracts the feature amounts. Consequently, the higher the intimacy specified by the related information R, the less often the conversation evaluation unit 36 evaluates the conversation, and the less often the evaluation value S shown on the display device 16 is updated. A plurality of different thresholds may also be used. In one conceivable configuration, an extraction frequency is set for each range bounded by the thresholds, and the feature acquisition unit 32 extracts the feature amounts at the frequency corresponding to the range that contains the intimacy.
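Extraction scheduling by intimacy range might look like the following sketch. Here the extraction frequency is expressed as an interval: the number of utterance/response pairs per extraction. Several parts are hypothetical: the function names, the choice to start from the first pair, and the example setting of one threshold at 0.5 with intervals of 1 and 3.

```python
from bisect import bisect_right
from typing import List, Sequence


def extraction_interval(intimacy: float, thresholds: Sequence[float],
                        intervals: Sequence[int]) -> int:
    """Number of utterance/response pairs per feature extraction for this intimacy."""
    if len(intervals) != len(thresholds) + 1:
        raise ValueError("need one interval per intimacy range")
    return intervals[bisect_right(thresholds, intimacy)]


def pairs_to_evaluate(num_pairs: int, interval: int) -> List[int]:
    """Indices of the utterance/response pairs whose pitches P1, P2 are extracted."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    return list(range(0, num_pairs, interval))


# Hypothetical setting: below 0.5 every pair is analyzed, otherwise every third pair.
low = pairs_to_evaluate(6, extraction_interval(0.2, [0.5], [1, 3]))   # [0, 1, 2, 3, 4, 5]
high = pairs_to_evaluate(6, extraction_interval(0.8, [0.5], [1, 3]))  # [0, 3]
```

Because evaluation runs once per extraction, the same schedule also governs how often S is recomputed and displayed.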

As described above, in the fifth embodiment, as in the first embodiment, the conversation between the user U1 and the user U2 can be evaluated objectively according to the feature amounts of the voices V1 and V2 that make up the conversation. In addition, the fifth embodiment extracts the feature amounts (pitch P1 and pitch P2) under conditions that depend on the related information R. As in the fourth embodiment, they can therefore be extracted more appropriately than in a configuration that does not use the related information R for feature extraction. For example, because the fifth embodiment controls the extraction frequency according to the related information R, it needs less computation for feature extraction by the feature acquisition unit 32 and for evaluation by the conversation evaluation unit 36. The comparison is with a configuration that does not use the related information R for feature extraction.

In the fifth embodiment described above, the feature extraction condition (specifically, the extraction frequency) is controlled according to the intimacy between the users U. However, the related information R used for the extraction condition is not limited to this example. Any of the related information R described in the first to third embodiments may be used to control the extraction condition. For example, the extraction condition could be controlled according to the conversation frequency or the relationship period described in the second embodiment. One such configuration lowers the extraction frequency as the conversation frequency increases or as the relationship period lengthens.

The first to third embodiments can also be combined with the fourth and fifth embodiments. In the first group, the conversation evaluation unit 36 takes the related information R into account when evaluating the conversation. In the second group, the feature extraction condition of the feature acquisition unit 32 is controlled according to the related information R. Different types of related information R are preferably used for the evaluation and for controlling the extraction condition, but the same related information R may also be used for both. For example, if the related information R includes the conversation frequency, the extraction frequency of the feature acquisition unit 32 can be controlled according to that conversation frequency. The same conversation frequency can also be used by the conversation evaluation unit 36, as in the second embodiment.

<Modifications>
Each of the aspects illustrated above can be modified in various ways. Specific modifications are illustrated below. Two or more aspects selected from the following examples may be combined as appropriate, provided they do not contradict each other.

(1) In each of the above embodiments, the conversation evaluated consists of the voice V1 produced by the user U1 and the voice V2 produced by the user U2. However, the voices evaluated by the conversation evaluation device 100 are not limited to sounds actually uttered by the users U (natural human voices). Specifically, one of the voices V1 and V2 may be a synthesized voice generated by a known speech synthesis technique. For example, the configuration of the above embodiments can also evaluate a conversation composed of the voice V1 produced by the user U1 and a voice V2 generated by speech synthesis. In this case, speech recognition on the voice V1 analyzes what the user U1 said, and a voice V2 that responds appropriately to that utterance is generated. Alternatively, one of several prerecorded voices may be selected as the voice V2. Two further configurations are also possible. One evaluates a conversation composed of a synthesized voice V1 and a voice V2 produced by the user U2. The other evaluates a conversation in which both voices V1 and V2 are synthesized.

When the voice V1 or the voice V2 is generated by speech synthesis as described above, the corresponding sound collection device 22 or 24 can be omitted. When synthesized speech is used, the feature acquisition unit 32 may also acquire the speech synthesis parameters as the feature amounts of the voices V1 and V2. These are the parameters that specify the acoustic characteristics of the voice (for example, pitch and volume). In that configuration, analyzing the audio signal X1 to extract the feature amount of the voice V1 and analyzing the audio signal X2 to extract that of the voice V2 can both be omitted. As this description shows, the feature acquisition unit 32 is defined broadly as any element that acquires the feature amounts of the voices (V1, V2) making up the conversation. It covers elements that extract feature amounts from audio signals by analysis, and also elements that acquire them by any method other than extraction. That is, "extracting" a feature amount is one example of "acquiring" it.

(2) In each of the above embodiments, the display device 16 shows the evaluation value S calculated by the conversation evaluation unit 36. However, the evaluation result need not take the form of the evaluation value S. For example, the display device 16 may show an evaluation comment corresponding to the evaluation value S, with or without the evaluation value S itself. The result also need not be displayed; for example, the evaluation value S or the evaluation comment may be output as sound.

(3) The method of calculating the evaluation value S from the feature amounts (pitches P1 and P2) and the related information R is not limited to the examples in the above embodiments. For example, S may be calculated by an operation (for example, a weighted sum) that combines two values: a score obtained by evaluating the conversation from the feature amounts, and a score calculated from the related information R. Alternatively, the relationship between the feature amounts and the evaluation value S may be changed according to the related information R. Examples include changing the type or the coefficients of the arithmetic expression that defines that relationship. This approach also yields an evaluation value S that depends on both the feature amounts and the related information R.
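Both alternatives in (3) can be sketched briefly. The first is a weighted sum of a feature-based score and a related-information score. The second is an expression whose coefficient is chosen from R. Much of this is hypothetical: the weight of 0.7, the 0 to 100 scale, the linear fall-off around a target ΔP, and the slope table keyed by relationship.

```python
from typing import Mapping


def weighted_evaluation(feature_score: float, related_score: float,
                        w_feature: float = 0.7) -> float:
    """Weighted sum of a feature-based score and a related-information score."""
    if not 0.0 <= w_feature <= 1.0:
        raise ValueError("w_feature must lie in [0, 1]")
    return w_feature * feature_score + (1.0 - w_feature) * related_score


def evaluation_with_related_slope(delta_p: float, target: float,
                                  slope_by_relationship: Mapping[str, float],
                                  relationship: str,
                                  default_slope: float = 10.0) -> float:
    """S falls off linearly as ΔP departs from a target; R selects the slope."""
    slope = slope_by_relationship.get(relationship, default_slope)
    return max(0.0, 100.0 - slope * abs(delta_p - target))
```

In the second function, a gentler slope for friends means that the same pitch mismatch is penalized less within a close relationship. This is one way R can reshape the feature-to-score mapping without adding a separate term.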

(4) The feature amounts extracted by the feature acquisition unit 32 are not limited to the pitches (P1, P2). For example, the feature acquisition unit 32 may extract the volume of each of the voices V1 and V2 as a feature amount. The conversation evaluation unit 36 then evaluates the conversation according to, for example, the volume difference between the voices V1 and V2. For instance, it calculates S so that S grows as the volume difference approaches a predetermined value.

The feature acquisition unit 32 may also extract, as a feature amount, the interval between an utterance section of the voice V1 and an utterance section of the voice V2 (hereinafter the "utterance interval"). When the utterance interval in a conversation is appropriate, the partner's voice tends to be perceived as reassuring and pleasant. In view of this tendency, the conversation evaluation unit 36 preferably calculates the evaluation value S so that S grows as the utterance interval approaches a predetermined value.
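Variation (4) names two features: a volume difference and an utterance interval. In both, S grows as the value approaches a predetermined target. A rough sketch follows. Several choices are assumptions: measuring volume as RMS level in dB, using a linear fall-off within a tolerance, and the target values (6 dB, 0.5 s) used in the usage lines. The document gives no concrete numbers.

```python
import math
from typing import Sequence


def rms_db(samples: Sequence[float]) -> float:
    """Volume of an utterance as its RMS level in dB relative to full scale 1.0."""
    if not samples:
        raise ValueError("empty utterance")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(max(rms, 1e-12))  # floor avoids log10(0) on silence


def closeness_score(value: float, target: float, tolerance: float,
                    max_score: float = 100.0) -> float:
    """max_score at the target, falling linearly to 0 at |value - target| >= tolerance."""
    if tolerance <= 0:
        raise ValueError("tolerance must be positive")
    return max(0.0, max_score * (1.0 - abs(value - target) / tolerance))


def utterance_interval(v1_end_s: float, v2_start_s: float) -> float:
    """Gap in seconds between the end of V1 and the start of the response V2."""
    return v2_start_s - v1_end_s


# Hypothetical usage with toy sample values and targets.
volume_diff = rms_db([1.0, -1.0]) - rms_db([0.5, -0.5])  # about 6.02 dB
s_volume = closeness_score(volume_diff, target=6.0, tolerance=3.0)
s_interval = closeness_score(utterance_interval(2.0, 2.5), target=0.5, tolerance=0.5)
```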

(5) The method by which the information generation unit 34 generates the related information R is not limited to the examples in the above embodiments. Specifically, the information generation unit 34 may generate the related information R from the results of analyzing the audio signals X1 and X2. For example, it may use the pitch P1 of the voice V1 and the pitch P2 of the voice V2, as identified by the feature acquisition unit 32, to estimate the gender of each of the user U1 and the user U2. It may then generate related information R specifying those genders, as in the fourth embodiment.
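Variation (5) derives gender-specifying related information R from the extracted pitches. It could be sketched as a simple median-pitch threshold. This is a crude heuristic, because male and female speaking pitches overlap considerably. The 165 Hz boundary, the convention that 0 marks an unvoiced frame, and all names here are hypothetical rather than taken from the document.

```python
from statistics import median
from typing import Dict, Sequence

# Hypothetical boundary between typical male and female speaking pitch.
GENDER_BOUNDARY_HZ = 165.0


def estimate_gender(pitch_track_hz: Sequence[float],
                    boundary_hz: float = GENDER_BOUNDARY_HZ) -> str:
    """Crude estimate from the median pitch of voiced frames (0 marks unvoiced)."""
    voiced = [p for p in pitch_track_hz if p > 0]
    if not voiced:
        raise ValueError("no voiced frames")
    return "female" if median(voiced) >= boundary_hz else "male"


def related_info_from_pitch(p1_track: Sequence[float],
                            p2_track: Sequence[float]) -> Dict[str, str]:
    """Related information R specifying the estimated genders of U1 and U2."""
    return {"gender_u1": estimate_gender(p1_track),
            "gender_u2": estimate_gender(p2_track)}
```

The resulting R could then drive the gender-dependent search band or the pitch-correction decision described in the fourth embodiment.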

(6) The conversation evaluation device 100 may also be implemented as a server device that communicates with terminal devices such as mobile phones and smartphones. The server may be a single device or a server system made up of several devices. For example, the conversation evaluation device 100 receives the audio signals X1 and X2 from a terminal device. It evaluates the conversation between the user U1 and the user U2 by the same method as in the above embodiments and sends the result (for example, the evaluation value S) back to the terminal device.

(7) As described above, the conversation evaluation device 100 exemplified in each of the embodiments can be realized through cooperation between the control device 12 and a program. For example, a program according to a first aspect, corresponding to the first to third embodiments, causes a computer such as the control device 12 (for example, one or more processing circuits) to function as: the feature acquisition unit 32, which acquires a feature amount of the speech constituting a conversation; the information generation unit 34, which generates related information R about the conversation that is of a different kind from the feature amount; and the conversation evaluation unit 36, which evaluates the conversation according to both the feature amount and the related information R.
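The three units of the first-aspect program can be sketched as three functions chained together, with R influencing only the evaluation step. Everything concrete here is hypothetical: the feature is taken to be the pitch of V2 relative to V1 in semitones (echoing the pitch-difference evaluation of claim 4), R holds only a `relationship` field (one of the options listed in claim 5), and the per-relationship target differences in `TARGET_DIFF` and the scoring formula are arbitrary placeholders.

```python
import math
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class RelatedInfo:
    """Related information R (hypothetical single field)."""
    relationship: str


def acquire_feature(p1: Sequence[float], p2: Sequence[float]) -> float:
    """Feature acquisition unit 32 (sketch): pitch of V2 relative to V1, in semitones.

    Frames with pitch 0 are treated as unvoiced and ignored.
    """
    f1 = mean([p for p in p1 if p > 0])
    f2 = mean([p for p in p2 if p > 0])
    return 12.0 * math.log2(f2 / f1)


def generate_related_info(relationship: str) -> RelatedInfo:
    """Information generation unit 34 (sketch): R is simply supplied by the caller here."""
    return RelatedInfo(relationship)


# Hypothetical target pitch differences (semitones) per relationship; arbitrary values.
TARGET_DIFF = {"friends": 0.0, "business": -3.0}


def evaluate(diff_semitones: float, info: RelatedInfo) -> float:
    """Conversation evaluation unit 36 (sketch): S in (0, 1], largest when the
    pitch difference matches the target selected by R."""
    target = TARGET_DIFF.get(info.relationship, 0.0)
    return 1.0 / (1.0 + abs(diff_semitones - target))


def evaluate_conversation(p1: Sequence[float], p2: Sequence[float], relationship: str) -> float:
    """Run the three units in sequence and return the evaluation value S."""
    diff = acquire_feature(p1, p2)
    info = generate_related_info(relationship)
    return evaluate(diff, info)
```

The point of the design is that the same acoustic feature can receive different scores depending on R: a response three semitones below the utterance scores about 1.0 under the assumed "business" target but only about 0.25 under "friends".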

A program according to a second aspect, corresponding to the fourth or fifth embodiment, causes a computer such as the control device 12 (for example, one or more processing circuits) to function as: the feature acquisition unit 32, which acquires a feature amount of the speech constituting a conversation; the information generation unit 34, which generates related information R about the conversation that is of a different kind from the feature amount; and the conversation evaluation unit 36, which evaluates the conversation according to the feature amount. In this aspect, the feature acquisition unit 32 acquires the feature amount under a condition that depends on the related information R.
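One possible reading of "acquiring the feature amount under a condition that depends on R" is sketched below: the pitch search range of a simple autocorrelation pitch estimator is chosen from the speaker's gender recorded in R. The gender-based condition follows the hint in item (5) that the fourth embodiment uses genders in R, but the ranges in `SEARCH_RANGE_HZ`, the `{"gender": ...}` layout of R, and the choice of autocorrelation as the pitch method are assumptions, not the embodiment itself.

```python
import math
from typing import Dict, Sequence, Tuple

# Hypothetical pitch search ranges (Hz), selected by the gender recorded in R.
SEARCH_RANGE_HZ: Dict[str, Tuple[float, float]] = {
    "male": (60.0, 200.0),
    "female": (140.0, 400.0),
}
DEFAULT_RANGE_HZ = (60.0, 400.0)


def estimate_pitch(frame: Sequence[float], sample_rate: int, related_info: Dict[str, str]) -> float:
    """Autocorrelation pitch estimate whose lag search is restricted by R."""
    f_lo, f_hi = SEARCH_RANGE_HZ.get(related_info.get("gender", ""), DEFAULT_RANGE_HZ)
    n = len(frame)
    min_lag = max(1, int(sample_rate / f_hi))  # highest allowed pitch -> shortest lag
    max_lag = min(int(sample_rate / f_lo), n - 1)  # lowest allowed pitch -> longest lag
    if max_lag < min_lag:
        raise ValueError("frame too short for the requested pitch range")
    best_lag, best_r = min_lag, -math.inf
    for lag in range(min_lag, max_lag + 1):
        # Normalise by the overlap length so that longer lags are not penalised.
        r = sum(frame[i] * frame[i + lag] for i in range(n - lag)) / (n - lag)
        if r > best_r:
            best_lag, best_r = lag, r
    return sample_rate / best_lag
```

For a 100 Hz sine sampled at 8 kHz, `estimate_pitch(frame, 8000, {"gender": "male"})` returns a value close to 100 Hz, whereas the "female" range cannot return anything below about 140 Hz. This illustrates why a condition taken from R can change the acquired feature: restricting the search range suppresses octave errors when R is right, but forces an error when R is wrong.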

The program of each aspect illustrated above can be provided stored on a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium; an optical recording medium (optical disc) such as a CD-ROM is a good example, but any known type of recording medium, such as a semiconductor or magnetic recording medium, may be used. The program can also be distributed to a computer via a communication network.

(8) A preferred aspect of the present invention can also be specified as an operating method (conversation evaluation method) of the conversation evaluation device 100 exemplified in each of the embodiments described above. In a conversation evaluation method according to a first aspect, a computer (a single computer or a system composed of multiple computers) acquires a feature amount of the speech constituting a conversation, generates related information R about the conversation that is of a different kind from the feature amount, and evaluates the conversation according to the feature amount and the related information R. In a conversation evaluation method according to a second aspect, a computer acquires a feature amount of the speech constituting a conversation, generates related information R about the conversation that is of a different kind from the feature amount, and evaluates the conversation according to the feature amount and the related information R, where the feature amount is acquired under a condition that depends on the related information R.

DESCRIPTION OF REFERENCE SIGNS: 100 ... conversation evaluation device; 12 ... control device; 14 ... storage device; 16 ... display device; 18 ... input device; 22 ... sound collection device; 24 ... sound collection device; 32 ... feature acquisition unit; 34 ... information generation unit; 36 ... conversation evaluation unit.

Claims

1. A conversation evaluation apparatus comprising:
a feature acquisition unit that acquires a feature amount of speech constituting a conversation;
an information generation unit that generates related information about the conversation, the related information being different from the feature amount; and
a conversation evaluation unit that evaluates the conversation according to the feature amount and the related information.

2. The conversation evaluation apparatus according to claim 1, wherein the feature acquisition unit acquires the feature amount under a condition corresponding to the related information.

3. A conversation evaluation apparatus comprising:
a feature acquisition unit that acquires a feature amount of speech constituting a conversation;
an information generation unit that generates related information about the conversation, the related information being different from the feature amount; and
a conversation evaluation unit that evaluates the conversation according to the feature amount,
wherein the feature acquisition unit acquires the feature amount under a condition corresponding to the related information.

4. The conversation evaluation apparatus according to claim 1, wherein
the feature acquisition unit acquires, as the feature amount, the pitch of each of a first voice and a second voice constituting the conversation, and
the conversation evaluation unit evaluates the conversation according to a pitch difference between the first voice and the second voice.

5. The conversation evaluation apparatus according to any one of claims 1 to 4, wherein the related information indicates at least one of a temporal situation of the conversation, a history of past conversations between the speakers of the conversation, a relationship between the speakers of the conversation, and an attribute of each speaker of the conversation.

6. A program causing a computer to function as:
a feature acquisition unit that acquires a feature amount of speech constituting a conversation;
an information generation unit that generates related information about the conversation, the related information being different from the feature amount; and
a conversation evaluation unit that evaluates the conversation according to the feature amount and the related information.

7. A program causing a computer to function as:
a feature acquisition unit that acquires a feature amount of speech constituting a conversation;
an information generation unit that generates related information about the conversation, the related information being different from the feature amount; and
a conversation evaluation unit that evaluates the conversation according to the feature amount,
wherein the feature acquisition unit acquires the feature amount under a condition corresponding to the related information.