JP2005091463A

JP2005091463A - Information processing device

Info

Publication number: JP2005091463A
Application number: JP2003321460A
Authority: JP
Inventors: Satoshi Arai; 智荒井
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2003-09-12
Filing date: 2003-09-12
Publication date: 2005-04-07

Abstract

PROBLEM TO BE SOLVED: To provide an information processing device that displays a speaker's emotion together with his or her facial expression. SOLUTION: The information processing device includes a display unit (240) that displays the speaker's facial expression, analysis units (110, 220) that analyze the speaker's emotion from the speaker's voice information, and a control unit (110) that instructs the display unit to display the analysis result. COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to an information processing apparatus, and more particularly to an information processing apparatus that displays a speaker's emotion together with the speaker's facial expression.

Video display devices that show live video of a speaking scene are used in systems such as TV phones and TV conferences. When video of several people is displayed, not every face necessarily appears in the picture at a sufficient size. In a broadcast of a meeting with many participants, for example, some attendees may have their backs to the camera. A face may therefore be missing from the video, or appear only at a small size. In other words, a viewer may be unable to tell visually which person in the video is saying what.

For this reason, a conventional video display device first displays a main video showing the state of the TV conference. In addition, it generates an animation that represents each person's speaking state by deforming a face model (two-dimensional, for example) and displays it as auxiliary information for the main video, indicating who is speaking (see, for example, Patent Document 1).
Patent Document 1: JP 2002-150317 A (pages 2 to 4, FIG. 2)

However, even when it is clear who is currently speaking, it is difficult to read the speaker's emotions from the TV screen. This is all the more true when the screen showing the speaker's facial expression is small.

An object of the present invention is to provide an information processing apparatus that displays the speaker's emotion together with the speaker.

A first invention is an information processing apparatus comprising a display unit that displays a speaker's facial expression, an analysis unit that analyzes emotion from the speaker's voice information, and a control unit that instructs the display unit to also display the analysis result.

A second invention is the information processing apparatus according to the first invention, wherein the analysis unit analyzes emotion from a change in the frequency of the voice information.

A third invention is the information processing apparatus according to the first invention, wherein the analysis unit analyzes emotion from a change in the volume of the voice information.

A fourth invention is the information processing apparatus according to the first invention, wherein the analysis unit analyzes emotion from a change in the conversation interval of the voice information.

A fifth invention is the information processing apparatus according to the first invention, wherein the analysis result is a mark indicating the speaker's emotion.

According to the present invention, it is possible to provide an information processing apparatus that displays a speaker's emotion together with the speaker's facial expression.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

FIG. 1 is a schematic block diagram of an information processing apparatus according to an embodiment of the present invention. The MPU 110 starts the application software for the TV phone system from the ROM 210. It then acquires information from the other party to the TV phone call (the speaker) through the network connection unit 120. In addition to the other party's facial expression and voice, this information includes IP address information used to identify who is currently speaking.

The other party's facial expression is displayed as the main video on the display 240 via the RAM 130 and the image output unit 230. The other party's voice is decoded and output from the speaker 260 via the RAM 130 and the sound output unit 250.

Meanwhile, the user's facial expression passes through the camera 140 and the image input unit 150, the user's voice passes through the microphone 160 and the voice input unit 170, and information from input devices such as the keyboard 180 and the mouse 190 passes through the I/O control unit 200. Each is encoded and output from the network connection unit 120 to the other party's information processing apparatus. An ordinary TV phone system operates in this way.

In the present embodiment, the other party's voice acquired from the network connection unit 120 is additionally stored temporarily in the RAM 130, and the MPU 110 starts emotion estimation software from the ROM 210 and executes it to estimate the other party's emotion. The emotion is estimated using an emotion database recorded on a disk 220 such as an HDD. The MPU 110 then instructs that an auxiliary video based on the estimated emotion (a mark such as characters, symbols, or images) be displayed on the display 240 together with the other party's video (facial expression). The auxiliary video thus visualizes emotions that are hard to judge from the facial expression shown on the display, making them easier for the user to understand. This is particularly effective when the display is small and it is difficult to judge emotion from the other party's facial expression.

FIG. 2 is a flowchart for estimating a change in the speaker's emotion. First, the speaker's voice data transmitted over the network is decoded by the speech recognition system (S110). Next, when the conversation involves multiple parties, the speaker is identified from the IP address information (S120). Whether the speaker's emotion has changed is then estimated by the following three analysis methods.
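Step S120 can be illustrated with a small sketch. The class name `SpeakerRegistry`, the participant names, and the example addresses are hypothetical; the description only states that the speaker is identified from IP address information, so keeping one 10-entry history per address is an assumption made here so that each speaker can later be analyzed separately.

```python
import ipaddress
from collections import defaultdict, deque


class SpeakerRegistry:
    """Hypothetical helper: maps sender IP addresses to speakers (S120)."""

    def __init__(self, window_seconds=10):
        self._names = {}
        # One bounded history per sender, so analysis never mixes speakers.
        self._history = defaultdict(lambda: deque(maxlen=window_seconds))

    def register(self, ip, name):
        self._names[ipaddress.ip_address(ip)] = name

    def identify(self, ip):
        # Unregistered senders are reported as "unknown" (an assumption).
        return self._names.get(ipaddress.ip_address(ip), "unknown")

    def record(self, ip, features):
        # Store one per-second feature record for this sender.
        history = self._history[ipaddress.ip_address(ip)]
        history.append(features)
        return history


registry = SpeakerRegistry()
registry.register("192.0.2.10", "Participant A")
print(registry.identify("192.0.2.10"))
```

Parsing the address with `ipaddress.ip_address` normalizes the key and rejects malformed input early, which seems a reasonable choice when the address arrives from the network.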

The first analysis method uses the frequency of the voice. The voice frequency is analyzed to create a voice frequency distribution (S130). This distribution is then compared, second by second, with the voice frequency distributions of the past 10 seconds (S140). If the voice frequency distribution has changed (S150), the speaker's emotion is estimated to have changed (S160). If it has not changed, the current emotion is estimated to be unchanged (S170).
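The comparison in S130 to S170 might be sketched as follows. The description does not say how two distributions are compared, so several things here are assumptions: each second of audio is summarized as a list of per-band energies, the past 10 seconds are averaged into a baseline, the distance is the sum of absolute differences, and the threshold of 0.5 is arbitrary. The class name `FrequencyChangeDetector` is likewise hypothetical.

```python
from collections import deque


def normalize(band_energies):
    """Scale band energies so they sum to 1 (a frequency distribution)."""
    total = sum(band_energies)
    if total == 0:
        return [0.0] * len(band_energies)
    return [e / total for e in band_energies]


def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


class FrequencyChangeDetector:
    """Compares each second's distribution with the past 10 seconds."""

    def __init__(self, window_seconds=10, threshold=0.5):
        self.history = deque(maxlen=window_seconds)
        self.threshold = threshold  # assumed value, not from the source

    def update(self, band_energies):
        current = normalize(band_energies)        # S130
        changed = False
        if self.history:
            n = len(self.history)
            # Baseline: mean distribution over the stored seconds (S140).
            baseline = [sum(col) / n for col in zip(*self.history)]
            changed = l1_distance(current, baseline) > self.threshold  # S150
        self.history.append(current)
        return changed  # True: S160, False: S170


detector = FrequencyChangeDetector()
for _ in range(10):
    detector.update([8, 2, 0, 0])      # energy concentrated in low bands
print(detector.update([0, 0, 2, 8]))   # energy moves to high bands
```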

The second analysis method uses the volume level. The volume level is compared, second by second, with the volume levels of the past 10 seconds (S180). If the volume level has changed (Yes in S190), the emotion is estimated to have changed; if it has not (No in S190), the emotion is estimated to be unchanged.
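A minimal sketch of S180 and S190, under stated assumptions: the volume level of each second is taken to be the RMS of its samples, the baseline is the mean of the past 10 levels, and a relative deviation above 25% counts as a change. None of these choices comes from the description, and the function names are hypothetical.

```python
import math


def rms_level(samples):
    """Root-mean-square level of one second of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def volume_changed(past_levels, current_level, tolerance=0.25):
    """S180/S190: compare the current level with the past 10 seconds.

    `tolerance` is an assumed relative threshold.
    """
    if not past_levels:
        return False  # nothing to compare against yet
    baseline = sum(past_levels) / len(past_levels)
    if baseline == 0:
        return current_level > 0
    return abs(current_level - baseline) / baseline > tolerance


past = [1.0] * 10
print(volume_changed(past, 1.1))  # small fluctuation
print(volume_changed(past, 2.0))  # level has doubled
```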

The third analysis method uses the interval between the encoding of the user's own speech and the decode request for the other party's speech. The trend of this interval over the past 10 seconds is examined (S200). If the interval is tending to shrink or grow (Yes in S210), the emotion is estimated to have changed; if there is no such trend (No in S210), the emotion is estimated to be unchanged. In an ordinary conversation the other party replies soon after the user finishes speaking, so the decode request for the other party's speech occurs after the user's voice has been encoded. When the other party is angry, however, the other party starts talking while the user is still speaking, in an attempt to contradict what the user is saying. That is, the decode request for the other party's speech occurs while the user's voice is still being encoded. When the other party is sad, on the other hand, the reply is delayed by the sadness. That is, the decode request for the other party's speech occurs some time after the user's voice has been encoded.
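The trend check in S200 and S210 could be sketched as below. It assumes that each conversational turn yields a gap in seconds between the end of encoding the user's speech and the decode request for the reply, with a negative gap meaning the other party started before the user finished. Fitting a least-squares slope and the `min_slope` threshold are assumptions; the description only speaks of a decreasing or increasing tendency.

```python
def slope(values):
    """Least-squares slope of values against their index."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def interval_trend(gaps, min_slope=0.05):
    """Classify the recent reply gaps (S200/S210).

    `min_slope` (seconds per turn) is an assumed threshold.
    """
    s = slope(gaps)
    if s <= -min_slope:
        return "decreasing"   # replies arrive sooner, possibly interrupting
    if s >= min_slope:
        return "increasing"   # replies arrive later
    return "none"


print(interval_trend([0.8, 0.6, 0.4, 0.1, -0.2]))
print(interval_trend([0.3, 0.5, 0.9, 1.4]))
```

A decreasing trend matches the interruption pattern the description associates with anger, and an increasing trend matches the delayed replies it associates with sadness.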

FIG. 3 is a flowchart for estimating the speaker's emotion according to the present embodiment. As an example, the following describes how it is estimated whether the speaker's emotion corresponds to “anger/excitement” or “sadness”.

First, the case where the voice frequency distribution has shifted toward the high range (Yes in S310) is described. In this case the speaker's emotion may correspond to “anger/excitement”. If the volume level is judged to be increasing (Yes in S320) and the conversation interval is judged to be decreasing (Yes in S330), the words decoded by the speech recognition system are compared with the “anger/excitement” entries recorded in the emotion database (S340). If matching wording exists (Yes in S350), the speaker's current emotion can be determined to be “anger/excitement”. An image and symbol that suit “anger/excitement” are selected, and the text font and display color are changed to ones that evoke “anger/excitement” (S360). If there are no matching words, the words are recorded in the database as additional phrases (S370).

On the other hand, if the volume level is not increasing (No in S320) or the conversation interval is not decreasing (No in S330), the process ends without judging the emotion.

Next, the case where the voice frequency distribution has not shifted toward the high range (No in S310) is described. In this case the speaker's emotion may correspond to “sadness”.

The voice frequency distribution and volume level are compared with those at the start of the conversation (S380). If the distribution has shifted toward the low range (Yes in S390), the volume level is decreasing (Yes in S400), and the conversation interval is judged to be increasing (Yes in S410), the words decoded by the speech recognition system are compared with the “sadness” entries recorded in the emotion database (S420). If matching wording exists (Yes in S430), the speaker's current emotion can be determined to be “sadness”. An image and symbol that suit “sadness” are selected, and the text font and display color are changed to ones that evoke “sadness” (S440). If there are no matching words, the words are recorded in the database as additional phrases (S450).

On the other hand, if the distribution has not shifted toward the low range (No in S390), the volume level is not decreasing (No in S400), or the conversation interval is not increasing (No in S410), the process ends without judging the emotion.
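The two branches of FIG. 3 can be summarized in one decision function. This is only a sketch of the flow as described: the function name `judge_emotion`, the string labels for the shift and trend inputs, and the shape of the database (a dictionary of word sets) are hypothetical. The description says unmatched words are recorded in the database as additional phrases; storing them in a separate `pending` dictionary rather than directly in the emotion's vocabulary is an assumption.

```python
def judge_emotion(freq_shift, volume_trend, interval_trend,
                  words, emotion_db, pending):
    """Return "anger/excitement", "sadness", or None (no judgment).

    freq_shift: "high", "low", or "none"
    volume_trend, interval_trend: "increasing", "decreasing", or "none"
    """
    if freq_shift == "high":                                   # S310 Yes
        if volume_trend != "increasing" or interval_trend != "decreasing":
            return None                                        # S320 / S330 No
        candidate = "anger/excitement"
    elif (freq_shift == "low"                                  # low-range check
          and volume_trend == "decreasing"                     # volume check
          and interval_trend == "increasing"):                 # interval check
        candidate = "sadness"
    else:
        return None                                            # end, no judgment

    vocabulary = emotion_db.get(candidate, set())
    if any(word in vocabulary for word in words):              # S340-S350 / S420-S430
        return candidate                                       # S360 / S440
    # No matching wording: keep the words as additional phrases (S370 / S450).
    pending.setdefault(candidate, []).extend(words)
    return None


db = {"anger/excitement": {"what"}, "sadness": {"sorry"}}
pending = {}
print(judge_emotion("high", "increasing", "decreasing", ["what"], db, pending))
print(judge_emotion("low", "decreasing", "increasing", ["sorry"], db, pending))
```

Requiring all three acoustic cues before consulting the vocabulary mirrors the flowchart, where any single "No" ends the process without a judgment.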

FIG. 4 is a diagram outlining the emotion database recorded on the disk 220. The emotion database is built around four basic emotions: “joy”, “sadness”, “anger”, and “surprise”. Here the case of “anger” is described as an example of the database used for emotion analysis. When the emotion analysis and the speech recognition result match an entry in the emotion database, that entry is prepared as a display candidate. For example, when speech recognition decodes the word 「なに」 (“what”), the text 「なにぃぃぃ！」 (a drawn-out “Whaaat!”) is displayed in red.
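The lookup described for FIG. 4 might be modeled as a nested mapping from emotion and recognized word to a display candidate. The layout below is an assumption, since the figure itself is not reproduced here; the names `DisplayCandidate`, `EMOTION_DB`, and `lookup_display` are hypothetical. Only the single “anger” entry comes from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DisplayCandidate:
    text: str    # text shown in the character display area
    color: str   # display color for that text


# Only the "anger" example is taken from the description; the other
# basic emotions are left empty rather than filled with invented entries.
EMOTION_DB = {
    "joy": {},
    "sadness": {},
    "anger": {"なに": DisplayCandidate("なにぃぃぃ！", "red")},
    "surprise": {},
}


def lookup_display(emotion, word, db=EMOTION_DB) -> Optional[DisplayCandidate]:
    """Return the display candidate for a recognized word, if any."""
    return db.get(emotion, {}).get(word)


print(lookup_display("anger", "なに"))
```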

FIGS. 5 to 7 show examples of the image display resulting from emotion analysis. They show the case where, when anger is detected, a background or other effect selected in advance by the user is displayed. FIG. 5 shows the background being changed, in this case to a background of blazing flames. FIGS. 6 and 7 show an image indicating the emotion being added on top of the image of the person captured by the camera. In FIG. 6 a mark representing anger is added to the forehead, and in FIG. 7 a boiling kettle is added above the head.
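The user's advance selection among the three presentation styles of FIGS. 5 to 7 could be represented as a simple preference lookup. The enum `OverlayStyle`, the asset labels, and the returned dictionary shape are hypothetical; only the three anger effects themselves are taken from the description.

```python
from enum import Enum


class OverlayStyle(Enum):
    BACKGROUND = "background"          # FIG. 5: replace the background
    FOREHEAD_MARK = "forehead_mark"    # FIG. 6: mark on the forehead
    OVERHEAD_ICON = "overhead_icon"    # FIG. 7: icon above the head


# Effects for "anger" as described; other emotions are not specified.
OVERLAYS = {
    "anger": {
        OverlayStyle.BACKGROUND: "blazing flames",
        OverlayStyle.FOREHEAD_MARK: "anger mark",
        OverlayStyle.OVERHEAD_ICON: "boiling kettle",
    },
}


def choose_overlay(emotion, preference):
    """Pick the effect the user preselected for the detected emotion."""
    effects = OVERLAYS.get(emotion)
    if effects is None:
        return None  # no effect defined for this emotion
    return {"placement": preference.value, "asset": effects[preference]}


print(choose_overlay("anger", OverlayStyle.OVERHEAD_ICON))
```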

FIG. 8 is a diagram showing an example of the display image when several people are conversing. The analyzed speech is shown in a text display area. In this way the present embodiment can be applied not only to a one-to-one TV phone call but also to a TV chat among three or more people.

The embodiment described above is a preferred specific example of the present invention and therefore carries various technically preferable limitations. Needless to say, the embodiment may be combined and modified as appropriate without departing from the gist of the present invention.

For example, the invention is not limited to the receiving side, where the other party's emotion is analyzed and displayed on the user's own display. It can also be applied on the transmitting side, where the user's own emotion is analyzed, the analysis result is added to the user's own video, and the result is displayed on the other party's display via the network. This makes it possible to convey to the other party subtle changes in emotion that are hard to see from the facial expression on the TV screen. As a result, people with hearing impairments can also understand the other party's emotions more easily than with a conventional TV phone.

In addition, a person with a speech impairment can type on the keyboard the words to be converted into artificial speech and assign emotions to the keyboard's function keys. By performing the reverse of the analysis that uses the emotion database of the present embodiment, emotionally expressive artificial speech can be transmitted to the other party. On the other party's side, in addition to the emotionally expressive artificial speech, the analysis using the emotion database of the present embodiment allows a mark representing the emotion to be added to the speaker's facial expression, so emotion can be expressed more richly than with artificial speech alone.
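One way to picture this reverse analysis is to map a function key to an emotion and the emotion to prosody adjustments for the speech synthesizer. Everything concrete below is an assumption: the key assignments, the `Prosody` fields, and the scale factors. The directions of the adjustments simply invert the cues used in the analysis (higher pitch, louder, and shorter pauses for anger; lower pitch, quieter, and longer pauses for sadness).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prosody:
    pitch_scale: float    # > 1 raises the voice frequency
    volume_scale: float   # > 1 raises the volume level
    pause_scale: float    # > 1 lengthens pauses between phrases


# Hypothetical key assignment; the description leaves it to the user.
FUNCTION_KEY_EMOTIONS = {"F1": "joy", "F2": "sadness", "F3": "anger", "F4": "surprise"}

NEUTRAL = Prosody(1.0, 1.0, 1.0)
# Assumed factors; only the direction of each change follows the analysis cues.
PROSODY = {
    "anger": Prosody(1.2, 1.3, 0.7),
    "sadness": Prosody(0.85, 0.8, 1.4),
}


def build_utterance(text, function_key=None):
    """Bundle typed text with the emotion chosen by function key."""
    emotion = FUNCTION_KEY_EMOTIONS.get(function_key, "neutral")
    return {"text": text, "emotion": emotion,
            "prosody": PROSODY.get(emotion, NEUTRAL)}


print(build_utterance("hello", "F3"))
```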

FIG. 1 is a schematic block diagram of an information processing apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart for estimating a change in the speaker's emotion according to the embodiment.
FIG. 3 is a flowchart for estimating whether the speaker's emotion is “anger/excitement” or “sadness” according to the embodiment.
FIG. 4 is a diagram outlining the emotion database according to the embodiment.
FIG. 5 is a diagram showing an example of the image display resulting from emotion analysis.
FIG. 6 is a diagram showing an example of the image display resulting from emotion analysis.
FIG. 7 is a diagram showing an example of the image display resulting from emotion analysis.
FIG. 8 is a diagram showing an example of the display image when several people are conversing.

Explanation of symbols

110 MPU
120 Network connection unit (Ether connection unit)
130 RAM
140 Camera
150 Image input unit (USB, etc.)
160 Microphone
170 Voice input unit
180 Keyboard
190 Mouse
200 I/O control unit
210 ROM
220 Disk
230 Image output unit
240 Display
250 Sound output unit
260 Speaker

Claims

1. An information processing apparatus comprising:
a display unit that displays the facial expression of a speaker;
an analysis unit that analyzes emotion from the speaker's voice information; and
a control unit that instructs the display unit to also display the analysis result.

2. The information processing apparatus according to claim 1, wherein the analysis unit analyzes emotion from a change in the frequency of the voice information.

3. The information processing apparatus according to claim 1, wherein the analysis unit analyzes emotion from a change in the volume of the voice information.

4. The information processing apparatus according to claim 1, wherein the analysis unit analyzes emotion from a change in the conversation interval of the voice information.

5. The information processing apparatus according to claim 1, wherein the analysis result is a mark indicating the speaker's emotion.