JP2005269207A - Information delivery method and communication apparatus for realizing this method, and program therefor

Info

Publication number: JP2005269207A
Application number: JP2004078356A
Authority: JP
Inventors: Yoshimasa Yanagihara; 義正柳原; Yoshito Nanjo; 義人南條; Tadashi Mori; 忠毛利; Tamotsu Machino; 保町野; Joji Nakayama; 丈二中山; Hitomi Sato; 仁美佐藤; Takayoshi Mochizuki; 崇由望月; Hiroaki Kawada; 博昭河田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-03-18
Filing date: 2004-03-18
Publication date: 2005-09-29
Anticipated expiration: 2024-03-18
Also published as: JP4287770B2

Abstract

PROBLEM TO BE SOLVED: To make communication smoother by allowing a talker's sentiment to be conveyed effectively to the opposite talker, thereby achieving more accurate and easier mutual understanding between talkers.
SOLUTION: Each of communication terminals TM1 to TMn judges whether the talker is making an utterance or responding to an utterer during video conference communication. When the talker is responding, the talker's communication terminal detects the position of the talker's visual line and the motion of the talker's head, estimates the sentiment of the responder on the basis of the detection results, and transmits an identification ID representing the estimated sentiment, together with a person identification ID, to the communication terminal of the utterer. When the talker is making an utterance, the talker's communication terminal receives the sentiment identification ID and the person identification ID transmitted from the responder's communication terminal, selectively reads from a database 22 the character data denoting the contents of the sentiment corresponding to the received sentiment identification ID, superimposes the read character data on the received face image data of the corresponding responder, and displays the result on a monitor 8.
COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to an information transmission method for further facilitating communication during calls between conference participants or between remote collaborators in a collaborative work support system that uses, for example, a videophone device or a video conference system, and to a communication apparatus and a program for realizing this method.

In recent years, various services using communication networks such as the Internet have been proposed. Among them are remote collaborative work systems that support workers at remote locations using a videophone device or a video conference system. In this type of system, smooth communication between the communicating parties is extremely important. However, such systems generally communicate using only video and audio. As a result, a caller's intentions and emotions are not conveyed adequately to the other party, and mutual understanding between speakers is difficult to achieve.
In view of this, a videophone device having a function of conveying a caller's emotion as an image has been proposed (see, for example, Patent Document 1).

Patent Document 1: JP 2002-354436 A

However, this previously proposed videophone device merely transmits an image representing an emotion. It is therefore difficult to judge whether the other speaker has understood what the sender said or wanted to convey, or how the other speaker has reacted to what the sender proposed. For example, the sender cannot immediately confirm whether the other speaker agrees with the proposal, opposes it, or is still thinking it over and withholding judgment. Communication with the other speaker is consequently often interrupted, and smooth communication cannot be achieved. It is also difficult to learn the other party's emotional state effectively and without the other party noticing.

The present invention has been made in view of the above circumstances. Its object is to provide an information transmission method, a communication apparatus for realizing this method, and a program therefor that allow a speaker's emotion to be conveyed effectively to the other speaker, thereby making communication smoother and making mutual understanding between speakers more accurate and easier.

In the present invention, made in view of the above circumstances, when telephone communication using audio and video is performed between a first communication terminal and a second communication terminal via a communication network, the first communication terminal measures the movement of its speaker's head and the position of the speaker's line of sight, estimates the speaker's emotion on the basis of these measurement results, and transmits information representing the estimated emotion to the second communication terminal. The second communication terminal then presents the information representing the emotion transmitted from the first communication terminal to its own speaker.

Therefore, according to the present invention, the first communication terminal automatically estimates its speaker's emotion from the movement of the speaker's head and the position of the line of sight, and the estimated emotion is transmitted to the second communication terminal and presented to the speaker there. The speaker at the second communication terminal can thus grasp the other speaker's emotion while the telephone communication is in progress. For example, if whether the speaker has agreed with, rejected, or withheld judgment on a question is estimated from the movement of the speaker's head and the position of the line of sight, and this is defined as the information representing the speaker's emotion, the speaker at the second communication terminal can clearly grasp the other speaker's reaction to the question without relying on the other speaker's voice or image. Communication between speakers during a call is thereby made smoother, and mutual understanding becomes accurate and easy.

When the information representing the emotion is transmitted, it is preferable that the first communication terminal convert the information representing the estimated emotion into an identification code and transmit the code to the second communication terminal, and that the second communication terminal generate and output presentation information representing the emotion on the basis of the transmitted identification code. This greatly reduces the amount of data needed to represent the emotion, so that the information can be transmitted simply and in a short time.

The information representing the emotion can be presented in various forms, such as the following.
In the first form, presentation information that expresses the emotion as character data is generated on the basis of the transmitted identification code, and this presentation information is displayed superimposed on the video information transmitted from the first communication terminal. The speaker at the second communication terminal can then, for example, check the other speaker's emotion from the character data while viewing the other speaker's face image during videophone communication.

In the second form, presentation information that expresses the emotion as voice or acoustic data is generated on the basis of the transmitted identification code, and this presentation information is inserted into, or synthesized with, the audio information transmitted from the first communication terminal and then output. The speaker at the second communication terminal can then, for example, check the other speaker's emotion from the synthesized voice or acoustic data while listening to the other speaker's voice during telephone communication.

In the third form, presentation information that expresses the emotion as air pressure is generated on the basis of the transmitted identification code, and this presentation information is output. The speaker at the second communication terminal can then check the other speaker's emotion from the pattern of air discharge, without any modification to the received video or the received voice.

Furthermore, when there are a plurality of first communication terminals, it is preferable that the second communication terminal take a majority vote over the information representing the emotions transmitted from the respective first communication terminals and present the result of the vote to its speaker. The speaker at the second communication terminal can then, during a video conference call, immediately judge the prevailing sentiment among the other participants from the result of the majority vote.
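The majority vote described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `majority_emotion`, the string emotion labels, and the rule of returning `None` on a tie are all assumptions, since the text does not say how ties are handled.

```python
from collections import Counter


def majority_emotion(received):
    """Return the most common emotion ID among the responders.

    `received` maps a person identification ID to the emotion ID most
    recently received from that person's terminal. Returning None for an
    empty input or a tie is an assumption made for this sketch.
    """
    if not received:
        return None
    ranked = Counter(received.values()).most_common()
    # A tie between the two highest counts yields no majority.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```

For example, with two responders reporting agreement and one reporting denial, the function returns the agreement ID, which the second communication terminal would then present to its speaker.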

In short, in the present invention, when telephone communication using audio and video is performed between a first communication terminal and a second communication terminal via a communication network, the first communication terminal measures the movement of its speaker's head and the position of the speaker's line of sight, estimates the speaker's emotion on the basis of these measurement results, and transmits information representing the estimated emotion to the second communication terminal. The second communication terminal then presents the information representing the emotion transmitted from the first communication terminal to its own speaker.
Therefore, the present invention can provide an information transmission method, a communication apparatus for realizing this method, and a program therefor that allow a speaker's emotion to be conveyed effectively to the other speaker, thereby making communication smoother and making mutual understanding between speakers more accurate and easier.

(First embodiment)
FIG. 1 is a block diagram showing a first embodiment of a communication system and a communication apparatus for realizing the information transmission method according to the present invention.
In this communication system, a plurality of communication devices TM1 to TMn are connected to one another via a communication network NW so that video conference communication can be performed among these communication devices TM1 to TMn.

The communication network NW consists of, for example, a wired subscriber network and a mobile communication network. The wired subscriber network includes, in addition to the PSTN (Public Switched Telephone Network) and ISDN (Integrated Services Digital Network), in-company communication networks such as a wired LAN (Local Area Network) and a CATV (Cable Television) network. The mobile communication network includes, for example, a mobile phone system, PHS (Personal Handy-phone System), and a wireless LAN. The communication network NW may also include an IP network such as the Internet.

The communication devices TM1 to TMn are configured as follows. Because the communication devices TM1 to TMn all have the same configuration, only the communication device TM1 is described here, and description of the other communication devices TM2 to TMn is omitted.
The communication device TM1 is a videophone communication terminal. As the functions required for videophone communication, it includes a camera 7, a monitor 8, a microphone 9, a speaker 10, a video control unit 11, an acoustic control unit 12, a multiplexing/demultiplexing unit 13, and a communication processing unit 6.

Of these, the video control unit 11 controls the camera 7 and the monitor 8; it captures and encodes the face image and the like of the speaker using the terminal, and decodes received video data for display on the monitor 8. The acoustic control unit 12 encodes the transmitted voice signal input through the microphone 9, and decodes received voice or acoustic data for amplified output from the speaker 10. The multiplexing/demultiplexing unit 13 multiplexes the video data encoded by the video control unit 11 with the transmitted voice data encoded by the acoustic control unit 12, and separates received multiplexed data into video data and voice or acoustic data.

As new functions according to the present invention, the communication device TM1 also includes a line-of-sight detection unit 1, a head movement detection unit 2, a speaker detection unit 3, a responder processing unit 4, and a speaker processing unit 5A. As shown in FIG. 2, the line-of-sight detection unit 1 includes image storage means 14 and line-of-sight detection means 15, and detects coordinate values representing the position of the speaker's line of sight from the speaker's face image data obtained by the video control unit 11.

As shown in FIG. 2, the head movement detection unit 2 includes head posture measurement means 16 and head movement direction detection means 17. The head posture measurement means 16 is a three-axis posture angle sensor, such as a gyroscope or a magnetic sensor, and is worn on the speaker's head. The head movement direction detection means 17 detects the direction (angle) of movement of the speaker's head on the basis of the detection signal output from the head posture measurement means 16.

The speaker detection unit 3 determines whether the speaker is currently speaking on the basis of the speaker's line-of-sight position and head movement angle, detected by the line-of-sight detection unit 1 and the head movement detection unit 2 respectively, and the transmitted voice data output from the acoustic control unit 12.
When the speaker detection unit 3 determines that the speaker is not speaking, that is, that the speaker is listening (responding), the responder processing unit 4 estimates the emotion of this listening speaker (the responder). It is configured as shown in FIG. 2. Specifically, the responder processing unit 4 includes emotion identification means 18 and emotion identification result storage means 19. The emotion identification means 18 estimates the emotion information of the listening (responding) speaker on the basis of the speaker's line-of-sight position and head movement angle detected by the line-of-sight detection unit 1 and the head movement detection unit 2, and converts the estimated emotion information into an emotion identification ID representing the type of emotion. The emotion identification result storage means 19 stores the emotion identification ID obtained by the emotion identification means 18 in memory in association with the identification ID of the listening (responding) speaker.

When the speaker detection unit 3 determines that the speaker is speaking, the speaker processing unit 5A receives, from the communication terminals TM2 to TMn of the other speakers, their emotion information in response to what the speaker is saying, and presents it to the speaker. As shown in FIG. 3, it includes a database (DB) 22, reception result storage means 24, database search means 23, search result storage means 21, and superimposing means 20.

As shown in FIG. 4, the database 22 consists of a person identification table 25 and an emotion identification table 29. The person identification table 25 stores, in association with each person identification ID 26, the name 27 of the participant (speaker) and an emotion identification ID 28. The emotion identification table 29 stores, in association with each emotion identification ID 28, the name 30 of a text file expressing the emotion.

The reception result storage means 24 stores and holds the received emotion identification ID and speaker identification ID. The database search means 23 searches the database 22 on the basis of the received emotion identification ID and person identification ID, and reads out the character data expressing the corresponding emotion. The search result storage means 21 stores and holds the character data thus read out. The superimposing means 20 performs the processing for superimposing the character data expressing the emotion on the video data separated by the multiplexing/demultiplexing unit 13.
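The two-table lookup performed by the database search means 23 can be sketched as below. This is only an illustration under stated assumptions: the dictionary layout, the sample person IDs, participant names, emotion ID values, and text file names are all hypothetical, and the function `record_and_lookup` is a name introduced here, not one taken from the patent.

```python
# Hypothetical contents of the person identification table 25:
# person ID -> participant name and most recently received emotion ID.
person_table = {
    "P01": {"name": "Participant A", "emotion_id": None},
    "P02": {"name": "Participant B", "emotion_id": None},
}

# Hypothetical contents of the emotion identification table 29:
# emotion ID -> name of the text file expressing that emotion.
emotion_table = {
    1: "agree.txt",
    2: "deny.txt",
    3: "hold.txt",
}


def record_and_lookup(person_id, emotion_id):
    """Store a received emotion ID and return (participant name, text file name).

    Returns None when either ID is unknown, leaving the tables unchanged.
    """
    person = person_table.get(person_id)
    text_file = emotion_table.get(emotion_id)
    if person is None or text_file is None:
        return None
    person["emotion_id"] = emotion_id
    return person["name"], text_file
```

In a fuller implementation, the returned file name would be opened to obtain the character data that the superimposing means 20 overlays on the responder's face image.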

The communication processing unit 6 transmits and receives the multiplexed data, the emotion identification ID, and the speaker identification ID to and from the communication terminals TM1 to TMn acting as communication partners, in accordance with the communication protocol defined by the communication network NW. If the communication network NW includes, for example, an IP network, TCP/IP (Transmission Control Protocol/Internet Protocol) is used as the communication protocol.

Next, the operation of the communication terminal TM1 configured as described above will be described.
When video conference communication starts, the communication terminal TM1 transmits and receives video and audio, and also estimates the speaker's emotion information and transmits and receives that information. FIG. 5 is a flowchart showing the procedure and content of the emotion information estimation and transmission/reception processing.

First, in step 5a, the communication terminal TM1 detects the line-of-sight position of the call participant. The line-of-sight position can be detected using, for example, the technique described in "Gaze Measurement Method Based on Eyeball Shape Model: Image Sensing Symposium, 2002". Next, in step 5b, the three-axis posture angles of the speaker's head movement are detected on the basis of the detection signal from the gyroscope or magnetic sensor worn on the call participant's head. Then, in step 5c, whether the speaker is speaking or listening (responding) is determined on the basis of the voice signal output from the acoustic control unit 12.

If the result of this determination is that the speaker is responding, the process proceeds to step 5g, where the responder processing unit 4 determines the responder's emotion information, and in step 5h the emotion identification ID corresponding to this emotion information is transmitted, together with the person identification ID, to the speaker who is speaking. If, on the other hand, the speaker is speaking, the process proceeds to step 5d, where the emotion identification IDs and person identification IDs sent from the listening (responding) communication terminals TM2 to TMn are received. The speaker processing unit 5A then selects a presentation method for the emotion information on the basis of the line-of-sight information contained in the responder's video (step 5e), and presents the received emotion information of the responder according to the selected presentation method (step 5f).
The above processing is repeated until the video conference communication ends.

The detection of the call participant's line-of-sight position by the line-of-sight detection unit 1, the detection of the call participant's head movement by the head movement detection unit 2, the determination by the speaker detection unit 3 of whether the participant is speaking or responding, and the estimation of emotion information by the responder processing unit 4 are each performed, for example, as follows. FIG. 6 is a flowchart showing the processing procedure and content.

First, in step 6a, the speaker's face image captured by the camera 7 is acquired from the video control unit 11, and in step 6b the line-of-sight position is detected by image processing of the acquired face image. The line-of-sight position is detected by obtaining the coordinates of the eyeball within the screen of the monitor 8. This processing is performed for every image frame. When the coordinate position has been detected, the process proceeds from step 6c to step 6d, where the difference in line-of-sight position coordinates between frames is computed to obtain the amount of movement of the line-of-sight position. Using this amount of movement, steps 6e and 6f detect whether the line of sight is in a gazing state.

Specifically, the amount of line-of-sight movement between consecutive frames is calculated first. If this amount is within a preset threshold, the amount of movement of the line-of-sight position in the following frames is calculated relative to the line-of-sight position in the frame from which the difference was taken. If the calculated amount of movement stays within the preset threshold for consecutive frames, and the number of such consecutive frames is at or above a preset count, that is, the state lasts for at least a set time, the line of sight is regarded as gazing at a particular position.
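The gaze test described above can be sketched as follows. This is a hedged illustration rather than the patent's implementation: the function name `is_gazing`, the use of Euclidean distance as the "amount of movement", and the default values for `threshold` and `min_frames` are assumptions introduced for this example.

```python
import math


def is_gazing(points, threshold=10.0, min_frames=5):
    """Return True if the line of sight dwells near one position.

    `points` is a sequence of (x, y) line-of-sight coordinates, one per
    frame. Once two consecutive frames differ by no more than `threshold`,
    the earlier frame becomes the reference position. Gazing is reported
    when `min_frames` consecutive frames (min_frames >= 1) stay within
    `threshold` of that reference.
    """
    anchor = None  # reference position (the difference-source frame)
    run = 0        # consecutive frames within the threshold
    prev = None    # previous frame's position
    for p in points:
        if anchor is not None and math.dist(anchor, p) <= threshold:
            run += 1
        elif anchor is None and prev is not None and math.dist(prev, p) <= threshold:
            anchor, run = prev, 1
        else:
            # Movement exceeded the threshold: discard the reference.
            anchor, run = None, 0
        if run >= min_frames:
            return True
        prev = p
    return False
```

Counting frames rather than measuring elapsed time mirrors the text, which treats the preset consecutive count as equivalent to a set duration at a fixed frame rate.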

In parallel with the line-of-sight position detection described above, the posture angles of the head movement are measured. Suppose that three types of head movement are defined, "head tilt", "head shake", and "nod", and that the angles corresponding to these movements are (α, β, γ), respectively, as shown in FIG. 6(b). First, in step 6g, the three-axis posture angles (α, β, γ) of the head are measured at every sampling period, and in step 6h the amount of change is computed for each posture angle. The head movement is then identified from the computed amounts of change as follows.

For a head shake (posture angle β) or a nod (posture angle γ), steps 6i and 6j first determine whether the amount of change in the posture angle is at or above a preset value and whether the latest posture angle is at or above a set value. If these conditions are satisfied, step 6l then determines whether there is a sign change between successive amounts of change. This is because a head shake and a nod are regarded as continuous back-and-forth movements of the head. If a sign change is found, it is regarded as a back-and-forth movement, and the number of such movements is stored. If step 6m finds that this count is at or above a set value, a head shake or a nod is deemed to have occurred. A head tilt (posture angle α), on the other hand, is not a back-and-forth movement, so in step 6k the head is deemed to have been tilted if the amount of change per sampling period is at or above a set value and the latest posture angle is at or above a set value.
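The head movement test described above can be sketched as follows. This is a minimal sketch under several assumptions: the function names, the default thresholds, the use of absolute values for the "latest posture angle" check, and the order in which nod, head shake, and head tilt are tested are all choices made for this example and are not specified in the text.

```python
def count_reversals(angles, delta_min, angle_min):
    """Count sign changes between successive large angle changes."""
    reversals = 0
    last_sign = 0
    for prev, cur in zip(angles, angles[1:]):
        delta = cur - prev
        # Ignore samples whose change or latest angle is too small.
        if delta == 0 or abs(delta) < delta_min or abs(cur) < angle_min:
            continue
        sign = 1 if delta > 0 else -1
        if last_sign and sign != last_sign:
            reversals += 1  # direction flipped: one back-and-forth movement
        last_sign = sign
    return reversals


def has_large_step(angles, delta_min, angle_min):
    """True if any single sampling step is large and ends at a large angle."""
    return any(
        abs(cur - prev) >= delta_min and abs(cur) >= angle_min
        for prev, cur in zip(angles, angles[1:])
    )


def detect_head_motion(alpha, beta, gamma,
                       delta_min=5.0, angle_min=5.0, min_reversals=2):
    """Classify sampled posture angles as 'nod', 'shake', 'tilt', or None.

    alpha, beta, gamma are sequences of angles sampled once per period
    for head tilt, head shake, and nod respectively.
    """
    if count_reversals(gamma, delta_min, angle_min) >= min_reversals:
        return "nod"
    if count_reversals(beta, delta_min, angle_min) >= min_reversals:
        return "shake"
    if has_large_step(alpha, delta_min, angle_min):
        return "tilt"
    return None
```

A sequence such as `[0, 10, -10, 10, -10]` on the γ axis produces three sign changes and is therefore classified as a nod with the default settings.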

After the head movement has been detected by the above procedure, the emotion is determined by combining it with the gazing state of the line-of-sight position detected earlier. For example, when a head movement and gazing occur in succession, the emotion is judged as agreement if the posture angle is γ (nod), as denial if it is β (head shake), and as withheld judgment (hold) if it is α (head tilt). In steps 6o, 6p, and 6q, an identification ID corresponding to the determined emotion is assigned, and this emotion identification ID is transmitted together with the person identification ID to the communication terminal of the speaker who is speaking.
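The mapping from head movement and gazing state to an emotion identification ID can be sketched as below. The numeric ID values, the `Emotion` enumeration, the motion labels `"nod"`, `"shake"`, and `"tilt"`, and the tuple used as the outgoing message are all hypothetical; the text states only that each determined emotion is given an ID and sent with the person identification ID.

```python
from enum import IntEnum


class Emotion(IntEnum):
    """Hypothetical emotion identification IDs."""
    AGREE = 1  # nod (posture angle gamma)
    DENY = 2   # head shake (posture angle beta)
    HOLD = 3   # head tilt (posture angle alpha)


_MOTION_TO_EMOTION = {
    "nod": Emotion.AGREE,
    "shake": Emotion.DENY,
    "tilt": Emotion.HOLD,
}


def classify(head_motion, gazing):
    """Return the Emotion for a head motion accompanied by gazing, else None."""
    if not gazing:
        return None
    return _MOTION_TO_EMOTION.get(head_motion)


def build_message(person_id, head_motion, gazing):
    """Return (person ID, emotion ID) to send to the speaking terminal, or None."""
    emotion = classify(head_motion, gazing)
    if emotion is None:
        return None
    return (person_id, int(emotion))
```

Sending only a small integer alongside the person ID reflects the stated motivation for identification codes: the emotion can be conveyed with very little data.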

The emotion identification ID and person identification ID sent from a responder's communication terminal are presented as follows. FIG. 7 is a flowchart showing the procedure and content of the emotion information presentation processing performed by the speaker processing unit 5A.
When person identification IDs and emotion identification IDs arrive from a plurality of responder-side communication terminals in response to the speaker's utterance, they are received by the communication processing unit 6 under the control of step 7a, taken into the speaker processing unit 5A, and stored in the reception result storage means 24 in step 7b.

Once the person identification ID and the emotion identification ID have been received and stored, the speaker processing unit 5A searches the database 22 on the basis of these IDs in step 7c. It selects the corresponding talker from among the participants on the basis of the person identification ID (step 7d), then selects, on the basis of the emotion identification ID, the name of the text file representing that talker's emotion information and acquires the file (step 7e). It also uses the person identification ID to identify the corresponding talker's face image displayed on the monitor 8 (step 7f), and the superimposing means 20 superimposes the character data in the text file on that face image (step 7g). The responder's face image with the character data superimposed is supplied to the monitor 8 via the video control unit 11 and displayed.

FIG. 8 shows a display example of responders' face images on which the character data is superimposed. As shown in the figure, the main screen 32 of the monitor 8 displays the face image of the main responder together with character data representing that responder's emotion, and the sub-screens 32 to 34 each display the face image of another responder together with character data representing that responder's emotion.

As described above, in the first embodiment each of the communication terminals TM1 to TMn determines, during video conference communication, whether its talker is speaking or responding. While the talker is responding, the terminal detects the talker's line-of-sight position and head motion, estimates the responder's emotion from the detection results, and transmits an identification ID representing the estimated emotion, together with the person identification ID, to the speaker's communication terminal. While the talker is speaking, the terminal receives the emotion identification IDs and person identification IDs sent from the responder-side communication terminals, selectively reads from the database 22 the character data representing the emotion that corresponds to each received emotion identification ID, superimposes the read character data on the received face image data of the corresponding responder, and displays the result on the monitor 8.

Accordingly, the communication terminals TM1 to TMn automatically estimate the emotion of a responding talker, and the estimated emotion is transmitted to the speaker-side communication terminal, converted into character data, and displayed together with the responder's face image. The talker who is speaking can therefore grasp, easily and accurately from the character data, how each responder feels about the utterance while carrying on the videophone communication. For example, the speaker can clearly tell whether a responder agreed with, denied, or withheld judgment on a question, without having to infer it from the responder's voice or image. Communication between the talkers during a call thus becomes smoother, and mutual understanding becomes more accurate and easier.

(Second Embodiment)
FIG. 9 is a block diagram showing the configuration of the main part of a communication terminal according to the second embodiment of the present invention. In the figure, parts identical to those in FIGS. 1 to 3 are given the same reference numerals, and detailed description of them is omitted.
The speaker processing unit 5B of the communication terminal according to the second embodiment consists of a processing unit that converts emotion information into character data and presents it, and a processing unit that converts a responder's emotion information into media other than character data and presents it.

As in FIG. 3, the processing unit that converts emotion information into character data and presents it includes: reception result storage means 24 that stores the person identification IDs and emotion identification IDs received from a plurality of responder terminals; database search means 22 that searches the database 22 for character data representing the corresponding emotion on the basis of each received ID; search result storage means 21 that stores the retrieved character data; and superimposing means 20 that superimposes the retrieved character data on the received face information of the responder.

The processing unit that converts emotion information into other media and presents it includes superimposed image storage means 43, feature region extraction means 44, presentation method determination means 36, a database (DB2) 40, database search means 39, voice/sound file selection means 37, voice/sound output means 38, air pressure control means 41, and air pressure generation means 42.

The superimposed image storage means 43 stores the responder's face image data on which the superimposing means 20 has superimposed the character data. The feature region extraction means 44 extracts the talker's line-of-sight position detected by the line-of-sight detection unit 1 and the feature regions of the superimposed image. The presentation method determination means 36 determines into which medium the responder's emotion information is to be converted for presentation.

The database 40 stores other media data representing emotion information and, as shown in FIG. 10, consists of a person identification table 25, an emotion identification table 29, and a line-of-sight position table 46. The person identification table 25 stores a participant (talker) name 27 and an emotion identification ID 28 in association with each person identification ID 26. The emotion identification table 29 stores a text file name 30 expressing the emotion and a line-of-sight position ID in association with each emotion identification ID 28. The line-of-sight position table 46 stores a voice/sound file name 47 expressing the emotion and an air pressure parameter setting file name 48 in association with each line-of-sight position ID.
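One way to picture the three tables is as keyed mappings, as in the sketch below. Every ID, participant name, and file name here is invented for illustration, and the lookup function is a simplified reading of how the tables might be consulted rather than the patent's own procedure.

```python
# Person identification table 25: person ID -> participant name and emotion ID.
PERSON_TABLE = {
    "P01": {"name": "Participant A", "emotion_id": "E1"},
    "P02": {"name": "Participant B", "emotion_id": "E2"},
}

# Emotion identification table 29: emotion ID -> text file name and gaze position ID.
EMOTION_TABLE = {
    "E1": {"text_file": "agree.txt", "gaze_id": "G1"},
    "E2": {"text_file": "deny.txt", "gaze_id": "G2"},
}

# Line-of-sight position table 46: gaze position ID -> sound file and air pressure file.
GAZE_TABLE = {
    "G1": {"sound_file": "agree.wav", "air_file": "agree_air.cfg"},
    "G2": {"sound_file": "deny.wav", "air_file": "deny_air.cfg"},
}


def lookup_presentation(person_id: str, emotion_id: str, gaze_id: str) -> dict:
    """Collect the names needed to present one responder's emotion (hypothetical helper)."""
    person = PERSON_TABLE[person_id]
    emotion = EMOTION_TABLE[emotion_id]
    media = GAZE_TABLE[gaze_id]
    return {
        "name": person["name"],
        "text_file": emotion["text_file"],
        "sound_file": media["sound_file"],
        "air_file": media["air_file"],
    }
```

Storing only file names in the tables keeps the database small; the text, sound, and air pressure parameter files themselves would be read separately when needed.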

In accordance with the determination result of the presentation method determination means 36, the database search means 39 searches the database 40 on the basis of the received emotion identification ID and person identification ID and the detected line-of-sight position ID, and reads out the voice/sound file name 47 or the air pressure parameter setting file name 48 that expresses the corresponding emotion.

When the presentation method determination means 36 designates a presentation method using voice/sound data, the voice/sound file selection means 37 selects the voice/sound file corresponding to the voice/sound file name read out by the database search means 39. The voice/sound output means 38 outputs the selected voice/sound file from the loudspeaker 10.

When the presentation method determination means 36 designates a presentation method using air pressure, the air pressure control means 41 selects the air discharge pattern corresponding to the air pressure parameter setting file name read out by the database search means 39. The air pressure generation means 42 discharges air in accordance with the selected discharge pattern.

Next, the emotion information presentation processing performed by the speaker processing unit 5B of the communication terminal configured as above will be described. FIG. 11 is a flowchart showing its procedure and content.
When face image data of the responding talkers is sent from the communication terminals of a plurality of responders, the face image data is received by the communication processing unit 6, the multiplexing/demultiplexing unit 13, and the video control unit 11 under the control of step 11a, and is then stored in the reception result storage means in step 11b.

When, at the same time, person identification IDs and emotion identification IDs are sent from the responders' communication terminals, they are received by the communication processing unit 6 under the control of step 11c and temporarily stored in the reception result storage unit 24. In step 11d, the speaker processing unit 5B first searches the database 40 on the basis of each received ID, reads out the text file name matching the emotion identification ID, and acquires the character data written in that text file. In step 11f, it superimposes the acquired character data on the received face image data of the responder, and in step 11g it displays the face image data with the superimposed character data on the monitor 8. It further stores the face image data with the superimposed character data in step 11h and computes the difference from the received face image in step 11i. On the basis of the computed difference, it extracts the feature regions in step 11j.

Meanwhile, in step 11k the speaker processing unit 5B takes the speaker's face image data captured by the camera 7 from the video control unit 11 into the line-of-sight detection unit 1, and in step 11l it computes the line-of-sight position coordinates (x, y) on the screen from that face image data. From the computed coordinates and the feature regions extracted in step 11j, it identifies in step 11m whether the line-of-sight position lies in the character data superimposition region, the person region, or any other region, and acquires the line-of-sight position ID corresponding to the identified region.
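The region test of step 11m could look roughly like the sketch below, assuming that the feature regions are available as axis-aligned rectangles in screen coordinates. The `Rect` type, the region labels, and the choice to test the text region before the person region (because the text is superimposed on the face image and the two may overlap) are assumptions of this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned screen rectangle (hypothetical representation of a feature region)."""
    left: int
    top: int
    width: int
    height: int

    def contains(self, x: int, y: int) -> bool:
        return (self.left <= x < self.left + self.width
                and self.top <= y < self.top + self.height)


def classify_gaze(x: int, y: int, text_region: Rect, person_region: Rect) -> str:
    """Return 'text', 'person', or 'other' for the line-of-sight coordinates (x, y)."""
    # The text region is tested first because it may lie inside the person region.
    if text_region.contains(x, y):
        return "text"
    if person_region.contains(x, y):
        return "person"
    return "other"
```

The returned label plays the role of the line-of-sight position ID in this sketch; a real terminal would map it to whatever ID format its database uses.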

Next, the speaker processing unit 5B presents the emotion information in a manner that depends on the identification result. If the line-of-sight position is in the person region, for example, it searches the database 40 on the basis of the line-of-sight position ID in step 11n to acquire the corresponding voice/sound file name, and outputs the voice/sound file indicated by that name from the loudspeaker 10 under the control of step 11o. In this case, the responder's emotion information is presented to the speaker as, for example, synthesized voice or sound.

If the line-of-sight position is in the region where the character data is superimposed, the speaker processing unit 5B, in step 11p, searches the database 40 on the basis of the line-of-sight position ID as in the person-region case, acquires the corresponding voice/sound file name, and outputs the voice/sound file indicated by that name from the loudspeaker 10 under the control of step 11q. In addition, in step 11r it searches the database 40 on the basis of the line-of-sight position ID and acquires the name of the file that describes the air pressure parameters, such as the magnitude of the discharge pressure and the discharge time. On the basis of the parameters written in the acquired file, it sets the air pressure and the discharge time in steps 11s and 11t, respectively, and in step 11u it causes the air pressure generation means 42 to discharge air toward the talker who is speaking in accordance with these settings.

If the line-of-sight position is in neither the person region nor the character data superimposition region, the speaker processing unit 5B acquires the air pressure parameter setting file name from the database 40 on the basis of the line-of-sight position ID in step 11v. In accordance with the contents of the acquired file, it sets the air pressure and the discharge time in steps 11w and 11x, respectively, and in step 11y it causes the air pressure generation means 42 to discharge air toward the talker who is speaking in accordance with these settings.
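The three cases above amount to a small dispatch table from the gaze region to the output channels used. The sketch below shows that selection only; the region labels and channel names are hypothetical, and the actual playback and air discharge are left out.

```python
from typing import List


def presentation_channels(region: str) -> List[str]:
    """Choose output channels from the region the speaker is looking at."""
    if region == "person":
        # Steps 11n-11o: voice/sound only.
        return ["sound"]
    if region == "text":
        # Steps 11p-11u: voice/sound plus air discharge.
        return ["sound", "air"]
    # Steps 11v-11y: any other region, air discharge only.
    return ["air"]
```

For example, `presentation_channels("text")` yields both channels, matching the case in which the speaker is already reading the superimposed character data.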

As described above, in the second embodiment, when a responder's emotion information is presented, the line-of-sight position is detected from the speaker's face image, and it is determined where in the received image the detected position lies. If the line-of-sight position is determined to be in the person region, the emotion information is converted into voice/sound information and output; if it is determined to be in the region where the character data is superimposed, the emotion information is converted into voice/sound information and output, and is also converted into an air pressure pattern with which air is discharged. If the line-of-sight position is in neither the person region nor the character data superimposition region, the emotion information is converted into an air pressure pattern and air is discharged.
The responder's emotion information is thus presented by the presentation method best suited to where the speaker is looking, so that the speaker can grasp the responder's feelings about the utterance even more accurately while carrying on the video conference communication.

(Third Embodiment)
FIG. 12 is a block diagram showing the configuration of the main part of a communication terminal according to the third embodiment of the present invention. In the figure, parts identical to those in FIGS. 1 to 3 and FIG. 9 are given the same reference numerals, and detailed description of them is omitted.
The speaker processing unit 5C of the communication terminal according to the third embodiment includes a processing unit that converts emotion information into character data and presents it, a processing unit that presents the number of responders for each emotion, and a processing unit that presents the emotion held by the largest number of responders.

As in FIGS. 3 and 9, the processing unit that converts emotion information into character data and presents it includes: reception result storage means 24 that stores the person identification IDs and emotion identification IDs received from a plurality of responder terminals; database search means 22 that searches the database 22 for character data representing the corresponding emotion on the basis of each received ID; search result storage means 21 that stores the retrieved character data; and superimposing means 20 that superimposes the retrieved character data on the received face information of the responder.

The processing unit that presents the number of responders for each emotion includes video storage means 49, region detection means 50, per-emotion headcount output means 51, and per-emotion headcount calculation means 52. The video storage means 49 stores the face image data captured by the camera 7. The region detection means 50 detects the region of the character information display window in which character data is displayed. The per-emotion headcount calculation means 52 counts the number of responders for each emotion on the basis of the emotion identification IDs sent from the responders' communication terminals. The per-emotion headcount display means 51 displays the counted numbers on the monitor 8.

The processing unit that presents the emotion held by the largest number of responders includes determination means 53, voice/sound file selection means 37, voice/sound output means 38, air pressure setting file selection means 54, air pressure control means 41, and air pressure generation means 42. The determination means 53 determines, from the per-emotion responder counts, the emotion with the largest count. On the basis of that determination, the voice/sound file selection means 37 and the voice/sound output means 38 present voice/sound data corresponding to the most common emotion to the speaker, and the air pressure setting file selection means 54, the air pressure control means 41, and the air pressure generation means 42 present the most common emotion to the speaker by air pressure.

Next, the emotion information presentation processing performed by the speaker processing unit 5C of the communication terminal configured as above will be described. FIG. 13 is a flowchart showing its procedure and content.
When person identification IDs and emotion identification IDs arrive from a plurality of responder-side communication terminals in response to the speaker's utterance, they are received by the communication processing unit 6, taken into the speaker processing unit 5C, and temporarily stored in the reception result storage means 24.

Once the person identification ID and the emotion identification ID have been received and stored, the speaker processing unit 5C searches the database 22 on the basis of these IDs in step 13a. It selects the corresponding talker from among the participants on the basis of the person identification ID, selects, on the basis of the emotion identification ID, the name of the text file representing that talker's emotion information, and acquires the file. It also uses the person identification ID to identify the corresponding talker's face image displayed on the monitor 8, and the superimposing means 20 superimposes the character data in the text file on that face image. The responder's face image with the character data superimposed is supplied to the monitor 8 via the video control unit 11 and displayed.

Next, under the control of step 13c, the speaker processing unit 5C uses the per-emotion headcount calculation means 52 to compute the number of responders for each emotion on the basis of the emotion identification IDs. At the same time, it detects the speaker's line-of-sight position on the monitor screen in step 13c and determines in step 13d whether the detected position is within the character information display window region. If so, the processing proceeds to step 13e, where the per-emotion headcount display means 51 displays on the monitor 8 the per-emotion responder counts computed by the per-emotion headcount calculation means 52. If the speaker's line-of-sight position is determined not to be within the character information display window region, the detection and determination of steps 13c and 13d are repeated.

FIG. 14 shows a display example of the per-emotion responder counts. As shown in the figure, a character information display window region 55 is provided on the main screen 32, and this region displays the number of responders for each emotion, that is, the numbers of responders who are “in favor”, “against”, and “undecided”.

Furthermore, in step 13f the speaker processing unit 5C selects the emotion with the largest number of responders on the basis of the per-emotion counts computed by the per-emotion headcount calculation means 52. It then acquires from the database 22 the voice/sound file and the air pressure parameter setting file corresponding to the selected emotion. In step 13g it outputs the acquired voice/sound file from the loudspeaker 10, and in step 13h it causes the air pressure generation means 42 to discharge air in accordance with the discharge force parameter and the discharge time parameter written in the acquired air pressure parameter setting file. The speaker can thereby grasp at once the overall feeling of all the responders toward the utterance.
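The counting and selection steps can be sketched with Python's standard `collections.Counter`. The message layout and emotion labels below are assumptions for the example, and the patent does not say how a tie between emotions is resolved; here the emotion seen first among the tied ones wins, which is simply how `Counter.most_common` orders equal counts.

```python
from collections import Counter
from typing import Iterable, Optional


def tally_emotions(messages: Iterable[dict]) -> Counter:
    """Count responders per emotion from received {'person_id', 'emotion_id'} messages."""
    return Counter(message["emotion_id"] for message in messages)


def prevailing_emotion(counts: Counter) -> Optional[str]:
    """Return the emotion with the largest count, or None if nothing was received."""
    if not counts:
        return None
    # most_common(1) returns a one-element list of (emotion_id, count).
    return counts.most_common(1)[0][0]


received = [
    {"person_id": "P01", "emotion_id": "agree"},
    {"person_id": "P02", "emotion_id": "deny"},
    {"person_id": "P03", "emotion_id": "agree"},
]
counts = tally_emotions(received)
```

With the sample messages above, `counts` holds two responders for `"agree"` and one for `"deny"`, so `prevailing_emotion(counts)` selects `"agree"` as the emotion to present by sound and air pressure.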

As described above, in the third embodiment, when responders' emotion information is presented, the number of responders for each emotion is computed and displayed in the character information display window region 55 of the monitor 8. In addition, the emotion with the largest number of responders is selected on the basis of these counts and presented to the speaker using synthesized voice and air pressure.
During video conference communication, the speaker can therefore grasp each responder's emotion individually from the character data and, beyond that, grasp at a glance the number of responders for each emotion and the emotion with the largest count, that is, the prevailing feeling among the responders.

(Other Embodiments)
When character data representing the responders' emotions is displayed, the speaker's line-of-sight position may be detected, and the character data representing the emotion of the responder at the detected position may be displayed larger than the character data for the other responders. This lets the speaker see more clearly the character data representing the emotion of the responder being watched.

Usable methods for transmitting emotion information include transmitting it as part of the control data, separately from the video and audio data, and multiplexing character data representing the emotion information into common packets together with the video and audio data. In the latter case, a multiplexing scheme used for multimedia multiplex transmission, such as the one specified in ITU-T H.223, can be applied.

The above embodiments were described for video conference communication among three or more talkers, but the invention is also applicable to videophone communication between two parties and to voice-only communication between two parties. The communication terminal used is not limited to a wired communication terminal such as a personal computer; a mobile communication terminal such as a PDA (Personal Digital Assistant) or a mobile phone may also be used.

In addition, the types of emotion information and the methods of detecting them, the presentation forms, the emotion information transmission method, the types and configurations of the communication terminals and of the communication network, the communication protocol, and so on can be modified in various ways without departing from the gist of the invention.
In short, the invention is not limited to the above embodiments as they stand; at the implementation stage, the constituent elements can be modified without departing from the gist of the invention. Various inventions can also be formed by suitably combining the constituent elements disclosed in the embodiments. For example, some constituent elements may be deleted from all those shown in an embodiment, and constituent elements of different embodiments may be combined as appropriate.

FIG. 1 is a block diagram showing the configuration of a communication system and a communication terminal according to the first embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of the responder processing unit of the communication terminal shown in FIG. 1 and its peripheral parts.
FIG. 3 is a block diagram showing the configuration of the speaker processing unit of the communication terminal shown in FIG. 1 and its peripheral parts.
FIG. 4 is a diagram showing the structure of the database used by the speaker processing unit shown in FIG. 3.
FIG. 5 is a flowchart showing the procedure and content of the emotion information transmission and reception processing performed by the communication terminal shown in FIG. 1.
FIG. 6 is a flowchart showing the procedure and content of the emotion information estimation processing within the transmission and reception processing shown in FIG. 5.
FIG. 7 is a flowchart showing the procedure and content of the emotion information reception and presentation processing within the transmission and reception processing shown in FIG. 5.
FIG. 8 is a diagram showing an example of a display result of emotion information.
FIG. 9 is a block diagram showing the configuration of the main part of a communication terminal according to the second embodiment of the present invention.
FIG. 10 is a diagram showing the structure of the database used by the speaker processing unit in the communication terminal shown in FIG. 9.
FIG. 11 is a flowchart showing the procedure and content of the emotion information reception and presentation processing performed by the communication terminal shown in FIG. 9.
FIG. 12 is a block diagram showing the configuration of the main part of a communication terminal according to the third embodiment of the present invention.
FIG. 13 is a flowchart showing the procedure and content of the emotion information presentation processing performed by the communication terminal shown in FIG. 12.
FIG. 14 is a diagram showing an example of a display result of emotion information produced by the presentation processing shown in FIG. 13.

Explanation of symbols

TM1 to TMn ... communication terminals, NW ... communication network, 1 ... gaze detection unit, 2 ... head motion detection unit, 3 ... speaker detection unit, 4 ... responder processing unit, 5A, 5B, 5C ... speaker processing units, 6 ... communication processing unit, 7 ... camera, 8 ... monitor, 9 ... microphone, 10 ... loudspeaker, 11 ... video control unit, 12 ... audio control unit, 13 ... multiplexing/demultiplexing unit, 14 ... image storage means, 15 ... gaze detection means, 16 ... head posture detection means, 17 ... head movement direction detection means, 20 ... superimposition means, 21 ... search result storage means, 22 ... database (DB), 23 ... database search means, 24 ... reception result storage means, 25 ... person identification table, 26 ... person identification ID, 27 ... person identification ID, 28 ... emotion identification ID, 29 ... emotion identification table, 30 ... text file name, 31 ... emotion information presentation screen, 32 to 35 ... superimposed screens, 36 ... presentation method discrimination means, 37 ... voice/sound file selection means, 38 ... voice/sound output means, 39 ... database search means, 40 ... database (DB2), 41 ... air pressure control means, 42 ... air pressure generation means, 43 ... superimposed image storage means, 44 ... feature region extraction means, 45 ... gaze position ID, 46 ... gaze position table, 47 ... voice/sound file name, 48 ... air pressure parameter setting file name, 49 ... video storage means, 50 ... region detection means, 51 ... per-emotion headcount display means, 52 ... per-emotion headcount calculation means, 53 ... determination means, 54 ... air pressure setting file selection means, 55 ... character information display window.

Claims

An information transmission method for use when telephone communication using audio and video is performed via a communication network between a first communication terminal and a second communication terminal, the method comprising:
a step of measuring, in the first communication terminal, the movement of the speaker's head;
a step of measuring, in the first communication terminal, the position of the speaker's line of sight;
a step of estimating, in the first communication terminal, the emotion of the speaker based on the measurement result of the head movement and the measurement result of the position of the line of sight;
a step of transmitting information representing the estimated emotion from the first communication terminal to the second communication terminal; and
a step of presenting, in the second communication terminal, the transmitted information representing the emotion to a speaker at the second communication terminal.

The information transmission method according to claim 1, wherein the step of transmitting the information representing the emotion converts the information representing the estimated emotion into an identification code and transmits the converted identification code from the first communication terminal to the second communication terminal, and
the step of presenting the transmitted information representing the emotion to the speaker generates presentation information representing the emotion based on the transmitted identification code and outputs the generated presentation information.
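
The identification-code exchange described in this claim can be illustrated with a minimal sketch. The emotion labels, code values, two-byte message layout, and function names below are hypothetical and are not taken from the specification; only the idea of sending a compact code (together with a person identification ID) and regenerating presentation information at the receiving terminal comes from the document.

```python
from enum import IntEnum


class EmotionId(IntEnum):
    """Hypothetical emotion identification codes (values are assumptions)."""
    AGREE = 1
    DISAGREE = 2
    UNSURE = 3


# Hypothetical receiver-side table: identification code -> character data.
PRESENTATION_TEXT = {
    EmotionId.AGREE: "agrees",
    EmotionId.DISAGREE: "disagrees",
    EmotionId.UNSURE: "is unsure",
}


def encode_emotion(person_id: int, emotion: EmotionId) -> bytes:
    """First terminal: pack a person ID (0-255) and an emotion code into two bytes."""
    return bytes([person_id, int(emotion)])


def present_emotion(message: bytes) -> str:
    """Second terminal: rebuild presentation text from the received code."""
    person_id, code = message[0], message[1]
    return f"Participant {person_id} {PRESENTATION_TEXT[EmotionId(code)]}"
```

In this sketch only two bytes cross the network per update, and the wording shown to the receiving speaker is chosen entirely on the receiving side.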

The information transmission method according to claim 2, wherein the step of presenting the transmitted information representing the emotion to the speaker generates, based on the transmitted identification code, presentation information representing the emotion as character data, and displays the presentation information represented by the generated character data superimposed on video information transmitted from the first communication terminal.

The information transmission method according to claim 2, wherein the step of presenting the transmitted information representing the emotion to the speaker generates, based on the transmitted identification code, presentation information representing the emotion as voice or acoustic data, and inserts the presentation information consisting of the generated voice or acoustic data into, or synthesizes it with, voice information transmitted from the first communication terminal.

The information transmission method according to claim 2, wherein the step of presenting the transmitted information representing the emotion to the speaker generates, based on the transmitted identification code, presentation information representing the emotion as air pressure, and outputs the presentation information represented by the generated air pressure.

The information transmission method according to claim 1, further comprising, when there are a plurality of the first communication terminals,
a step of taking, in the second communication terminal, a majority vote of the information representing the emotion transmitted from each of the plurality of first communication terminals, and presenting the result of the majority vote to the speaker at the second communication terminal.
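
The majority vote in this claim can be sketched as follows. This is a minimal illustration, assuming each first terminal reports a single emotion label keyed by a person identification ID; the function name, the label strings, and the choice to return `None` on a tie or when nothing has been received are assumptions, since the claim does not state how ties are resolved.

```python
from collections import Counter
from typing import Dict, Optional


def majority_emotion(received: Dict[int, str]) -> Optional[str]:
    """Return the emotion reported by the most terminals.

    `received` maps a person ID to that person's reported emotion.
    Returns None when nothing was received or when the top two
    emotions are tied (tie handling is an assumption of this sketch).
    """
    if not received:
        return None
    ranked = Counter(received.values()).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```

For example, if three responders report "agree", "agree", and "disagree", the function returns "agree", which the second terminal could then present to the speaker as the overall reaction.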

A communication device that performs telephone communication using audio and video via a communication network with a counterpart communication device, the communication device comprising:
means for measuring the movement of the speaker's head;
means for measuring the position of the speaker's line of sight;
means for estimating the emotion of the speaker based on the measurement result of the movement of the head and the measurement result of the position of the line of sight; and
transmission means for transmitting information representing the estimated emotion to the counterpart communication device.

The communication device according to claim 7, wherein the transmission means converts the information representing the estimated emotion into an identification code and transmits the converted identification code to the counterpart communication device.

A communication device that performs telephone communication using audio and video via a communication network with a counterpart communication device having a function of measuring the movement of a speaker's head and the position of the speaker's line of sight, estimating the emotion of the speaker based on each measurement result, and transmitting the estimated emotion, the communication device comprising:
receiving means for receiving the information representing the emotion transmitted from the counterpart communication device; and
presenting means for presenting the received information representing the emotion to a speaker using the device.

The communication device according to claim 9, wherein, when the counterpart communication device has a function of converting the information representing the estimated emotion into an identification code and transmitting the converted identification code,
the receiving means receives the identification code transmitted from the counterpart communication device, and
the presenting means generates presentation information representing the emotion based on the received identification code and outputs the generated presentation information.

The communication device according to claim 10, wherein the presenting means generates, based on the received identification code, presentation information representing the emotion as character data, and displays the presentation information represented by the generated character data superimposed on the received video information.

The communication device according to claim 10, wherein the presenting means generates, based on the received identification code, presentation information representing the emotion as voice or acoustic data, and outputs the presentation information represented by the generated voice or acoustic data by inserting it into, or synthesizing it with, the received voice information.

The communication device, wherein the presenting means generates, based on the received identification code, presentation information representing the emotion as air pressure, and outputs the presentation information represented by the generated air pressure.

The communication device according to claim 9, wherein, when there are a plurality of counterpart communication devices,
the receiving means receives the information representing the emotion transmitted from each of the plurality of counterpart communication devices, and
the presenting means further has a function of taking a majority vote of the plurality of received pieces of information representing the emotions and presenting the result of the majority vote to the speaker.

A program used in a communication device that includes a computer and performs telephone communication using audio and video via a communication network with a counterpart communication device, the program causing the computer to execute:
a process of detecting the movement of the speaker's head;
a process of detecting the position of the speaker's line of sight;
a process of estimating the emotion of the speaker based on the detection result of the movement of the head and the detection result of the position of the line of sight; and
a process of transmitting information representing the estimated emotion to the counterpart communication device.

The program according to claim 15, wherein the transmitting process causes the computer to execute:
a process of converting the information representing the estimated emotion into an identification code; and
a process of transmitting the converted identification code to the counterpart communication device.

A program used in a communication device that includes a computer and performs telephone communication using audio and video via a communication network with a counterpart communication device having a function of measuring the movement of a speaker's head and the position of the speaker's line of sight, estimating the emotion of the speaker based on each measurement result, and transmitting the estimated emotion, the program causing the computer to execute:
a process of receiving the information representing the emotion transmitted from the counterpart communication device; and
a process of presenting the received information representing the emotion to a speaker who uses the device.

The program according to claim 17, wherein, when the counterpart communication device has a function of converting the information representing the estimated emotion into an identification code and transmitting the converted identification code,
the receiving process receives the identification code transmitted from the counterpart communication device, and
the presenting process generates presentation information representing the emotion based on the received identification code and outputs the generated presentation information.

The program according to claim 17, wherein, when there are a plurality of counterpart communication devices,
the receiving process receives the information representing the emotion transmitted from each of the plurality of counterpart communication devices, and
the presenting process further includes a processing function of taking a majority vote of the plurality of received pieces of information representing the emotions and presenting the result of the majority vote to the speaker.