JP4604173B2

JP4604173B2 - Remote conference / education system

Info

Publication number: JP4604173B2
Application number: JP2005076012A
Authority: JP
Inventors: 潔野須; 智哉黒川; 憲志大上; 経之鈴木; 旭奥田
Original assignee: Tokai University Educational Systems
Current assignee: Tokai University Educational Systems
Priority date: 2005-03-16
Filing date: 2005-03-16
Publication date: 2010-12-22
Anticipated expiration: 2025-03-16
Also published as: JP2006262010A

Description

The present invention provides a remote conference / education system that enables people working in different time zones at remote locations, in particular people working in different contexts such as different cultures or business fields, to carry out collaborative work, meetings, discussions, and decision-making smoothly.

With the arrival of the ubiquitous society, it is becoming possible to communicate at any time and from anywhere. There is now a demand to realize collaborative work, meetings, and lectures economically through communication in which the participants share neither place nor time.
For example, companies operating globally need a means for employees who work in different time zones, at offices in remote locations separated by time differences, to collaborate, hold meetings, discuss, and make decisions smoothly. In global companies in particular, many employees work in different contexts, differing not only in time zone but also in business field and cultural background. Along with the information obtained from language and video, understanding the true intent of the conversation partner and estimating the partner's psychological state are important for smooth collaboration, meetings, discussions, and rapid decision-making, but this is difficult to achieve economically with current technology.

Collaborative work, meetings, and lectures are conducted between remote locations using digital audiovisual technology and Internet communication technology. However, the information received between remote locations is limited by the functions and performance of audiovisual input/output devices. There have been attempts to achieve realistic communication between remote locations using high-quality audiovisual input/output devices and broadband communication lines, but these can be difficult to realize because of equipment and operating costs. With current technology, even in a synchronous remote conference between locations with no time difference, it is often difficult to share a setting close to face-to-face communication.

Among these problems, one proposed means of sharing knowledge among people who do not share time is JP-A-10-214022, "Group learning teaching material organizing method and system". This proposal provides a method and system for organizing group learning materials that can stimulate interaction between learners by having peers discuss with and teach one another. Based on data obtained from the cooperative behavior of multiple learners with a common learning target, it edits group learning material that records the learning history of preceding learners and displays to later learners the procedure for reaching the target state. While other learners share and use the organized group learning material, it displays information about the learning target, takes in the learners' notes and memos, and updates the group learning material.

However, with this method, writing in a notebook shared by everyone is cumbersome for busy businesspeople. Even if video and audio information, which are easy to input, could be stored and exchanged, asynchronous communication without real-time two-way exchange fragments the context of the dialogue. It is therefore difficult to understand the true intent of a remote conversation partner, especially one working in a different culture or field, and to estimate that partner's psychological state from the stored and exchanged video and audio information alone.

On the other hand, methods for estimating the psychological state of users of electronic media have also been proposed. For example, JP-A-5-12023, "Emotion recognition device", detects the amount of deviation between a reference voice signal and a feature signal and judges the emotional state based on that deviation.
FIG. 1 is an example of data showing, as statistical confidence intervals (95% confidence level), the relationship between the rates of change of the maximum power level of a remote conference user's voice spectrum and of its frequency, and psychological states such as "happy", "angry", "sad", "convinced", and "confused". As these data show, the variation in the voice data is large. It is therefore difficult to judge the psychological state of a remote conference user appropriately from voice data alone.

Another proposal is JP-A-9-56703, "Emotion detection device, heartbeat data transmitter, and emotion display device". It computes the degree of change in heart rate from the heartbeat signal output by a heartbeat sensor, aiming to detect the degree of tension or excitement of the other party, or of both parties, in a conversation accurately and quantitatively on the spot without spoiling the atmosphere, and to display the detection result. FIG. 2 is an example of data showing, as statistical confidence intervals (95% confidence level), the relationship between the rates of change of an electronic media user's pulse rate and skin temperature and psychological states such as "amused", "surprised", "convinced", and "confused". FIG. 2 shows that even when skin temperature information is combined with heart rate or pulse rate information, the psychological state of the electronic media user cannot be identified in detail.

Another proposal that uses biological information is JP-A-7-204168, "Biological information automatic identification device". It applies a Fourier transform to the digital output signal of an electroencephalogram sensor to obtain the spectral power of each of several sub-bands within a desired frequency band, aiming to determine multiple psychological states, or the content of the intellectual work, of the subject wearing the sensor. Table 1 is an example of data showing the relationship between the electroencephalogram components of electronic media users and psychological states such as "amused", "surprised", "convinced", and "confused". Table 1 shows that it is difficult to judge the psychological state appropriately from increases and decreases in α, β, δ, and θ waves alone.

JP-A-10-214022; JP-A-5-12023; JP-A-9-56703; JP-A-7-204168

For an asynchronous remote conference / education system in which information is exchanged between people who do not share time, the present invention provides a system that can display to a remote conversation partner, in addition to video and audio information, diverse psychological state estimation information extracted from data on the user's voice, facial expressions and movements, and biological information.

The invention according to claim 1, which solves the above problem, is a remote conference / education system in which a server and a plurality of terminals can be connected via a network. Each terminal comprises: audio information input/output means for inputting and outputting audio information; video information input/output means for inputting and outputting video information; biological information input means for inputting biological information; data extraction means for extracting voice, facial expression/movement, and biological information data in real time; psychological state estimation means for estimating the user's psychological state in real time from the extracted voice, facial expression/movement, and biological information data; transmission means for transmitting psychological state display data, together with the user's own audiovisual information, to the server in real time, so that the server temporarily stores the audiovisual information and the psychological state display data and the communication partner can retrieve them through the server at any time; reception means for receiving from the server, via the network, the stored audiovisual information and psychological state display data of the conversation partner's terminal; and psychological state display artificial structure display means for displaying a psychological state display artificial structure of the conversation partner from the psychological state display data obtained through the reception means.
The invention is characterized in that the psychological state estimation means, before a dialogue between remote locations begins, measures in advance for each individual the measurement reference values corresponding to each psychological state from voice, facial expressions/movements, and biological information, and stores them in the psychological state estimation circuit of the terminal; estimates a "heightened" or "non-heightened" psychological state from changes in the maximum voice level, and in the frequency of that maximum level, of the voice input through the audio and video information input/output means; calculates as a coincidence index, from the relative movement of multiple feature points of the face image input through the audio and video information input/output means, the distance from the mean point of the psychological state determination interval used to estimate the states "surprised", "convinced", "amused", and "confused"; calculates the same kind of coincidence index from the relative movement of the two points at the left and right ends of the lips, the two points at the left and right ends of the right eyebrow, and the two points at the left and right ends of the left eyebrow in the user's face image; calculates the same kind of coincidence index from the changes in the user's pulse rate, skin temperature, and respiration rate; sums the coincidence indices calculated for each psychological state; and estimates the state with the largest coincidence index as the psychological state at that moment.

The invention according to claim 2, which solves the above problem, is the remote conference / education system according to claim 1, characterized in that the amount of change of the multiple feature points of the face image is calculated at intervals of 1 to 3 seconds.

The invention according to claim 3, which solves the above problem, is the remote conference / education system according to claim 1, characterized in that the amount of change of the biological information is calculated at intervals of 2 to 10 seconds.

The present invention described above enables people working in different time zones at remote locations, especially people from different cultures and business fields, to carry out collaborative work, meetings, discussions, and decision-making smoothly and quickly.

The remote conference / education system according to the present invention is described in detail below with reference to the drawings.
FIG. 3 shows the configuration of an embodiment of an asynchronous remote conference, in which a conference is held between remote locations without sharing time. Reference numeral 1 denotes headphones, 2 a microphone, 3 an image display device, 4 a camera, 5 a body-worn biosensor, 7 a communication network, 9 a storage-type remote conference information processing device, 300 and 300' communication terminals, and 400 an audiovisual information storage server. Each of the communication terminals 300 and 300' consists of the headphones 1, the microphone 2, the image display device 3, the camera 4, the body-worn biosensor 5, and the storage-type remote conference information processing device 9.

In this configuration, a user of the remote conference system does not talk with the other party directly over a communication line while sharing time. Instead, the user first stores video information, audio information, and psychological state display data on the audiovisual information storage server 400, and the other party downloads them at a convenient time. Likewise, when information arrives from someone else, the user accesses the audiovisual information storage server 400 at a convenient time and downloads and views the sender's stored video information, audio information, and psychological state display data.
A user of the communication terminal 300 who learns that there is information addressed to them downloads the information from the conversation partner at the remote communication terminal 300' that has been stored on the audiovisual information storage server 400. The audio information is delivered through the headphones 1, and the partner's video information and psychological state display data are presented on the image display device 3.

When the user has a message to send to someone, the audio spoken to the conversation partner into the microphone 2 is sent to the audiovisual information storage server 400 with a destination label attached. At the same time, the storage-type remote conference information processing device 9 uses it to estimate the user's psychological state by voice analysis.
The camera 4 captures video of the remote conference user. The video information is sent to the remote audiovisual information storage server 400 with a destination label attached. At the same time, the storage-type remote conference information processing device 9 uses it to estimate the user's psychological state by face image analysis. The body-worn biosensor 5 detects the pulse rate, skin temperature, and respiration rate. The storage-type remote conference information processing device 9 uses these data to estimate the user's psychological state by biological information analysis. The device estimates the user's psychological state in real time by analyzing the biological information, face images, and audio information, and sends the resulting data, together with the user's video and audio and with a destination label attached, to the audiovisual information storage server 400.

When the audiovisual information storage server 400 receives a remote conference system user's video, audio, and psychological state display data, it stores them and notifies the destination that a message has arrived.
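The store-then-notify behavior of the server can be illustrated with a minimal in-memory sketch. Everything here is an assumption made for illustration: the class and field names (`StorageServer`, `StoredMessage`, `deposit`, `download`) are hypothetical, the notification is modeled as a simple list, and clearing a mailbox after download is one possible reading of "temporary" storage that the description does not specify.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class StoredMessage:
    """One deposited message; the field names are hypothetical."""
    sender: str
    video: bytes
    audio: bytes
    state_display: str  # e.g. an identifier for the estimated state


class StorageServer:
    """In-memory stand-in for the audiovisual information storage server 400."""

    def __init__(self):
        self._mailboxes = defaultdict(list)  # destination label -> messages
        self.notifications = []              # (destination, sender) pairs

    def deposit(self, destination: str, message: StoredMessage) -> None:
        # Store the message under its destination label, then record an
        # arrival notice for that destination.
        self._mailboxes[destination].append(message)
        self.notifications.append((destination, message.sender))

    def download(self, destination: str) -> list:
        # Hand over everything waiting for this destination and clear it
        # (assumption: messages are removed once retrieved).
        return self._mailboxes.pop(destination, [])
```

A receiving terminal would call `download` with its own label whenever convenient, which is what decouples the sender's and the receiver's schedules.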

A user of the communication terminal 300 who has been notified by the audiovisual information storage server 400 that a message has arrived accesses the server at a convenient time and downloads the sender's stored video information, audio information, and psychological state display data. The partner's psychological state display data are processed by the storage-type remote conference information processing device 9 and displayed on the image display device 3 together with the video information. The partner's audio information is processed by the remote conference information processing device 9 and output to the headphones 1.

With this configuration, a psychological state judged comprehensively from the asynchronous remote conference user's voice, facial expression/movement, and biological information data can be displayed to a remote conversation partner, making it possible to convey subtle changes of mind to the partner across time and space.

FIG. 4 shows the configuration of the remote conference information processing device 9 in the embodiment of the asynchronous remote conference. Reference numeral 21 denotes an audio output interface circuit, 22 an audio input interface circuit, 23 a video output interface circuit, 24 a video input interface circuit, 25 a biological information input interface circuit, 26 an audio data extraction circuit, 27 a video data extraction circuit, 28 a biological information extraction circuit, 29 a psychological state estimation circuit, 30 a psychological state data generation circuit, 31 a video/audio multiplexing circuit, 32 a transmission data multiplexing circuit, 33 a transmission circuit, 34 a reception circuit, 35 a received data separation circuit, 36 a video/audio separation circuit, and 37 a conference video / psychological state data synthesis circuit.

The audio output interface circuit 21 is connected to the headphones 1 of FIG. 3. The audio input interface circuit 22 is connected to the microphone 2. The video output interface circuit 23 is connected to the image display device 3. The video input interface circuit 24 is connected to the camera 4. The biological information input interface circuit 25 is connected to the body-worn biosensor 5.

The audio data extraction circuit 26 extracts, from the signal of the audio input interface circuit 22, the maximum power level of the voice spectrum and its frequency as the voice data used to estimate the psychological state.
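As a rough illustration of extracting the maximum spectral power and its frequency, the following sketch applies a naive discrete Fourier transform to one frame of samples. It is not the circuit's actual method, which the description does not detail: the function name, the power normalization, and the choice to skip the DC bin are all assumptions, and a real implementation would use an FFT and windowing.

```python
import cmath
import math


def peak_power_and_frequency(samples, sample_rate):
    """Return (max power, frequency in Hz) of the strongest spectral bin."""
    n = len(samples)
    best_power, best_bin = 0.0, 0
    # Only bins 1 .. n/2 are examined: bin 0 is the DC offset and the
    # upper half mirrors the lower half for real-valued input.
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        power = abs(coeff) ** 2 / n  # assumed normalization
        if power > best_power:
            best_power, best_bin = power, k
    return best_power, best_bin * sample_rate / n
```

Tracking how these two values change from frame to frame would give the rates of change that the later figures relate to the psychological state.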

The video data extraction circuit 27 derives, from the video signal of the video input interface circuit 24, the relative movement of multiple feature points of the face image, from which the psychological state is estimated. Specifically, it extracts the relative movement of the two points at the left and right ends of the lips, the two points at the left and right ends of the right eyebrow, and the two points at the left and right ends of the left eyebrow. The amount of change of these feature points is calculated at intervals of 1 to 3 seconds.

The biological information extraction circuit 28 extracts the changes in pulse rate, skin temperature, and respiration rate from the signal of the biological information input interface circuit 25. The amount of change of this biological information is calculated at intervals of 2 to 10 seconds.
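Computing a change amount at a fixed interval can be sketched as follows. This assumes uniformly sampled readings and takes the change to be a simple difference between readings one interval apart; the function and parameter names are hypothetical, and the description does not say whether the circuit uses differences, ratios, or averaged windows.

```python
def changes_at_interval(samples, sample_period_s, interval_s):
    """Return differences between readings taken interval_s seconds apart."""
    # Number of raw samples that make up one calculation interval.
    step = max(1, round(interval_s / sample_period_s))
    picked = samples[::step]
    return [later - earlier for earlier, later in zip(picked, picked[1:])]
```

With one pulse reading per second and a 2-second interval, for example, every second reading is compared with the one before it.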

The psychological state estimation circuit 29 estimates in real time the user's psychological state, such as "surprised", "convinced", "amused", or "confused", based on the voice data from the audio data extraction circuit 26, the face image data from the video data extraction circuit 27, and the biological data from the biological information extraction circuit 28.
Based on the estimation result output by the psychological state estimation circuit 29, the psychological state data generation circuit 30 generates psychological state display artificial structure data, such as an avatar or a two- or three-dimensional animation, that lets other people grasp the user's estimated psychological state at a glance.
The digital output signals of the audio input interface circuit 22 and the video input interface circuit 24 are multiplexed by the video/audio multiplexing circuit 31.

The outputs of the video/audio multiplexing circuit 31 and the psychological state data generation circuit 30 are multiplexed by the transmission data multiplexing circuit 32 and sent to the network by the transmission circuit 33.
Data arriving from the network are received by the reception circuit 34 and separated by the received data separation circuit 35 into the conversation partner's psychological state display artificial structure data and the partner's multiplexed video/audio data.
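The multiplexing on the sending side and the separation on the receiving side can be sketched with a simple length-prefixed framing. The description does not specify any packet format, so the 4-byte big-endian length prefix and the function names below are purely illustrative assumptions.

```python
import struct


def multiplex(av_data: bytes, state_data: bytes) -> bytes:
    # Prefix each part with its length as a 4-byte big-endian integer
    # (assumed framing, not taken from the description).
    return (struct.pack(">I", len(av_data)) + av_data
            + struct.pack(">I", len(state_data)) + state_data)


def demultiplex(packet: bytes):
    # Read the two length-prefixed parts back in the same order.
    parts, offset = [], 0
    for _ in range(2):
        (length,) = struct.unpack_from(">I", packet, offset)
        offset += 4
        parts.append(packet[offset:offset + length])
        offset += length
    return tuple(parts)
```

The first returned part plays the role of the multiplexed video/audio data and the second that of the psychological state display data.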

The video/audio separation circuit 36 separates the multiplexed video/audio data into video data and audio data. The separated audio data are sent to the audio output interface circuit 21 and converted into analog information.

The conference video / psychological state data synthesis circuit 37 combines the separated video data with the psychological state display artificial structure data; the result is sent to the video output interface circuit 23 and converted into analog information.
The types of user psychological state to be estimated are set in advance to four: "surprised", "convinced", "amused", and "confused". Judgment reference values corresponding to each of these states are stored for each item of biological information and face image data. For each item of voice, facial expression/movement, and biological information data extracted at each point in time, a coincidence index is calculated from the difference from the judgment reference value of each psychological state. The coincidence indices of the biological information and facial expression/movement data are summed for each psychological state, and the state with the largest coincidence index is estimated as the psychological state at that moment.

Because detailed psychological states are difficult to distinguish from voice data, voice data are used only to distinguish the "heightened" state from the "non-heightened" state. These data can be added, for example, when an "angry" state needs to be added to the four states above.
The judgment reference values are defined in terms of the rate of change of the measured data. Before a dialogue between remote locations begins, the measurement reference values corresponding to each psychological state are measured and stored for each individual and for each item of voice, facial expression/movement, and biological information data.

FIG. 5 shows the processing algorithm of the psychological state estimation circuit 29. Judgment reference values corresponding to the four psychological states "surprised", "convinced", "amused", and "confused", expressed as rates of change of the measurement, are stored for each of face image data 1 (right lip position), face image data 2 (left lip position), face image data 3 (right inner eyebrow position), face image data 4 (left inner eyebrow position), face image data 5 (right outer eyebrow position), face image data 6 (left outer eyebrow position), biological data 1 (respiration rate), biological data 2 (heart rate), and biological data 3 (skin temperature). A coincidence index for each psychological state is calculated from the measured data at a given point in time.

The coincidence indices corresponding to the four psychological states, calculated for each of the nine types of measured data, are summed for each of the states "surprised", "convinced", "amused", and "confused". The state with the largest summed coincidence index is judged to be the psychological state.
The "heightened" / "non-heightened" judgment is made from the voice data. These estimates are sent to the psychological state data generation circuit 30, which generates psychological state display artificial structure data, such as an avatar or a two- or three-dimensional animation, that lets other people grasp the user's estimated psychological state at a glance.
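The sum-and-select step can be sketched as below. The description says only that a coincidence index is derived from the distance to each state's reference point and that the largest total wins; it does not give the formula. The mapping `1 / (1 + distance)` used here is therefore an assumption chosen so that a closer match yields a larger index, and the function names and data layout are hypothetical.

```python
import math

STATES = ("surprised", "convinced", "amused", "confused")


def coincidence_index(measured, reference):
    # Assumed mapping: a smaller distance to the reference point gives a
    # larger index. The patent does not state the actual formula.
    return 1.0 / (1.0 + math.dist(measured, reference))


def estimate_state(measurements, references):
    """Sum per-channel indices for each state and pick the largest total.

    measurements: {channel: point}
    references:   {channel: {state: point}} from per-user calibration
    """
    totals = {state: 0.0 for state in STATES}
    for channel, measured in measurements.items():
        for state in STATES:
            totals[state] += coincidence_index(
                measured, references[channel][state]
            )
    return max(totals, key=totals.get), totals
```

In the arrangement of FIG. 5 there would be nine channels (six face points and three biological signals); because each channel contributes additively, a channel that separates two states poorly can be outweighed by channels that separate them well.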

Specific examples of the judgment reference values built into the algorithm of FIG. 5 are described below for voice, face images, and biological information.
FIG. 6 shows the relationship between the rates of change of the maximum voice level and of the frequency of that maximum level, and the psychological state. It is difficult to distinguish fine-grained psychological states from the rate of change of the maximum voice power level and its frequency. Only a heightened state such as anger can be separated from the other (non-heightened) states. If the measured data fall within one of the rectangular frames, the state is judged to be "heightened" or "non-heightened" accordingly.
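The rectangular-frame test can be sketched as a point-in-box check on the two rates of change. The box bounds would come from each user's calibration; the `Box` type, the function name, and the decision to return `None` when the point lies outside both frames are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned frame in the (level change, frequency change) plane."""
    x_min: float
    x_max: float
    y_min: float
    y_max: float

    def contains(self, x, y):
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def classify_voice(level_change, freq_change, heightened_box, non_heightened_box):
    # The heightened frame is checked first, so it wins if the frames overlap.
    if heightened_box.contains(level_change, freq_change):
        return "heightened"
    if non_heightened_box.contains(level_change, freq_change):
        return "non-heightened"
    return None  # outside both frames: no decision (assumed behavior)
```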

In the face image analysis, the relative coordinate system 40 is recalculated with the line from the top of the head to the chin as reference line 41. This makes it possible to compute almost exclusively the movement of the measurement points, unaffected by head shake or by the proportion of the frame occupied by the face, and the psychological state is estimated from the relative movement of multiple feature points of the face image. Specifically, the psychological state is estimated from the relative movement of the two points at the left and right ends of the lips, the two points at the left and right ends of the right eyebrow, and the two points at the left and right ends of the left eyebrow.
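One way to build such a face-relative coordinate system is sketched below: the y axis is taken along the line from chin to head top, the origin at its midpoint, and all lengths are divided by the length of that line so that the result does not depend on how large the face appears. The origin, the scaling, the axis orientation, and the function name are assumptions; the description states only that the head-top-to-chin line serves as the reference line.

```python
import math


def to_face_coordinates(point, head_top, chin):
    """Express point in a frame whose y axis runs from chin to head top."""
    axis_x = head_top[0] - chin[0]
    axis_y = head_top[1] - chin[1]
    length = math.hypot(axis_x, axis_y)
    if length == 0:
        raise ValueError("head top and chin must be distinct points")
    # Unit vector along the reference line and its perpendicular.
    up = (axis_x / length, axis_y / length)
    right = (up[1], -up[0])
    # Offset from the midpoint of the reference line (assumed origin).
    mid = ((head_top[0] + chin[0]) / 2, (head_top[1] + chin[1]) / 2)
    dx, dy = point[0] - mid[0], point[1] - mid[1]
    # Project onto the face axes and scale by the face length.
    return ((dx * right[0] + dy * right[1]) / length,
            (dx * up[0] + dy * up[1]) / length)
```

Because the projection uses the reference line itself, translating, rotating, or uniformly scaling the whole face leaves the returned coordinates of a feature point unchanged, so only its movement relative to the face remains.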

FIG. 7 shows part of an image from the process of recalculating the relative coordinate system 40 from the face image data, with the line from the top of the head to the chin as reference line 41. The change amounts of the feature points of the face image are calculated at intervals of 1 to 3 seconds.
FIG. 8 shows the image analysis result (position change rate) for the right lip. In the “surprise” state, the changes spread widely downward along the y-axis. In the “interesting” state, the changes tend to be pulled upward along the y-axis. In the “confused” state, by contrast, the changes tend to extend outward along the x-axis. FIG. 9 shows the image analysis result for the left lip, which moves in the same way as the right lip.

FIG. 10 shows the image analysis result (position change rate) for the right inner eyebrow. In the “surprise” and “confused” states, the changes tend to be elongated along the x-axis. The changes in the “convinced” and “interesting” states almost overlap for both the left and right eyebrows, so these states are difficult to distinguish from these measurement points alone.
FIG. 11 shows the image analysis result (position change rate) for the left inner eyebrow, whose movement is almost the same as that of the right inner eyebrow. FIG. 12 shows the image analysis result (position change rate) for the right outer eyebrow. In the “surprise” and “confused” states, the changes tend to be elongated along the x-axis and tend to overlap, which makes the two states hard to tell apart. The “interesting” state shows the smallest change in both the x and y directions. In the “convinced” state, the changes often move substantially along the x-axis.

FIG. 13 shows the image analysis result (position change rate) for the left outer eyebrow, which moves in the same way as the right outer eyebrow.
FIG. 14 shows example biological information data with the averaging time set to 5 seconds, FIG. 15 with the averaging time set to 10 seconds, and FIG. 16 with the averaging time set to 30 seconds. A longer averaging time removes more noise, but it also smooths out abrupt changes in the biological information, so important information is lost. In this embodiment, therefore, the change amounts of the pulse rate, skin temperature, and respiratory rate are extracted at 10-second intervals.
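The trade-off between noise removal and loss of abrupt changes can be illustrated with a simple trailing moving average. The sketch below is an assumption-laden illustration, not the method of the embodiment: the functions `moving_average` and `block_changes`, the 1 Hz sampling rate, and the synthetic pulse trace are all hypothetical.

```python
def moving_average(samples, window):
    """Trailing mean over the last `window` samples (fewer at the start)."""
    out = []
    running = 0.0
    for i, value in enumerate(samples):
        running += value
        if i >= window:
            running -= samples[i - window]
        out.append(running / min(i + 1, window))
    return out


def block_changes(samples, block):
    """Average consecutive blocks of `block` samples, then difference the block means."""
    if block < 1:
        raise ValueError("block must be at least 1")
    means = [sum(samples[i:i + block]) / block
             for i in range(0, len(samples) - block + 1, block)]
    return [b - a for a, b in zip(means, means[1:])]


# Hypothetical pulse trace, one sample per second: 70 bpm with a 5-second surge to 90 bpm.
pulse = [70.0] * 30 + [90.0] * 5 + [70.0] * 30

# Peak of the smoothed trace for averaging times of 5, 10, and 30 seconds.
peaks = {w: max(moving_average(pulse, w)) for w in (5, 10, 30)}
```

With these assumed numbers, the surge keeps its full height under 5-second averaging (peak 90), is halved under 10-second averaging (peak 80), and almost disappears under 30-second averaging (peak about 73.3), which mirrors the reason the text gives for settling on 10-second intervals.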

FIG. 17 shows the psychological state determination criteria for pulse and respiratory rate among the biological information. The “convinced” and “interesting” states can be separated from the “confused” state, but the range of the “surprise” state spreads over the whole chart and cannot be separated from the other states.
FIG. 18 shows the psychological state determination criteria for respiratory rate and skin temperature among the biological information. The “convinced” and “interesting” states can be separated from the “confused” state.
FIG. 19 shows how the coincidence index is calculated for the measurement data of the left outer eyebrow. The index is based on the distance between the measurement point and the average point of each of the four psychological state determination sections: “surprise,” “convinced,” “confused,” and “interesting.” The state at the shortest distance receives a coincidence index of +1, the state at the longest distance receives −1, and the other states receive 0.
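The coincidence index described for FIG. 19, together with the summation step recited in the claims, can be sketched as follows. The function names, the Euclidean distance, and the tie-breaking behavior are assumptions made for illustration; the source specifies only the +1 / 0 / −1 assignment by distance and the choice of the state with the largest total.

```python
import math

STATES = ("surprise", "convinced", "confused", "interesting")


def coincidence_index(point, state_means):
    """Give +1 to the nearest state mean, -1 to the farthest, and 0 to the rest.

    `state_means` maps each state to the average point of its determination
    section. Ties go to the state listed first in STATES (an assumption).
    """
    dist = {s: math.dist(point, state_means[s]) for s in STATES}
    nearest = min(STATES, key=dist.get)
    farthest = max(STATES, key=dist.get)
    index = {s: 0 for s in STATES}
    index[nearest] = 1
    index[farthest] = -1
    return index


def estimate_state(indexes):
    """Sum the per-measurement indexes and return (best state, totals)."""
    totals = {s: sum(idx[s] for idx in indexes) for s in STATES}
    return max(STATES, key=totals.get), totals
```

In use, one index would be computed per measurement (each lip and eyebrow point, each biological signal), each against its own set of calibrated average points, and the list of indexes would then be passed to `estimate_state`. The coarse +1 / 0 / −1 scoring lets measurements on different scales be added without any weighting.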

With the above configuration, the voice, facial expression / motion, and biological information data of a remote conference user are judged together to estimate a variety of user psychological states in real time and display them to the conversation partner at the remote location. This makes it possible to convey subtle changes in a remote participant's state of mind to the other party.

FIG. 1 is a data example showing, as statistical confidence intervals (95% confidence), the relationship between the maximum power level of a remote conference user's voice spectrum and the rate of change of its frequency, and psychological states such as “happy,” “angry,” “sad,” “convinced,” and “confused.”
FIG. 2 is a data example showing, as statistical confidence intervals (95% confidence), the relationship between the rates of change of an electronic media user's pulse rate and skin temperature and psychological states such as “interesting,” “surprise,” “convinced,” and “confused.”
FIG. 3 is a schematic diagram showing the configuration of an embodiment of an asynchronous remote conference, in which a conference is held between remote locations without sharing time.
FIG. 4 is a block diagram showing the configuration of the remote conference information processing apparatus 9 of the embodiment.
FIG. 5 is a diagram for explaining the processing algorithm of the psychological state estimation circuit 29.
FIG. 6 is a chart showing the relationship between the psychological state and the rates of change of the maximum voice level and of the frequency of the maximum level.
FIG. 7 is a schematic diagram for explaining the face image analysis reference line.
FIG. 8 is a chart for explaining the psychological state determination criteria based on the face image (right lip).
FIG. 9 is a chart for explaining the psychological state determination criteria based on the face image (left lip).
FIG. 10 is a chart for explaining the psychological state determination criteria based on the face image (right inner eyebrow).
FIG. 11 is a chart for explaining the psychological state determination criteria based on the face image (left inner eyebrow).
FIG. 12 is a chart for explaining the psychological state determination criteria based on the face image (right outer eyebrow).
FIG. 13 is a chart for explaining the psychological state determination criteria based on the face image (left outer eyebrow).
FIG. 14 is a chart showing the rate of change when the averaging time of the biological information is set to 5 seconds.
FIG. 15 is a chart showing the rate of change when the averaging time of the biological information is set to 10 seconds.
FIG. 16 is a chart showing the rate of change when the averaging time of the biological information is set to 30 seconds.
FIG. 17 is a chart for explaining the psychological state determination criteria based on biological information (pulse and respiratory rate).
FIG. 18 is a chart for explaining the psychological state determination criteria based on biological information (respiratory rate and skin temperature).
FIG. 19 is a chart for explaining the method of calculating the coincidence index (for the left outer eyebrow).

Explanation of symbols

1 Headphone
2 Microphone
3 Image display device
4 Camera
5 Body-mounted biosensor
7 Communication network
9 Storage-type remote conference information processing apparatus
21 Acoustic output interface circuit
22 Acoustic input interface circuit
23 Video output interface circuit
24 Video input interface circuit
25 Biological information input interface circuit
26 Acoustic data extraction circuit
27 Video data extraction circuit
28 Biological information extraction circuit
29 Psychological state estimation circuit
30 Psychological state data generation circuit
31 Video / acoustic multiplexing circuit
32 Transmission data multiplexing circuit
33 Transmission circuit
34 Reception circuit
35 Reception data separation circuit
36 Video / acoustic separation circuit
37 Conference video / psychological state data synthesis circuit
300, 300' Communication terminal
400 Video / audio information storage server

Claims

A remote conference / education system in which a server and a plurality of terminals are connectable via a network, comprising:
voice information input / output means for inputting and outputting voice information;
video information input / output means for inputting and outputting video information;
biological information input means for inputting biological information;
data extracting means for extracting, in real time, the respective data of voice, facial expression / motion, and biological information;
psychological state estimating means for estimating a user's psychological state in real time from the extracted data of voice, facial expression / motion, and biological information;
transmission means for transmitting, in real time, psychological state display data in addition to the user's own audiovisual information to the server, so that the audiovisual information and the psychological state display data are temporarily stored in the server and a communication partner can retrieve them at any time via the server;
receiving means for receiving the stored audiovisual information and psychological state display data of the conversation partner's terminal from the server via the network; and
psychological state display artificial structure display means for displaying a psychological state display artificial structure of the conversation partner from the psychological state display data acquired via the receiving means,
wherein the psychological state estimating means:
before a conversation between remote locations starts, measures in advance, for each individual, reference values corresponding to each psychological state from voice, facial expression / motion, and biological information, and stores them in a psychological state estimation circuit of the terminal;
estimates the “uplifted” or “non-uplifted” psychological state from changes in the maximum voice level value and in the frequency of the maximum level of the voice input through the voice / video information input / output means;
calculates, as a coincidence index, the distance from the average point of the psychological state determination section used to estimate the “surprise,” “convinced,” “interesting,” or “confused” psychological state from the relative movement change amounts of a plurality of feature points of the face image input through the voice / video information input / output means;
calculates, as a coincidence index, the distance from the average point of the psychological state determination section used to estimate the “surprise,” “convinced,” “interesting,” or “confused” psychological state from the relative movement change amounts of the left and right ends of the lips, the left and right ends of the right eyebrow, and the left and right ends of the left eyebrow of the user's face image;
calculates, as a coincidence index, the distance from the average point of the psychological state determination section used to estimate the “surprise,” “convinced,” “interesting,” or “confused” psychological state from the change amounts of the user's biological information, namely pulse rate, skin temperature, and respiratory rate; and
sums the coincidence indexes calculated for each psychological state and estimates the psychological state having the maximum coincidence index as the psychological state at that time.

The remote conference / education system according to claim 1, wherein the change amounts of the plurality of feature points of the face image are calculated at intervals of 1 to 3 seconds.

2. The remote conference / education system according to claim 1, wherein the change amounts of the biological information are calculated at intervals of 2 to 10 seconds.