JP2005328484A - Video conference system, information processing apparatus, information processing method and program

Info

Publication number: JP2005328484A
Application number: JP2004146930A
Authority: JP
Inventors: Jun Ito; 潤伊藤; Mutsuhiro Omori; 睦弘大森; Shigehiro Shimada; 繁広嶌田
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2004-05-17
Filing date: 2004-05-17
Publication date: 2005-11-24

Abstract

PROBLEM TO BE SOLVED: To provide a video conference system that is highly communicative and gives a strong sense of presence. SOLUTION: The video conference system comprises one or more terminal devices installed at different locations and an information processing apparatus connected to the terminal devices via a predetermined network. The system is provided with: a video compositing means for creating a composite video by combining a video of a user, based on video data transmitted from a terminal device via the network, with an image of a required document; a control means for controlling the video compositing means to enlarge or reduce the image of the document or the corresponding video of the user in the composite video, based on a document operation performed with a document operating means and/or on the presence or absence of the user's voice as detected from the audio data; and a transmission means for transmitting composite video data, which is the video data of the composite video created by the video compositing means, to the terminal devices via the network. COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a video conference system, an information processing apparatus, an information processing method, and a program, and is suitable for use, for example, when holding a video conference with a plurality of users at remote locations.

Conventionally, there are video conference systems in which video conference terminals installed at a plurality of locations are connected via public lines such as ISDN (Integrated Services Digital Network) or via dedicated lines, and in which information such as video data and audio data is exchanged so that a conference can be held with a plurality of users at remote locations.

In such a video conference system, video of the plurality of remote users is shown on a display. If a viewer can see how a remote user is speaking and that user's facial expression at the time, the conference is expected to be more communicative.

However, when a plurality of users gather in one place to take part in such a video conference, it is difficult to tell who is speaking. To solve this problem, a method has been proposed that detects the speaker from, for example, the direction of the voice, and identifies that user by adding a cursor above the head of the speaker's image on the display (see, for example, Patent Document 1).
Patent Document 1: JP 2003-189273 A

In recent years, video conference systems have come to connect personal computers installed at a plurality of locations over a network such as the Internet, using for example the Remote Desktop function of Windows XP (Windows is a registered trademark) or a remote access system such as VNC (Virtual Network Computing), so that a video conference can be held in a multi-window, multi-display environment on personal computers.

In this setting, during a free discussion with a remote user, that user's video would ideally be enlarged so that the way the user speaks and the user's facial expression can be recognized easily. When a remote user explains the contents of a document, the relevant part of the document would be enlarged instead. Doing so is expected to yield a highly communicative video conference with a strong sense of presence.

In such a case, however, the user must perform cumbersome operations such as enlarging or reducing the remote user's video and the document image. The user's attention shifts to those operations and away from the conference itself. Moreover, because the operations are cumbersome, users may stop enlarging or reducing the video and the document image as the situation changes, so that the conference lacks a sense of presence and communication suffers.

The present invention has been made in view of the above points, and proposes a video conference system, an information processing apparatus, an information processing method, and a program that are highly communicative and give a strong sense of presence.

To solve this problem, the present invention provides a video conference system comprising one or more terminal devices installed at different locations and an information processing apparatus connected to the terminal devices via a predetermined network. Each terminal device is provided with: photographing means for photographing a user; sound collecting means for collecting the user's voice; document operation means for performing a predetermined document operation on a document; transmission means for transmitting the video data output from the photographing means and the audio data output from the sound collecting means to the information processing apparatus via the network; and display means for displaying video based on composite video data transmitted from the information processing apparatus via the network. The information processing apparatus is provided with: video compositing means for generating a composite video by combining the user's video, based on the video data transmitted from the terminal device via the network, with an image of a required document; control means for controlling the video compositing means so as to enlarge or reduce the document image and/or the corresponding user's video in the composite video, based on a document operation performed with the document operation means and/or on whether the user is speaking, as detected from the audio data; and transmission means for transmitting composite video data, which is the video data of the composite video generated by the video compositing means, to the terminal devices via the network.

As a result, without requiring any extra operation from the users, the document image and the video of the user who performed the operation are made the main focus during a document operation, while the video of the user who is speaking is made the main focus when a user speaks.
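
The control behavior described above can be sketched roughly as follows. This is only an illustrative sketch, not the patented implementation: the `Event` class, the `choose_scales` function, and the scale factors 2.0 and 1.0 are all hypothetical names and values chosen for the example, since the text does not specify how much the images are enlarged or reduced.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One observation about a user (hypothetical structure)."""
    user: str          # user label, e.g. "7A"
    document_op: bool  # True if this user operated the document
    speaking: bool     # True if speech was detected in this user's audio


def choose_scales(events, users, large=2.0, normal=1.0):
    """Return (document_scale, {user: scale}) for the composite video.

    A document operation enlarges the document and the operating user's video;
    speech alone enlarges only the speaking user's video.
    """
    doc_scale = normal
    user_scale = {u: normal for u in users}
    for ev in events:
        if ev.document_op:
            doc_scale = large
            user_scale[ev.user] = large
        elif ev.speaking:
            user_scale[ev.user] = large
    return doc_scale, user_scale
```

For example, an event in which user 7A operates the document enlarges both the document and 7A's video, while leaving 7B's video at normal size.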

Also to solve this problem, the present invention provides an information processing apparatus connected via a predetermined network to one or more terminal devices installed at different locations. The apparatus is provided with: video compositing means for generating a composite video by combining the user's video, based on video data of the photographed user transmitted from the terminal device via the network, with an image of a required document; control means for controlling the video compositing means so as to enlarge or reduce the document image and/or the corresponding user's video in the composite video, based on a predetermined document operation on the document transmitted from the terminal device via the network and/or on whether the user is speaking, as detected from audio data of the user's collected voice; and transmission means for transmitting composite video data, which is the video data of the composite video generated by the video compositing means, to the terminal devices via the network.

Further, to solve this problem, the present invention provides an information processing method for an information processing apparatus connected via a predetermined network to one or more terminal devices installed at different locations. The method comprises: a first step of generating a composite video by combining the user's video, based on video data of the photographed user transmitted from the terminal device via the network, with an image of a required document; a second step of enlarging or reducing the document image and/or the corresponding user's video in the composite video, based on a predetermined document operation on the document transmitted from the terminal device via the network and/or on whether the user is speaking, as detected from audio data of the user's collected voice; and a third step of transmitting composite video data, which is the video data of the composite video, to the terminal devices via the network.

As a result, without requiring any extra operation from the users, the document image and the video of the user who performed the operation are made the main focus during a document operation, while the video of the user who is speaking is made the main focus when a user speaks.
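
The text does not say how the presence or absence of a user's speech is detected from the audio data. One common and simple approach, shown here purely as an assumption, is to compare the root-mean-square (RMS) level of a block of samples against a threshold. The function name `is_speaking` and the threshold of 500 are hypothetical.

```python
import math


def is_speaking(samples, threshold=500.0):
    """Return True if the RMS level of the PCM samples reaches the threshold.

    This is an assumed energy-based detector, not the method of the source.
    """
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= threshold
```

A real detector would usually smooth this decision over time so that short pauses do not cause the layout to flicker.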

Further, to solve this problem, the present invention provides a program that causes an information processing apparatus, connected via a predetermined network to one or more terminal devices installed at different locations, to execute: a first step of generating a composite video by combining the user's video, based on video data of the photographed user transmitted from the terminal device via the network, with an image of a required document; a second step of enlarging or reducing the document image and/or the corresponding user's video in the composite video, based on a predetermined document operation on the document transmitted from the terminal device via the network and/or on whether the user is speaking, as detected from audio data of the user's collected voice; and a third step of transmitting composite video data, which is the video data of the composite video, to the terminal devices via the network.

As described above, according to the present invention, in a video conference system comprising one or more terminal devices installed at different locations and an information processing apparatus connected to the terminal devices via a predetermined network, each terminal device is provided with photographing means for photographing a user, sound collecting means for collecting the user's voice, document operation means for performing a predetermined document operation on a document, transmission means for transmitting the video data output from the photographing means and the audio data output from the sound collecting means to the information processing apparatus via the network, and display means for displaying video based on composite video data transmitted from the information processing apparatus via the network. The information processing apparatus is provided with video compositing means for generating a composite video by combining the user's video, based on the video data transmitted from the terminal device via the network, with an image of a required document; control means for controlling the video compositing means so as to enlarge or reduce the document image and/or the corresponding user's video in the composite video, based on a document operation performed with the document operation means and/or on whether the user is speaking, as detected from the audio data; and transmission means for transmitting composite video data, which is the video data of the composite video generated by the video compositing means, to the terminal devices via the network. Consequently, without requiring any extra operation from the users, the document image and the video of the user who performed the operation are made the main focus during a document operation, while the video of the user who is speaking is made the main focus when a user speaks. A highly communicative video conference system with a strong sense of presence can thus be realized.

Also as described above, according to the present invention, an information processing apparatus connected via a predetermined network to one or more terminal devices installed at different locations is provided with: video compositing means for generating a composite video by combining the user's video, based on video data of the photographed user transmitted from the terminal device via the network, with an image of a required document; control means for controlling the video compositing means so as to enlarge or reduce the document image and/or the corresponding user's video in the composite video, based on a predetermined document operation on the document transmitted from the terminal device via the network and/or on whether the user is speaking, as detected from audio data of the user's collected voice; and transmission means for transmitting composite video data, which is the video data of the composite video generated by the video compositing means, to the terminal devices via the network. Consequently, without requiring any extra operation from the users, the document image and the video of the user who performed the operation are made the main focus during a document operation, while the video of the user who is speaking is made the main focus when a user speaks. A highly communicative information processing apparatus that gives a strong sense of presence can thus be realized.

Further, as described above, according to the present invention, an information processing method for an information processing apparatus connected via a predetermined network to one or more terminal devices installed at different locations comprises: a first step of generating a composite video by combining the user's video, based on video data of the photographed user transmitted from the terminal device via the network, with an image of a required document; a second step of enlarging or reducing the document image and/or the corresponding user's video in the composite video, based on a predetermined document operation on the document transmitted from the terminal device via the network and/or on whether the user is speaking, as detected from audio data of the user's collected voice; and a third step of transmitting composite video data, which is the video data of the composite video, to the terminal devices via the network. Consequently, without requiring any extra operation from the users, the document image and the video of the user who performed the operation are made the main focus during a document operation, while the video of the user who is speaking is made the main focus when a user speaks. A highly communicative information processing method that gives a strong sense of presence can thus be realized.

Further, as described above, according to the present invention, a program causes an information processing apparatus, connected via a predetermined network to one or more terminal devices installed at different locations, to execute: a first step of generating a composite video by combining the user's video, based on video data of the photographed user transmitted from the terminal device via the network, with an image of a required document; a second step of enlarging or reducing the document image and/or the corresponding user's video in the composite video, based on a predetermined document operation on the document transmitted from the terminal device via the network and/or on whether the user is speaking, as detected from audio data of the user's collected voice; and a third step of transmitting composite video data, which is the video data of the composite video, to the terminal devices via the network. Consequently, without requiring any extra operation from the users, the document image and the video of the user who performed the operation are made the main focus during a document operation, while the video of the user who is speaking is made the main focus when a user speaks. A highly communicative program that gives a strong sense of presence can thus be realized.

An embodiment of the present invention is described in detail below with reference to the drawings.

(1) Configuration of the video conference system according to the present embodiment
In FIG. 1, reference numeral 1 denotes the video conference system according to the present embodiment as a whole. It is configured by connecting a plurality of video conference terminals 2 (2A, 2B), installed at different locations, and a video conference server 3 to one another via a network NT such as the Internet.

As shown in FIG. 2, each video conference terminal 2 is provided with a camera unit 4 comprising, for example, a CCD (Charge Coupled Device), a microphone 5, and a pointing device 6 such as a mouse.

The camera unit 4 captures an image of the user 7 (7A, 7B) using the video conference terminal 2 and sends the resulting video signal S1 to a video processing unit 8. The video processing unit 8 applies predetermined signal processing, such as analog-to-digital conversion, to the supplied video signal S1 and sends the resulting video data D1 to a video compression unit 9. The video compression unit 9 compresses the supplied video data D1 using a predetermined compression scheme conforming to a predetermined standard, such as H.323 standardized by the International Telecommunication Union (ITU), and sends the resulting compressed video data D2 to a packetizing unit 10.

The microphone 5 collects the voice of the user 7 when the user speaks and sends the resulting audio signal S2 to an input audio processing unit 11. The input audio processing unit 11 applies predetermined signal processing, such as analog-to-digital conversion, to the supplied audio signal S2 and sends the resulting audio data D3 to an audio compression unit 12. The audio compression unit 12 compresses the supplied audio data D3 using a compression scheme conforming to a predetermined standard such as H.323, as above, and sends the resulting compressed audio data D4 to the packetizing unit 10.

The pointing device 6 sends an operation information signal S3 representing the content of the user's operation to a situation determination unit 13. The situation determination unit 13 sends situation determination information determined from this operation information signal S3, such as the current cursor position, to the packetizing unit 10 as situation determination data D5.

The packetizing unit 10 packetizes, in a predetermined format, the compressed video data D2 supplied from the video compression unit 9, the compressed audio data D4 supplied from the audio compression unit 12, and the situation determination data D5 supplied from the situation determination unit 13, and sends the resulting packet data D6 to a network control unit 14.
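
The "predetermined format" of the packet is not specified in the text. As a minimal sketch under that caveat, the following assumes a hypothetical layout in which a 12-byte header holds the three payload lengths, followed by the video, audio, and situation determination payloads in that order. The names `packetize` and `depacketize` and the header layout are illustrative assumptions only.

```python
import struct

# Hypothetical header: three unsigned 32-bit big-endian payload lengths.
HEADER = struct.Struct("!III")


def packetize(video: bytes, audio: bytes, situation: bytes) -> bytes:
    """Bundle the three payloads (D2, D4, D5 in the text) into one packet."""
    header = HEADER.pack(len(video), len(audio), len(situation))
    return header + video + audio + situation


def depacketize(packet: bytes):
    """Split a packet back into (video, audio, situation) payloads."""
    v_len, a_len, s_len = HEADER.unpack_from(packet, 0)
    offset = HEADER.size
    video = packet[offset:offset + v_len]
    offset += v_len
    audio = packet[offset:offset + a_len]
    offset += a_len
    situation = packet[offset:offset + s_len]
    return video, audio, situation
```

A length-prefixed layout like this lets the receiving side separate the three streams without scanning for delimiters, which mirrors the extraction performed on the server side.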

The network control unit 14 applies predetermined modulation processing, such as intermediate frequency modulation, to the supplied packet data D6 and transmits the resulting transmission signal S4 to the video conference server 3 via the network NT.

The video conference server 3 is configured as shown in FIG. 3, and receives the transmission signal S4 transmitted from each video conference terminal 2 at a network control unit 16. The network control unit 16 applies predetermined demodulation processing, such as detection, to the transmission signal S4 from each video conference terminal 2 and sends the resulting packet data D7 to a depacketizing unit 16.

The depacketizing unit 16 extracts the compressed video data D8, the compressed audio data D9, and the situation determination data D10 contained in the packets of the supplied packet data D7. It sends the compressed video data D8 to a video decompression unit 17, the compressed audio data D9 to an audio decompression unit 18, and the situation determination data D10 to a situation determination control unit 19.

The video decompression unit 17 decompresses the supplied compressed video data D8 using the corresponding decoding scheme and sends the resulting baseband video data D11 to a video compositing unit 20.

Like the video conference terminals 2, the video conference server 3 is also provided with a camera unit 21 such as a CCD camera, a microphone 22, and a pointing device 23 such as a mouse.

The camera unit 21 captures an image of the user 7 (7C) using the video conference server 3 and sends the resulting video signal S5 to a video processing unit 25. The video processing unit 25 applies predetermined signal processing, such as analog-to-digital conversion, to the supplied video signal S5 and sends the resulting video data D12 to the video compositing unit 20.

The pointing device 23 sends an operation information signal S6 representing the content of the user's operation to the situation determination control unit 19. Based on this operation information signal S6 and the situation determination data D10 transmitted from each video conference terminal 2, the situation determination control unit 19 generates, for each video conference terminal 2 and for the video conference server 3, image data D13 of a cursor that moves in response to operation of the corresponding pointing device 6 or 23 (hereinafter referred to as cursor image data), and sends it to the video compositing unit 20.
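
As a rough illustration of keeping one cursor per terminal and one for the server, the sketch below stores the latest reported cursor position for each source and clamps it to the composite canvas. The function name `update_cursor`, the source labels, and the 640x480 canvas size are hypothetical; the text only states that a cursor is generated for each device and follows its pointing device.

```python
def update_cursor(cursors, source, position, canvas=(640, 480)):
    """Record the latest cursor position for one source.

    cursors:  dict mapping a source label (e.g. "2A", "2B", "3") to (x, y)
    position: (x, y) reported for that source
    canvas:   assumed (width, height) of the composite video
    """
    x, y = position
    width, height = canvas
    # Clamp so the cursor image always stays inside the composite video.
    cursors[source] = (min(max(x, 0), width - 1), min(max(y, 0), height - 1))
    return cursors
```

Keying the state by source keeps the cursors independent, so each participant's pointer can be drawn separately in the composite video.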

The video conference server 3 is further provided with a document storage unit 26, which stores image data of documents transmitted in advance from a video conference terminal 2 or input by the user 7C of the video conference server 3. When a command to display a document stored in the document storage unit 26 is given, as determined from the operation information signal S6 supplied from the pointing device 23 of the video conference server 3 and the situation determination data D10 transmitted from each video conference terminal 2, the situation determination control unit 19 controls the document storage unit 26 to read out the data of that document. The document data D14 of this document is then supplied to the video compositing unit 20.

Under the control of the situation determination control unit 19, the video compositing unit 20 generates a composite video 32, for example as shown in FIG. 4, by combining: the user videos 27 (27A, 27B) based on the video data D11 from each video conference terminal 2 supplied from the video decompression unit 17; the user video 27 (27C) based on the video data D12 supplied from the video processing unit 25; the cursor images 29 (29A to 29C) based on the cursor image data D13 supplied from the situation determination control unit 19; and the document image 31 based on the document data D14 supplied from the document storage unit 26. It sends the composite video data D15, which is the video data of this composite video, to a display processing unit 33 and a video compression unit 34.
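
The compositing step can be pictured as pasting layers onto a canvas from back to front. The following is a toy sketch only: it represents images as small grids of single characters instead of pixels, and the names `blank`, `paste`, and `compose` as well as the layer positions are hypothetical. The actual layout of FIG. 4 is not reproduced here.

```python
def blank(width, height, fill="."):
    """Create an empty canvas as a list of rows."""
    return [[fill] * width for _ in range(height)]


def paste(canvas, tile, left, top):
    """Copy a tile onto the canvas at (left, top), ignoring out-of-range cells."""
    for r, row in enumerate(tile):
        for c, value in enumerate(row):
            y, x = top + r, left + c
            if 0 <= y < len(canvas) and 0 <= x < len(canvas[0]):
                canvas[y][x] = value
    return canvas


def compose(layers, width, height):
    """Combine (tile, left, top) layers in back-to-front order."""
    canvas = blank(width, height)
    for tile, left, top in layers:
        paste(canvas, tile, left, top)
    return canvas
```

Listing the cursor layers last ensures they are drawn on top of the document image and user videos, so a pointer remains visible wherever it moves.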

そして表示処理部３３は、この合成映像データＤ１５に対してディジタル／アナログ変換処理等の所定の信号処理を施し、得られた映像信号Ｓ７を例えばＣＲＴ（Cathode-Ray Tube）等でなるディスプレイ３５に送出する。これによりこの映像信号Ｓ７に基づく上述のような合成画像３２がディスプレイ３５に表示される。 Then, the display processing unit 33 performs predetermined signal processing such as digital / analog conversion processing on the synthesized video data D15, and the obtained video signal S7 is displayed on a display 35 made of, for example, a CRT (Cathode-Ray Tube). Send it out. Thereby, the composite image 32 as described above based on the video signal S7 is displayed on the display 35.

また映像圧縮部３４は、映像合成部２０から供給される合成映像データＤ１５に対して所定の圧縮処理を施し、得られた圧縮合成映像データＤ１６をパケタイズ部３６に送出する。 In addition, the video compression unit 34 performs a predetermined compression process on the synthesized video data D15 supplied from the video synthesis unit 20, and sends the obtained compressed synthesized video data D16 to the packetizing unit 36.

一方、音声伸張部１８は、デパケタイズ部１６から供給される圧縮音声データＤ９に対して対応する復号化方式での復号化処理を施し、得られたベースバンドの音声データＤ１７を音声合成部３７に送出する。 Meanwhile, the voice decompression unit 18 decodes the compressed voice data D9 supplied from the depacketizing unit 16 using the corresponding decoding method and sends the resulting baseband voice data D17 to the voice synthesis unit 37.

このとき音声合成部３７には、このテレビ会議サーバ３のマイクロホン２２から出力された音声信号Ｓ８を、音声処理部３８においてアナログ／ディジタル変換等の所定の信号処理を施すことにより得られたベースバンドの音声データＤ１８が与えられる。 At this time, the voice synthesis unit 37 is also given baseband voice data D18, which is obtained by applying predetermined signal processing such as analog/digital conversion in the voice processing unit 38 to the voice signal S8 output from the microphone 22 of the video conference server 3.

かくして音声合成部３７は、これら音声伸張部１８から供給される各テレビ会議端末２（２Ａ及び２Ｂ）からの音声データＤ１７に基づく各音声と、音声処理部３８から供給される音声データＤ１８に基づく音声とを合成した合成音声を生成し、その音声データでなる合成音声データＤ１９を出力音声処理部３９及び音声圧縮部４０に送出する。 Thus the voice synthesis unit 37 generates a synthesized voice by combining the voices based on the voice data D17 from the video conference terminals 2 (2A and 2B) supplied from the voice decompression unit 18 with the voice based on the voice data D18 supplied from the voice processing unit 38, and sends the synthesized voice data D19, which is the voice data of this synthesized voice, to the output voice processing unit 39 and the voice compression unit 40.
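
The text does not say how the voice synthesis unit 37 combines the voices. As a minimal sketch, assuming each source delivers 16-bit PCM samples at a common sample rate, the streams could be summed sample by sample and clipped. The function name `mix_pcm16` and the clipping strategy are assumptions for illustration only and are not taken from the patent.

```python
from typing import List, Sequence


def mix_pcm16(streams: Sequence[Sequence[int]]) -> List[int]:
    """Mix several 16-bit PCM streams by summing samples and clipping.

    Hypothetical stand-in for voice synthesis unit 37; the patent does not
    specify the mixing method.
    """
    if not streams:
        return []
    # Only mix as many samples as every stream can supply.
    length = min(len(s) for s in streams)
    mixed = []
    for i in range(length):
        total = sum(s[i] for s in streams)
        # Clip to the signed 16-bit range to avoid wrap-around distortion.
        mixed.append(max(-32768, min(32767, total)))
    return mixed
```

For example, `mix_pcm16([d17_a, d17_b, d18])` would combine the two terminal streams and the server's own microphone stream into one output buffer.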

そして出力音声処理部３９は、この合成音声データＤ１９に対してディジタル／アナログ変換処理等の所定の信号処理を施し、得られた音声信号Ｓ９をスピーカ４１に出力する。これによりこの音声信号Ｓ９に基づく上述のような合成音声がスピーカ４１から出力される。 The output sound processing unit 39 performs predetermined signal processing such as digital / analog conversion processing on the synthesized sound data D19, and outputs the obtained sound signal S9 to the speaker 41. As a result, the above synthesized voice based on the voice signal S9 is output from the speaker 41.

また音声圧縮部４０は、音声合成部３７から供給される合成音声データＤ１９に対して所定の圧縮処理を施し、得られた圧縮合成音声データＤ２０をパケタイズ部３６に送出する。 The voice compression unit 40 performs a predetermined compression process on the synthesized voice data D19 supplied from the voice synthesis unit 37, and sends the obtained compressed synthesized voice data D20 to the packetizing unit 36.

パケタイズ部３６は、映像圧縮部３４から供給される圧縮合成映像データＤ１６及び音声圧縮部４０から供給される圧縮合成音声データＤ２０を所定フォーマットでパケット化し、得られたパケットデータＤ２１をネットワーク制御部１５に送出する。 The packetizing unit 36 packetizes the compressed synthesized video data D16 supplied from the video compression unit 34 and the compressed synthesized voice data D20 supplied from the voice compression unit 40 in a predetermined format, and sends the resulting packet data D21 to the network control unit 15.

そしてネットワーク制御部１５は、供給されるパケットデータＤ２１に対して中間周波数変調等の所定の変調処理を施し、かくして得られた送信信号Ｓ１０をネットワークＮＴを介して各テレビ会議端末２にそれぞれ送信する。 Then, the network control unit 15 performs predetermined modulation processing such as intermediate frequency modulation on the supplied packet data D21, and transmits the transmission signal S10 thus obtained to each video conference terminal 2 via the network NT. .

各テレビ会議端末２においては、テレビ会議サーバ３から送信される送信信号Ｓ１０をネットワーク制御部１４において受信する。そしてネットワーク制御部１４は、このテレビ会議サーバ３からの送信信号Ｓ１０に対して検波処理等の所定の復調処理をそれぞれ施し、得られたパケットデータＤ２２をデパケタイズ部４２に送出する。 In each video conference terminal 2, the network control unit 14 receives the transmission signal S 10 transmitted from the video conference server 3. Then, the network control unit 14 performs predetermined demodulation processing such as detection processing on the transmission signal S10 from the video conference server 3, and sends the obtained packet data D22 to the depacketizing unit 42.

デパケタイズ部４２は、供給されるパケットデータＤ２２からパケットに含まれる圧縮合成映像データＤ２３及び圧縮合成音声データＤ２４を抽出し、圧縮合成映像データＤ２３を映像伸張部４３に送出すると共に、圧縮合成音声データＤ２４を音声伸張部４４に送出する。 The depacketizing unit 42 extracts the compressed synthesized video data D23 and the compressed synthesized audio data D24 included in the packet from the supplied packet data D22, sends the compressed synthesized video data D23 to the video decompressing unit 43, and also compresses the synthesized synthesized audio data. D24 is sent to the voice decompression unit 44.
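
The "predetermined format" used by the packetizing unit 36 and undone by the depacketizing unit 42 is not described. The following sketch assumes a purely hypothetical layout (two 32-bit length fields followed by the video and audio payloads) simply to show how one packet can carry both streams and be split again on the receiving side.

```python
import struct
from typing import Tuple

# Assumed header: video length and audio length, network byte order.
HEADER = struct.Struct("!II")


def packetize(video: bytes, audio: bytes) -> bytes:
    """Combine compressed video and audio into one packet (hypothetical format)."""
    return HEADER.pack(len(video), len(audio)) + video + audio


def depacketize(packet: bytes) -> Tuple[bytes, bytes]:
    """Split a packet produced by packetize() back into (video, audio)."""
    video_len, audio_len = HEADER.unpack_from(packet, 0)
    body = packet[HEADER.size:]
    if len(body) != video_len + audio_len:
        raise ValueError("length fields do not match payload size")
    return body[:video_len], body[video_len:]
```

A real system would more likely send video and audio in separate, timestamped packets; the single-packet layout here only mirrors the wording of the description.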

そして映像伸張部４３は、供給される圧縮合成映像データＤ２３に対して対応する復号化方式での伸張処理を施し、得られたベースバンドの合成映像データＤ２５を表示処理部４５に送出する。また表示処理部４５は、この合成映像データＤ２５に対してディジタル／アナログ変換処理等の所定の信号処理を施し、得られた映像信号Ｓ１１をＣＲＴ等でなるディスプレイ４６送出する。これによりこの映像信号Ｓ１１に基づく上述の合成画像３２（図４）がディスプレイ４６に表示される。 Then, the video decompression unit 43 performs decompression processing with the corresponding decoding method on the supplied compressed composite video data D23 and sends the obtained baseband composite video data D25 to the display processing unit 45. The display processing unit 45 performs predetermined signal processing such as digital / analog conversion processing on the composite video data D25, and sends the obtained video signal S11 to the display 46 such as a CRT. As a result, the composite image 32 (FIG. 4) based on the video signal S11 is displayed on the display 46.

さらに音声伸張部４４は、供給される圧縮合成音声データＤ２４に対して対応する復号化方式での伸張処理を施し、得られたベースバンドの合成音声データＤ２６を出力音声処理部４７に送出する。また出力音声処理部４７は、この合成音声データＤ２６に対してディジタル／アナログ変換処理等の所定の信号処理を施し、得られた音声信号Ｓ１２をスピーカ４８に送出する。これによりこの音声信号Ｓ１２に基づく上述の合成音声がスピーカ４８から出力される。 Further, the voice decompression unit 44 decompresses the supplied compressed synthesized voice data D24 using the corresponding decoding method and sends the resulting baseband synthesized voice data D26 to the output voice processing unit 47. The output voice processing unit 47 performs predetermined signal processing such as digital/analog conversion on the synthesized voice data D26 and sends the resulting voice signal S12 to the speaker 48. The synthesized voice described above is thereby output from the speaker 48 based on the voice signal S12.

このようにしてこのテレビ会議システム１においては、各テレビ会議端末２のカメラ部４及びテレビ会議サーバ３のカメラ部２１によってそれぞれ撮像された各ユーザ７の各ユーザ映像２７、カーソル画像２９及びドキュメント画像３１を合成してなる合成映像３２を各テレビ会議端末２の各ディスプレイ４６及びテレビ会議サーバ３のディスプレイ３５にそれぞれ表示させると共に、各テレビ会議端末２の各マイクロホン５及びテレビ会議サーバ３のマイクロホン２２によってそれぞれ集音された各ユーザ７の音声を合成してなる合成音声を各テレビ会議端末２の各スピーカ４８及びテレビ会議サーバ３のスピーカ４１からそれぞれ出力させることができるようになされている。 In this way, the video conference system 1 can display, on the display 46 of each video conference terminal 2 and the display 35 of the video conference server 3, the composite video 32 formed by combining the user videos 27 of the users 7 captured by the camera unit 4 of each video conference terminal 2 and the camera unit 21 of the video conference server 3 with the cursor images 29 and the document image 31. It can likewise output, from the speaker 48 of each video conference terminal 2 and the speaker 41 of the video conference server 3, the synthesized voice formed by combining the voices of the users 7 picked up by the microphone 5 of each video conference terminal 2 and the microphone 22 of the video conference server 3.

（２）テレビ会議システム１に搭載された各種機能 (2) Various functions installed in the video conference system 1
次に、このテレビ会議システム１に搭載された各種機能について説明する。 Next, the various functions installed in the video conference system 1 will be described.

（２−１）ドキュメント操作時等の表示制御機能 (2-1) Display control function during document operation, etc.
このテレビ会議システム１には、各テレビ会議端末２又はテレビ会議サーバ３の各ポインティングデバイス６又は２３の操作に応じて、そのとき各ディスプレイ３５及び４６に表示されている合成映像３２内のドキュメント画像３１の表示位置及び又はその大きさや、各ユーザ７の各ユーザ映像２７の表示位置及び又はその大きさを変化させるドキュメント操作時等の表示制御機能が搭載されている。 The video conference system 1 is equipped with a display control function for document operation and the like. In response to operation of the pointing device 6 or 23 of a video conference terminal 2 or the video conference server 3, this function changes the display position and/or size of the document image 31 in the composite video 32 currently displayed on the displays 35 and 46, and the display position and/or size of the user video 27 of each user 7.

実際上、テレビ会議サーバ３の状況判断制御部１９は、かかるドキュメント操作時等の表示制御機能を実現するための手段として、図５に示すように、そのとき表示されているドキュメント及び現在そのドキュメントを操作しているユーザ７を管理するための付加情報テーブル４９を有している。 In practice, as a means of implementing this display control function, the situation determination control unit 19 of the video conference server 3 has an additional information table 49, shown in FIG. 5, for managing the document currently displayed and the user 7 currently operating that document.

そして状況判断制御部１９は、テレビ会議サーバ３のポインティングデバイス２３から供給される操作入力信号Ｓ６と、各テレビ会議端末２からそれぞれ送信される状況判断データＤ１０とに基づいて、いずれかのテレビ会議端末２又はテレビ会議サーバ３のポインティングデバイス６又は２３が操作されて、ドキュメント格納部２６に格納されたドキュメントデータＤ１４に基づくドキュメント画像３１を表示し、若しくは現在表示しているドキュメント画像３１を他のドキュメント画像３１に変更し、又は現在表示しているドキュメント画像３１の次ページを表示等すべき旨の命令が入力されたことを認識すると、そのドキュメントデータＤ１４のファイル名を付加情報テーブル４９の表示ドキュメント管理欄４９Ａに格納すると共に、その操作を行ったユーザ７が使用しているテレビ会議端末２又はテレビ会議サーバ３のＩＤ（以下、これを端末・サーバＩＤと呼ぶ）を付加情報テーブル４９のドキュメント操作ユーザ管理欄４９Ｂに格納する。 Based on the operation input signal S6 supplied from the pointing device 23 of the video conference server 3 and the situation determination data D10 transmitted from each video conference terminal 2, the situation determination control unit 19 recognizes when the pointing device 6 or 23 of any video conference terminal 2 or of the video conference server 3 has been operated to input a command to display a document image 31 based on document data D14 stored in the document storage unit 26, to change the currently displayed document image 31 to another document image 31, or to display the next page of the currently displayed document image 31. It then stores the file name of that document data D14 in the displayed document management column 49A of the additional information table 49, and stores the ID of the video conference terminal 2 or video conference server 3 used by the user 7 who performed the operation (hereinafter called the terminal/server ID) in the document operation user management column 49B of the additional information table 49.

また状況判断制御部１９は、この付加情報テーブル４９に基づいてドキュメント格納部２６を制御することにより、当該ドキュメント格納部２６に格納されている指定されたドキュメントデータＤ１４を読み出させて、これを映像合成部２０に送出させる。 The situation determination control unit 19 also controls the document storage unit 26 based on the additional information table 49 so that the designated document data D14 stored in the document storage unit 26 is read out and sent to the video composition unit 20.
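
The additional information table 49 is only described in terms of its columns. As a rough sketch, assuming it can be held as a single in-memory record, it might look like the following. The class name, field names, and the `on_document_command` helper are hypothetical; only the column numbers 49A and 49B and their meanings come from the description above, and column 49C is the speaking-user column introduced in section (2-2).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdditionalInfoTable:
    """Hypothetical in-memory form of additional information table 49."""
    displayed_document: Optional[str] = None    # column 49A: file name of document data D14
    document_operator_id: Optional[str] = None  # column 49B: terminal/server ID of the operator
    speaking_user_id: Optional[str] = None      # column 49C: terminal/server ID of the speaker


def on_document_command(table: AdditionalInfoTable, file_name: str, source_id: str) -> None:
    """Record a display, change-document, or next-page command in the table."""
    table.displayed_document = file_name
    table.document_operator_id = source_id
```

Keeping the state in one small record matches the description, in which both the document storage unit 26 and the video composition unit 20 are controlled from the same table.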

さらに状況判断制御部１９は、これと併せて付加情報テーブル４９に基づいて映像合成部２０を制御することにより、図６に示すように、ドキュメント格納部２６から供給されるドキュメントデータＤ１４に基づくドキュメント画像３１を画面の中央部に拡大表示させると共に、その操作が行われたテレビ会議端末２又はテレビ会議サーバ３から送信される映像データＤ１１又はＤ１２に基づくユーザ７のユーザ映像２７を画面左下に拡大表示し、かつ他のテレビ会議端末２又はテレビ会議サーバ３から送信される映像データＤ１１又はＤ１２に基づくユーザ７のユーザ映像２７が画面右側に拡大していない、うなずき等が認識できる程度の所定の大きさで表示されてなる合成映像３２の合成映像データＤ１５を生成させる。 In addition, the situation determination control unit 19 controls the video composition unit 20 based on the additional information table 49 to generate composite video data D15 for a composite video 32 in which, as shown in FIG. 6, the document image 31 based on the document data D14 supplied from the document storage unit 26 is enlarged at the center of the screen, the user video 27 of the user 7 based on the video data D11 or D12 from the video conference terminal 2 or video conference server 3 where the operation was performed is enlarged at the lower left of the screen, and the user videos 27 of the users 7 based on the video data D11 or D12 from the other video conference terminals 2 or the video conference server 3 are shown on the right side of the screen, not enlarged, at a predetermined size at which nods and similar gestures can still be recognized.

そして映像合成部２０は、このように生成した合成映像データＤ１５を表示処理部３３及び映像圧縮部３４にそれぞれ送出する。これによりこの図６のような合成映像３２がテレビ会議サーバ３のディスプレイ３５及び各テレビ会議端末２のディスプレイ４６にそれぞれ表示されることとなる。 Then, the video composition unit 20 sends the composite video data D15 generated in this way to the display processing unit 33 and the video compression unit 34, respectively. As a result, the composite video 32 as shown in FIG. 6 is displayed on the display 35 of the video conference server 3 and the display 46 of each video conference terminal 2.

一方、状況判断制御部１９は、このように合成映像３２内にドキュメント画像３１を表示させたときは、当該テレビ会議サーバ３のポインティングデバイス２３から供給される操作入力信号Ｓ６と、各テレビ会議端末２から送信される状況判断データＤ１０とに基づいて、テレビ会議サーバ３のポインティングデバイス２３及び各テレビ会議端末２のポインティングデバイス６とそれぞれ対応付けられた各カーソル画像２９の位置を監視する。 Meanwhile, once the document image 31 is displayed in the composite video 32 in this way, the situation determination control unit 19 monitors the positions of the cursor images 29 associated with the pointing device 23 of the video conference server 3 and the pointing device 6 of each video conference terminal 2, based on the operation input signal S6 supplied from the pointing device 23 of the video conference server 3 and the situation determination data D10 transmitted from each video conference terminal 2.

そして状況判断制御部１９は、それまでカーソル操作が行われていたテレビ会議端末２又はテレビ会議サーバ３と異なるテレビ会議端末２又はテレビ会議サーバ３と対応付けられた各カーソル画像２９がドキュメント上で１〜２秒以上移動するようなカーソル操作が行われたときには、付加情報管理テーブル４９のドキュメント操作ユーザ欄４９Ｂに格納されている端末・サーバＩＤを、その新たなテレビ会議端末２又はテレビ会議サーバ３の端末・サーバＩＤに書き換える。 When a cursor operation is performed in which the cursor image 29 associated with a video conference terminal 2 or video conference server 3 other than the one on which cursor operations had been performed until then moves over the document for 1 to 2 seconds or more, the situation determination control unit 19 rewrites the terminal/server ID stored in the document operation user column 49B of the additional information management table 49 to the terminal/server ID of that new video conference terminal 2 or video conference server 3.
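
The handover rule above (a different terminal's cursor moving over the document for 1 to 2 seconds or more) can be sketched as a small tracker. This is only an illustration: the class `OperatorTracker`, the 1.5-second threshold, and the sampled-input interface are assumptions, and the description does not say how cursor movement is actually sampled.

```python
from typing import Dict, Optional

DWELL_SECONDS = 1.5  # assumed value inside the "1 to 2 seconds" range in the text


class OperatorTracker:
    """Tracks which terminal/server ID belongs in column 49B (hypothetical)."""

    def __init__(self, current_operator: Optional[str] = None) -> None:
        self.current_operator = current_operator
        # Time at which each source started moving its cursor over the document.
        self._moving_since: Dict[str, float] = {}

    def on_cursor_sample(self, source_id: str, on_document: bool,
                         moving: bool, now: float) -> Optional[str]:
        """Feed one cursor sample; return the operator ID after this sample."""
        if not (on_document and moving):
            # Movement was interrupted, so the dwell timer restarts next time.
            self._moving_since.pop(source_id, None)
            return self.current_operator
        start = self._moving_since.setdefault(source_id, now)
        if source_id != self.current_operator and now - start >= DWELL_SECONDS:
            self.current_operator = source_id
        return self.current_operator
```

The dwell threshold keeps a cursor that merely passes across the document from stealing the enlarged position from the current operator.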

また状況判断制御部１９は、この後この付加情報テーブル４９に基づいて映像合成部２０を制御することにより、図７に示すように、それまでカーソル操作が行われていたテレビ会議端末２からの映像データＤ１１又は当該テレビ会議サーバ３の映像処理部２５からの映像データＤ１２に基づくユーザ映像２７と、新たにカーソル操作が行われたテレビ会議端末２からの映像データＤ１１又は当該テレビ会議サーバ３の映像処理部２５からの映像データＤ１２に基づくユーザ映像２７とを入れ換えた合成映像３２の合成映像データＤ１５を生成し、これを表示処理部３３及び映像圧縮部３４にそれぞれ送出する。これによりこの図６のようなもとの合成映像３２に対して、図７に示すような合成映像３２がテレビ会議サーバ３のディスプレイ３５及び各テレビ会議端末２のディスプレイ４６にそれぞれ表示される。 The situation determination control unit 19 then controls the video composition unit 20 based on the additional information table 49 to generate composite video data D15 for a composite video 32 in which, as shown in FIG. 7, the user video 27 from the source on which cursor operations had been performed until then (the video data D11 from a video conference terminal 2, or the video data D12 from the video processing unit 25 of the video conference server 3) is swapped with the user video 27 from the source on which the new cursor operation was performed, and this data is sent to the display processing unit 33 and the video compression unit 34. As a result, in place of the original composite video 32 of FIG. 6, the composite video 32 shown in FIG. 7 is displayed on the display 35 of the video conference server 3 and the display 46 of each video conference terminal 2.

他方、状況判断制御部１９は、このように映像合成部２０を制御して合成映像３２内にドキュメント画像３１を表示させたときには、テレビ会議サーバ３のポインティングデバイス２３から供給される操作入力信号Ｓ６と、各テレビ会議端末２から送信される状況判断データＤ１０とに基づいて、ユーザ７がそのドキュメント画像３１上でカーソル操作を行っている時間を監視する。 On the other hand, when the situation determination control unit 19 controls the video composition unit 20 to display the document image 31 in the composite video 32 in this way, the operation input signal S6 supplied from the pointing device 23 of the video conference server 3 is displayed. Based on the situation determination data D10 transmitted from each video conference terminal 2, the time during which the user 7 performs the cursor operation on the document image 31 is monitored.

そして状況判断制御部１９は、かかるカーソル操作を同一のユーザ７が継続して行っているときには、所定時間（例えば１〜２分程度）ごとに映像合成部２０を制御することにより、当該所定時間ごとにドキュメント画像３１を段階的に順次拡大させた図８のような合成映像３２の合成映像データＤ１５を生成し、これを表示処理部３３及び映像圧縮部３４にそれぞれ送出する。これによりユーザ７のカーソル操作に応じて一定時間ごとにドキュメント画像３１が順次段階的に拡大するような合成映像３２がテレビ会議サーバ３のディスプレイ３５及び各テレビ会議端末２のディスプレイ４６にそれぞれ表示される。 While the same user 7 continues such cursor operation, the situation determination control unit 19 controls the video composition unit 20 at every predetermined interval (for example, about 1 to 2 minutes) to generate composite video data D15 for a composite video 32, as in FIG. 8, in which the document image 31 is enlarged one step at each interval, and this data is sent to the display processing unit 33 and the video compression unit 34. As a result, a composite video 32 in which the document image 31 grows step by step at fixed intervals in response to the cursor operation of the user 7 is displayed on the display 35 of the video conference server 3 and the display 46 of each video conference terminal 2.

これに対して状況判断制御部１９は、かかるカーソル操作をいずれのユーザ７も行っていないときには、所定時間（例えば１〜２分程度）ごとに映像合成部２０を制御することにより、当該所定時間ごとにドキュメント画像３１を段階的に順次縮小させた合成映像３２の合成映像データＤ１５を生成し、これを表示処理部３３及び映像圧縮部３４にそれぞれ送出する。これによりテレビ会議端末２又はテレビ会議サーバ３のいずれにおいてもユーザ７のカーソル操作がないときには、一定時間ごとにドキュメント画像３１が順次段階的に縮小し、最終的には図９に示すように、合成映像３２の左端にドキュメント画像３１が縮小されて配置された合成映像３２がテレビ会議サーバ３のディスプレイ３５及び各テレビ会議端末２のディスプレイ４６にそれぞれ表示される。 Conversely, when no user 7 is performing such cursor operation, the situation determination control unit 19 controls the video composition unit 20 at every predetermined interval (for example, about 1 to 2 minutes) to generate composite video data D15 for a composite video 32 in which the document image 31 is reduced one step at each interval, and this data is sent to the display processing unit 33 and the video compression unit 34. As a result, when there is no cursor operation by a user 7 at any video conference terminal 2 or at the video conference server 3, the document image 31 shrinks step by step at fixed intervals until, as shown in FIG. 9, a composite video 32 with the reduced document image 31 placed at its left edge is displayed on the display 35 of the video conference server 3 and the display 46 of each video conference terminal 2.
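
The stepwise enlargement and reduction of the document image 31 amounts to moving an index up or down a fixed ladder of sizes once per interval. The sketch below illustrates that idea; the scale values, the 90-second interval, and the function name are assumptions, since the description gives only "about 1 to 2 minutes" and no concrete sizes.

```python
SCALE_STEPS = [0.25, 0.5, 0.75, 1.0]  # assumed relative sizes; the text gives no values
STEP_INTERVAL_S = 90.0                # assumed, within "about 1 to 2 minutes"


def next_document_step(step: int, same_user_operating: bool,
                       nobody_operating: bool) -> int:
    """Return the index into SCALE_STEPS to use after one interval elapses."""
    if same_user_operating:
        # Continued operation by the same user: enlarge by one step, up to the maximum.
        return min(step + 1, len(SCALE_STEPS) - 1)
    if nobody_operating:
        # No cursor operation anywhere: reduce by one step, down to the minimum.
        return max(step - 1, 0)
    # Any other situation (for example, the operator just changed): hold the size.
    return step
```

A caller would invoke `next_document_step` once every `STEP_INTERVAL_S` seconds and pass `SCALE_STEPS[step]` to the compositor.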

このようにして、このテレビ会議システム１では、ユーザ７のドキュメント操作に応じてドキュメント画像３１及びユーザ７のユーザ映像２７を拡大又は縮小するように表示し得るようになされ、これによりユーザ７に余分な操作をさせることなく、ドキュメント操作時等にはドキュメント画像３１及びその操作を行ったユーザ７のユーザ映像２７をメインに捉えさせ得るようになされている。 In this way, the video conference system 1 can display the document image 31 and the user video 27 of the user 7 enlarged or reduced according to the document operation of the user 7. Without requiring any extra operation from the users 7, this lets the document image 31 and the user video 27 of the user 7 who performed the operation become the main focus during document operation and the like.

（２−２）音声検出等に基づく表示及び音量制御機能 (2-2) Display and volume control function based on voice detection, etc.
一方、このテレビ会議システム１には、ユーザ７の発言の有無に応じて、そのときテレビ会議サーバ３のディスプレイ３５及び各テレビ会議端末２のディスプレイ４６に表示されている合成映像３２内の各ユーザ７のユーザ映像２７の表示位置及びその大きさや、各テレビ会議端末２及びテレビ会議サーバ３のスピーカ４１及び４８から出力される各ユーザ７の発言音量を変化させる表示及び音量制御機能が搭載されている。 The video conference system 1 is also equipped with a display and volume control function. Depending on whether a user 7 is speaking, this function changes the display position and size of the user video 27 of each user 7 in the composite video 32 currently displayed on the display 35 of the video conference server 3 and the display 46 of each video conference terminal 2, and changes the speech volume of each user 7 output from the speakers 41 and 48 of the video conference server 3 and the video conference terminals 2.

実際上、このテレビ会議システム１の場合、テレビ会議端末２においては、入力音声処理部１１から出力される音声信号Ｓ２が状況判断部１３に与えられる。 Actually, in the case of this video conference system 1, in the video conference terminal 2, the audio signal S 2 output from the input audio processing unit 11 is given to the situation determination unit 13.

そして状況判断部１３は、供給される音声信号Ｓ２に基づいて、その信号レベルからユーザ７が発言したか否かを常時監視し、ユーザ７が発言したことを検出したときには、これを上述の状況判断データＤ５としてパケタイズ部１０に送出する。かくしてこの状況判断データＤ５は、この後上述のようにネットワークＮＴを介してテレビ会議サーバ３に送信され、その後状況判断制御部１９に与えられる。 And the situation judgment part 13 always monitors whether the user 7 has spoken from the signal level based on the supplied audio signal S2, and when detecting that the user 7 has spoken, this is described above. The determination data D5 is sent to the packetizing unit 10. Thus, the situation determination data D5 is thereafter transmitted to the video conference server 3 via the network NT as described above, and is then given to the situation determination control unit 19.

このとき状況判断制御部１９は、入力音声処理部３８から出力された音声信号Ｓ８を入力し、この音声信号Ｓ８に基づいて当該テレビ会議サーバ３のユーザ２４が発言したか否かを常時監視するようになされている。 At this time, the situation determination control unit 19 receives the audio signal S8 output from the input audio processing unit 38, and constantly monitors whether or not the user 24 of the video conference server 3 has made a speech based on the audio signal S8. It is made like that.

そして状況判断制御部１９は、この監視結果と、各テレビ会議端末２から送信される状況判断データＤ１０とに基づいて、いずれかのユーザ７が発言しているかを判断し、そのとき例えば１〜２秒以上継続して発言しているユーザ７を検出したときには、図５に示すように、対応するテレビ会議端末２又はテレビ会議サーバ３の端末・サーバＩＤを付加情報管理テーブル４９の発言ユーザ格納欄４９Ｃに格納する。 Based on this monitoring result and the situation determination data D10 transmitted from each video conference terminal 2, the situation determination control unit 19 determines whether any user 7 is speaking. When it detects a user 7 who has been speaking continuously for, for example, 1 to 2 seconds or more, it stores the terminal/server ID of the corresponding video conference terminal 2 or video conference server 3 in the speaking user storage column 49C of the additional information management table 49, as shown in FIG. 5.
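
Speech detection from the signal level, followed by the "1 to 2 seconds or more of continuous speech" test, can be sketched as below. The normalized level threshold, the 1.5-second duration, and the `(timestamp, source_id, level)` sample format are assumptions made for illustration; the description only states that the signal level is monitored.

```python
from typing import Dict, Iterable, Optional, Tuple

LEVEL_THRESHOLD = 0.1  # assumed normalized signal level that counts as speech
MIN_SPEECH_S = 1.5     # assumed value inside the "1 to 2 seconds" range


def detect_speaker(samples: Iterable[Tuple[float, str, float]]) -> Optional[str]:
    """Return the first terminal/server ID that speaks continuously long enough.

    `samples` are (timestamp, source_id, level) tuples in time order.
    """
    started: Dict[str, float] = {}
    for timestamp, source_id, level in samples:
        if level < LEVEL_THRESHOLD:
            # Silence resets the continuous-speech timer for this source.
            started.pop(source_id, None)
            continue
        start = started.setdefault(source_id, timestamp)
        if timestamp - start >= MIN_SPEECH_S:
            return source_id
    return None
```

The returned ID is what would be written into column 49C; short interjections below the duration threshold leave the table unchanged.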

また状況判断制御部１９は、この後この付加情報管理テーブル４９に基づいて映像合成部２０を制御することにより、各テレビ会議端末２及びテレビ会議サーバ３からの映像データＤ１１及びＤ１２に基づくユーザ映像２７を拡大又は縮小表示させる一方、これと併せて音声合成部３７を制御することにより、各テレビ会議端末２及びテレビ会議サーバ３からの音声データＤ１７及びＤ１８に基づくユーザ７の発言音声の音量を上げ又は下げさせる。 The situation determination control unit 19 then controls the video composition unit 20 based on the additional information management table 49 so that the user videos 27 based on the video data D11 and D12 from the video conference terminals 2 and the video conference server 3 are displayed enlarged or reduced. At the same time, it controls the voice synthesis unit 37 so that the volume of the speech of each user 7, based on the voice data D17 and D18 from the video conference terminals 2 and the video conference server 3, is raised or lowered.

例えば、状況判断制御部１９は、そのとき図６に示すように、ドキュメント画像３１が画面中央部に拡大表示され、あるユーザ７のユーザ映像２７がその左側に拡大表示された合成映像３２が映像合成部２０において生成されている場合において、他のユーザ７が発言した場合には、図７に示すように、それまで拡大表示されていたユーザ７と、その発言したユーザ７との表示位置及び大きさを入れ換えた図７に示すような合成映像３２の合成映像データＤ１５を映像合成部２０に生成させる。かくしてこの合成映像データＤ１５に基づくかかる合成画像３２が各テレビ会議端末２及びテレビ会議サーバ３のディスプレイ３５及び４６に表示される。 For example, suppose the video composition unit 20 is generating a composite video 32 in which, as shown in FIG. 6, the document image 31 is enlarged at the center of the screen and the user video 27 of one user 7 is enlarged to its left. If another user 7 then speaks, the situation determination control unit 19 causes the video composition unit 20 to generate composite video data D15 for a composite video 32, as shown in FIG. 7, in which the display positions and sizes of the user 7 who had been enlarged until then and the user 7 who spoke are swapped. The composite image 32 based on this composite video data D15 is then displayed on the displays 35 and 46 of the video conference server 3 and the video conference terminals 2.

また状況判断制御部１９は、この後テレビ会議サーバ３の入力音声処理部３８からの音声信号Ｓ８と、各テレビ会議端末２からの状況判断データＤ１０とを常時監視し、その後もそのユーザ７が継続して発言していると判断したときには、所定時間（例えば１〜２分程度）ごとに映像合成部２０及び音声合成部３７を制御する。 The situation determination control unit 19 thereafter constantly monitors the voice signal S8 from the input voice processing unit 38 of the video conference server 3 and the situation determination data D10 from each video conference terminal 2. When it determines that the user 7 is still speaking, it controls the video composition unit 20 and the voice synthesis unit 37 at every predetermined interval (for example, about 1 to 2 minutes).

かくしてこのとき映像合成部２０は、状況判断制御部１９の制御のもとに、そのとき拡大表示しているユーザ７のユーザ映像２７を当該所定時間ごとにさらに段階的に順次拡大させ、かつこれと同期して他のユーザ７のユーザ映像２７を当該所定時間ごとに段階的に順次縮小させた図１０に示すような合成映像３２の合成映像データＤ１５を生成し、これを表示処理部３３及び映像圧縮部３４にそれぞれ送出する。これによりこの合成映像データＤ１５に基づく合成映像３２が各テレビ会議端末２及びテレビ会議サーバ３の各ディスプレイ３５及び４６にそれぞれ表示される。 At this time, under the control of the situation determination control unit 19, the video composition unit 20 generates composite video data D15 for a composite video 32, as shown in FIG. 10, in which the user video 27 of the user 7 currently shown enlarged is enlarged a further step at each predetermined interval and, in synchronization, the user videos 27 of the other users 7 are reduced one step at each interval. It sends this data to the display processing unit 33 and the video compression unit 34. The composite video 32 based on this composite video data D15 is thereby displayed on the displays 35 and 46 of the video conference server 3 and the video conference terminals 2.

またこのとき音声合成部３７は、状況判断制御部１９の制御のもとに、そのとき拡大表示されているユーザ７の音声の音量を当該所定時間ごとに段階的に順次上げ、かつこれと同期して他のユーザ７の音声の音量を当該所定時間ごとに段階的に順次下げた合成音声の合成音声データＤ１９を生成し、これを出力音声処理部３９及び音声圧縮部４０にそれぞれ送出する。これによりこの合成音声データＤ１９に基づく合成音声が各テレビ会議端末２及びテレビ会議サーバ３の各スピーカ４１及び４８からそれぞれ出力される。 At this time, under the control of the situation determination control unit 19, the voice synthesis unit 37 gradually increases the volume of the voice of the user 7 displayed at that time step by step in increments of the predetermined time, and synchronizes with this. Then, the synthesized voice data D19 of the synthesized voice in which the volume of the voice of the other user 7 is lowered step by step in a stepwise manner is generated and sent to the output voice processing unit 39 and the voice compression unit 40, respectively. As a result, synthesized speech based on the synthesized speech data D19 is output from each of the speakers 41 and 48 of the video conference terminal 2 and the video conference server 3, respectively.

さらに例えば状況判断制御部１９は、図９に示すように、合成映像３２において、ドキュメント画像３１が画面左端に縮小表示されていると共に、その右側に各ユーザ７のユーザ映像２７が均等な大きさで表示されている場合において、あるユーザ７が継続して発言し続けた場合には、所定時間（例えば１〜２分程度）ごとに映像合成部２０及び音声合成部３７を制御する。 Further, for example, as shown in FIG. 9, the situation determination control unit 19 displays the document image 31 in the composite video 32 in a reduced size at the left end of the screen and the user video 27 of each user 7 on the right side thereof with an equal size. When a certain user 7 continues to speak, the video synthesis unit 20 and the voice synthesis unit 37 are controlled every predetermined time (for example, about 1 to 2 minutes).

かくしてこのとき映像合成部２０は、状況判断制御部１９の制御のもとに、そのユーザ７のユーザ映像２７を当該所定時間ごとに段階的に順次拡大させ、かつこれと同期して他のユーザ７のユーザ映像２７を当該所定時間ごとに段階的に順次縮小させた合成映像３２の合成映像データＤ１５を生成し、これを表示処理部３３及び映像圧縮部３４にそれぞれ送出する。これによりこの合成映像データＤ１５に基づく図１１に示すような合成画像３２が各テレビ会議端末２及びテレビ会議サーバ３のディスプレイ３５及び４６に表示される。 At this time, under the control of the situation determination control unit 19, the video composition unit 20 generates composite video data D15 for a composite video 32 in which the user video 27 of that user 7 is enlarged one step at each predetermined interval and, in synchronization, the user videos 27 of the other users 7 are reduced one step at each interval. It sends this data to the display processing unit 33 and the video compression unit 34. A composite image 32 as shown in FIG. 11, based on this composite video data D15, is thereby displayed on the displays 35 and 46 of the video conference server 3 and the video conference terminals 2.

またこのとき音声合成部３７は、状況判断制御部１９の制御のもとに、そのユーザ７の音声の音量を当該所定時間ごとに段階的に順次上げ、かつこれと同期して他のユーザ７の音声の音量を当該所定時間ごとに段階的に順次下げた合成音声の合成音声データＤ１９を生成し、これを出力音声処理部３９及び音声圧縮部４０にそれぞれ送出する。これによりこの合成音声データＤ１９に基づく合成音声が各テレビ会議端末２及びテレビ会議サーバ３の各スピーカ４１及び４８からそれぞれ出力される。 At this time, the voice synthesizing unit 37 gradually increases the volume of the voice of the user 7 step by step for every predetermined time under the control of the situation determination control unit 19 and synchronizes with this. The synthesized voice data D19 of the synthesized voice in which the volume of the voice is gradually reduced step by step for each predetermined time is generated and sent to the output voice processing unit 39 and the voice compression unit 40, respectively. As a result, synthesized speech based on the synthesized speech data D19 is output from each of the speakers 41 and 48 of the video conference terminal 2 and the video conference server 3, respectively.

これに対して状況判断制御部１９は、テレビ会議サーバ３の入力音声処理部３８から与えられる音声信号Ｓ８と、各テレビ会議端末２から送信される状況判断情報Ｄ１０とに基づいて、いずれのユーザ７も発言していないことを検出したときには、所定時間（例えば１〜２分程度）ごとに映像合成部２０及び音声合成部３７を制御する。 Conversely, when the situation determination control unit 19 detects, based on the voice signal S8 supplied from the input voice processing unit 38 of the video conference server 3 and the situation determination information D10 transmitted from each video conference terminal 2, that no user 7 is speaking, it controls the video composition unit 20 and the voice synthesis unit 37 at every predetermined interval (for example, about 1 to 2 minutes).

かくしてこのとき映像合成部２０は、状況判断制御部１９の制御のもとに、例えば図１０や図１１の状態から図９又は図１２のように各ユーザ７のユーザ映像２７が均等な所定の大きさとなるまで、そのとき拡大表示されていたユーザ７のユーザ映像２７を当該所定時間ごとに段階的に順次縮小し、かつこれと同期してその他のユーザ７のユーザ映像２７を当該所定時間ごとに段階的に順次拡大するような合成映像３２の合成映像データＤ１５を生成し、これを表示処理部３３及び映像圧縮部３４にそれぞれ送出する。これによりこの合成映像データＤ１５に基づいて、上述のように各ユーザ映像２７の大きさが均等な大きさとなるまで、段階的に縮小又は拡大してゆく合成映像３２が各テレビ会議端末２及びテレビ会議サーバ３の各ディスプレイ３５及び４６にそれぞれ表示される。 At this time, under the control of the situation determination control unit 19, the video composition unit 20 generates composite video data D15 for a composite video 32 in which, starting for example from the state of FIG. 10 or FIG. 11, the user video 27 of the user 7 who had been shown enlarged is reduced one step at each predetermined interval and, in synchronization, the user videos 27 of the other users 7 are enlarged one step at each interval, until the user videos 27 of all users 7 reach an equal predetermined size as in FIG. 9 or FIG. 12. It sends this data to the display processing unit 33 and the video compression unit 34. Based on this composite video data D15, a composite video 32 in which the user videos 27 shrink or grow step by step until they are all the same size is displayed on the displays 35 and 46 of the video conference server 3 and the video conference terminals 2.

またこのとき音声合成部３７は、状況判断制御部１９の制御のもとに、各ユーザ７の音声の音量が均等な所定の大きさとなるまで、そのとき音量が上げられていたユーザ７の音声の音量を当該所定時間ごとに段階的に順次下げ、かつこれと同期して他のユーザ７の音声の音量を当該所定時間ごとに段階的に順次上げるような合成音声の合成音声データＤ１９を順次生成し、これを出力音声処理部３９及び音声圧縮部４０にそれぞれ送出する。かくしてこの合成音声データＤ１９に基づいて、上述のように各ユーザ７の音声の音量が、段階的に均等な大きさとなるまで、下がり又は上がってゆく合成音声が各テレビ会議端末２又はテレビ会議サーバ３の各スピーカ４１及び４８からそれぞれ出力されることとなる。 At this time, the voice synthesizer 37 controls the voice of the user 7 whose volume has been raised until the volume of the voice of each user 7 becomes equal to a predetermined volume under the control of the situation determination control unit 19. The synthesized voice data D19 of the synthesized voice is sequentially reduced in a stepwise manner for each predetermined time and the volume of the voice of the other user 7 is gradually increased in a stepwise manner in synchronization with this. It is generated and sent to the output audio processing unit 39 and the audio compression unit 40, respectively. Thus, on the basis of the synthesized voice data D19, the synthesized voice that is lowered or raised until each user 7's voice volume becomes equal in a stepwise manner as described above is sent to each video conference terminal 2 or video conference server. 3 is output from each of the three speakers 41 and 48.
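
The emphasis and equalization behavior described in this section follows one rule per interval: raise the current speaker and lower everyone else, or, when nobody is speaking, move every user one step back toward a common level. The sketch below expresses that rule for a per-user gain factor. The step size, the limits, and the names are assumptions; the description gives no numeric values.

```python
from typing import Dict, Optional

STEP = 0.25        # assumed change applied at each predetermined interval
MIN_GAIN = 0.5     # assumed lower limit for non-speaking users
MAX_GAIN = 1.5     # assumed upper limit for the speaking user
EQUAL_GAIN = 1.0   # assumed "equal predetermined" level


def step_gains(gains: Dict[str, float], speaker: Optional[str]) -> Dict[str, float]:
    """Return the per-user gains after one interval.

    `speaker` is the terminal/server ID held in column 49C, or None if
    nobody is speaking.
    """
    result = {}
    for user_id, gain in gains.items():
        if speaker is None:
            # Nobody is speaking: move one step toward the equal level.
            if gain > EQUAL_GAIN:
                gain = max(EQUAL_GAIN, gain - STEP)
            elif gain < EQUAL_GAIN:
                gain = min(EQUAL_GAIN, gain + STEP)
        elif user_id == speaker:
            gain = min(MAX_GAIN, gain + STEP)
        else:
            gain = max(MIN_GAIN, gain - STEP)
        result[user_id] = gain
    return result
```

Because the description changes picture size and volume in synchronization, the same factor could plausibly drive both the scale of each user video 27 in the compositor and the gain applied to each voice before mixing.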

このようにして、このテレビ会議システム１では、ユーザ７の発言の有無に応じてユーザ７のユーザ映像２７を拡大又は縮小するように表示し得るようになされ、これによりユーザに余分な操作をさせることなく、ユーザ７のユーザ映像２７をメインに捉えさせ得るようになされている。 In this manner, in the video conference system 1, the user video 27 of the user 7 can be displayed so as to be enlarged or reduced according to the presence or absence of the user 7's speech, thereby causing the user to perform an extra operation. The user video 27 of the user 7 can be captured as a main.

（２−３）発言支援機能 (2-3) Speech support function
他方、このテレビ会議システム１には、ユーザ操作に応じて図１３に示すような支援マーク画像５０や図１４に示すようなテキスト画像５１を合成映像３２上に表示するようにして、そのユーザ７及び２４の発言を支援する発言支援機能が搭載されている。 The video conference system 1 is further equipped with a speech support function that supports the speech of the users 7 and 24 by displaying, in response to user operation, a support mark image 50 as shown in FIG. 13 or a text image 51 as shown in FIG. 14 on the composite video 32.

In practice, in this video conference system 1, each video conference terminal 2 is provided with a dedicated switch group 53 consisting of a plurality of dedicated switches, each associated with a speech content such as "Speak" or "Understood".

When the user 7 presses the dedicated switch in the dedicated switch group 53 that is associated with the desired speech content, a corresponding pressing operation signal S13 is output to the situation determination unit 13 and the speech image generation unit 54.

Each video conference terminal 2 is also provided with a keyboard 55. When the user 7 performs a predetermined operation and enters a character string such as "The cause is ○○", a keyboard operation signal S14 indicating that a keyboard operation has occurred is output to the situation determination unit 13, and the text data D27 of the entered text is output to the speech image generation unit 54.

When the situation determination unit 13 receives the pressing operation signal S13 from the dedicated switch group 53 or the keyboard operation signal S14 from the keyboard 55, it sends corresponding situation determination data D5 to the packetizing unit 10. The situation determination data D5 is then transmitted to the video conference server 3 via the network NT as described above and supplied to the situation determination control unit 19 of the video conference server 3.

When the speech image generation unit 54 receives the pressing operation signal S13 from the dedicated switch group 53 or the text data D27 from the keyboard 55, it generates either a support mark image 50 corresponding to the pressing operation signal S13, for example as shown in FIG. 13, or a text image 51 based on the text data D27, for example as shown in FIG. 14. It then sends the image data D28 (hereinafter referred to as speech image data) of the support mark image 50 or the text image 51 (hereinafter collectively referred to as the speech image) to the video compression unit 9.

As with the video data D1, the speech image data D28 is then subjected to predetermined compression and packetization in the video compression unit 9 and the packetizing unit 10, transmitted to the video conference server 3 via the network NT, and supplied to the video composition unit 20 via the depacketizing unit 16 and the video decompression unit 17 in that order.

The video conference server 3 is likewise provided with a dedicated switch group 56 consisting of a plurality of dedicated switches similar to those of the video conference terminal 2. When the user 7 presses the dedicated switch in the dedicated switch group 56 that is associated with the desired speech content, a corresponding pressing operation signal S15 is supplied to the situation determination control unit 19 and the speech image generation unit 57.

The video conference server 3 is also provided with a keyboard 58. When the user 7 performs a predetermined operation and enters a character string such as "The cause is ○○", a keyboard operation signal S16 indicating that a keyboard operation has occurred is output to the situation determination control unit 19, and the text data D29 of the entered text is output to the speech image generation unit 57.

When the speech image generation unit 57 receives the pressing operation signal S15 from the dedicated switch group 56 or the text data D29 from the keyboard 58, it generates either a support mark image 50 corresponding to the pressing operation signal S15, for example as shown in FIG. 13, or a text image 51 based on the text data D29, for example as shown in FIG. 14. It then sends the image data D30 (speech image data) of the speech image consisting of the support mark image 50 or the text image 51 to the video composition unit 20.

At this time, the situation determination control unit 19 constantly monitors the pressing operation signal S15 or the keyboard operation signal S16 supplied from the dedicated switch group 56 or the keyboard 58 of the video conference server 3, together with the situation determination data D10 transmitted from each video conference terminal 2. When it recognizes, based on the pressing operation signal S15, the keyboard operation signal S16, or the situation determination data D10, that one of the dedicated switches in the dedicated switch group 53 or 56 has been pressed at any video conference terminal 2 or at the video conference server 3, or that a character string to be displayed on the composite video 32 has been entered via the keyboard 55 or 58, it sends a corresponding control signal D31 to the video composition unit 20.

Based on the control signal D31, the video composition unit 20 then starts from the composite video 32, which combines the document image 31 based on the document data D14 from the document storage unit 26, the video 28 based on the video data D12 supplied from the video processing unit 25 of the video conference server 3, and the user videos 27 of the users 7 based on the video data D11 transmitted from each video conference terminal 2. On this composite video 32 it superimposes, at the corresponding position, the speech image (support mark image 50 or text image 51) based on the speech image data D28 or D30 transmitted from the corresponding video conference terminal 2 or supplied from the speech image generation unit 54 or 57. It generates composite video data D15 of the resulting composite video 32 and sends it to the display processing unit 33 and the video compression unit 34. As a result, based on the composite video data D15, a composite video 32 in which the speech image (support mark image 50 or text image 51) as shown in FIG. 13 or FIG. 14 appears near the user 7 who performed the operation is displayed on the displays 35 and 46 of each video conference terminal 2 and the video conference server 3.
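How a speech image might be positioned near the operating user can be sketched as follows. The specification says only that the speech image is superimposed at a corresponding position near that user. The `Rect` type, the rule of centring the overlay above the user's tile, and the clamping to the frame are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in composite-video pixel coordinates."""
    x: int
    y: int
    w: int
    h: int


def place_speech_image(tile, overlay_w, overlay_h, frame_w, frame_h):
    """Return where to draw a speech image for the user shown in `tile`.

    Hypothetical rule: centre the overlay horizontally on the user's
    tile and place it directly above the tile, then clamp the result
    so that the overlay stays inside the composite frame.
    """
    x = tile.x + (tile.w - overlay_w) // 2
    y = tile.y - overlay_h
    x = max(0, min(x, frame_w - overlay_w))
    y = max(0, min(y, frame_h - overlay_h))
    return Rect(x, y, overlay_w, overlay_h)
```

A compositor would call `place_speech_image` once per pending speech image, using the tile that currently holds the operating user's video, and then draw the overlay at the returned rectangle.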

In this way, the video conference system 1 allows a user to convey his or her intention to speak, or the desired speech content, to the other users in advance and unobtrusively, without interrupting the user who is speaking at that time.

(2-4) Illumination Function During a Video Conference
In addition, as shown in FIG. 15, each video conference terminal 2 and the video conference server 3 of this video conference system 1 is equipped with an illumination function that illuminates the floor surface and the like around a user 7 taking part in a video conference, so that people nearby recognize that a video conference is in progress.

In practice, in this video conference system 1, as is also apparent from FIGS. 2 and 3, the video conference server 3 and each video conference terminal 2 are provided with light sources 59 and 60, respectively, for illuminating the floor surface and the like around the user 7.

When the situation determination control unit 19 of the video conference server 3 connects, in response to a user operation, to the corresponding video conference terminals 2 in a communicable state, for example under SIP (Session Initiation Protocol) control, it starts supplying a drive voltage to the light source 59 and thereby turns the light source 59 on as shown in FIG. 15(A). When it subsequently disconnects from all video conference terminals 2 in response to a user operation, it stops supplying the drive voltage to the light source 59 and thereby turns the light source 59 off as shown in FIG. 15(B).

Similarly, when the situation determination unit 13 of each video conference terminal 2 connects to the video conference server 3 in a communicable state in response to a user operation, it starts supplying a drive voltage to the light source 60 and thereby turns the light source 60 on as shown in FIG. 15(A). When it subsequently disconnects from the video conference server 3 in response to a user operation, it stops supplying the drive voltage to the light source 60 and thereby turns the light source 60 off as shown in FIG. 15(B).
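The connection-driven lighting behaviour described above amounts to a small state holder: the light is on while at least one peer is connected and off otherwise. The sketch below assumes a hypothetical `ConferenceLight` class and string peer labels. It models only the on/off decision, not SIP signalling or the drive-voltage hardware.

```python
class ConferenceLight:
    """Tracks connected peers and derives the light source state.

    On the server side the peers would be the video conference
    terminals. On a terminal the only peer is the server, so the same
    rule reduces to "lit while connected to the server".
    """

    def __init__(self):
        self.connected = set()
        self.lit = False

    def connect(self, peer):
        """Record a newly established connection and update the light."""
        self.connected.add(peer)
        self._update()

    def disconnect(self, peer):
        """Record a closed connection and update the light."""
        self.connected.discard(peer)
        self._update()

    def _update(self):
        # Drive voltage is supplied only while some connection remains.
        self.lit = bool(self.connected)
```

With two terminals connected, disconnecting one leaves the light on. It turns off only when the last connection is closed, which matches the server-side condition of disconnecting from all terminals.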

In this way, the video conference system 1 clearly distinguishes between a user doing his or her own work and a user taking part in a video conference, and makes it easy for people nearby to recognize that a video conference is in progress. This effectively prevents the user from being spoken to by people nearby during the video conference, so that the conference can proceed smoothly without interruption.

(3) Specific Processing of the Situation Determination Control Unit 19
(3-1) Display Control Processing Procedure for Document Operations and the Like
The display control performed by the situation determination control unit 19 of the video conference server 3 at the time of a document operation and the like, as described above, follows the display control processing procedure RT1 shown in FIG. 16, based on a control program supplied to the situation determination control unit 19 in advance.

That is, when the situation determination control unit 19 connects to the corresponding video conference terminals 2 in a communicable state, it starts the display control processing procedure RT1 at step SP0 and, at the following step SP1, determines whether a document operation has been performed. The situation determination control unit 19 waits at step SP1 for a user 7 to perform a document operation, and only when one has been performed does it proceed to step SP2 and determine whether the document image 31 is displayed.

If the situation determination control unit 19 obtains a negative result at step SP2, it proceeds to step SP3, displays the document image 31 enlarged, and displays the user video 27 of the user 7 who performed the document operation enlarged beside the document image 31. If it obtains an affirmative result, it proceeds to step SP4, changes the displayed document image 31, and replaces the user video shown beside the document image 31 with the user video 27 of the user 7 who performed the document operation.

The situation determination control unit 19 then proceeds to step SP5 and determines whether a cursor operation has been performed within a predetermined time. If it obtains an affirmative result at step SP5, it proceeds to step SP7. If it obtains a negative result, it proceeds to step SP6 and displays the document image 31 reduced stepwise.

The situation determination control unit 19 then proceeds to step SP7 and determines whether the same user 7 has continued the cursor operation for a predetermined time. If it obtains an affirmative result at step SP7, it proceeds to step SP8 and displays the document image 31 enlarged stepwise. If it obtains a negative result, it proceeds to step SP9 and displays the user video 27 of the user 7 who performed the cursor operation beside the document image 31.

The situation determination control unit 19 then proceeds to step SP10 and determines whether the user 7 has performed an operation to end the video conference (hereinafter referred to as an end operation). If it obtains a negative result, it returns to step SP1 and repeats the same processing for steps SP1 to SP10. When the user 7 eventually performs the end operation and an affirmative result is obtained at step SP10, the situation determination control unit 19 proceeds to step SP11 and ends the display control processing procedure RT1.
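One pass through procedure RT1, from a detected document operation to the end-operation check, can be summarised as a short decision function. This is a sketch under stated assumptions, not the control program itself: the function name `rt1_pass` and its boolean inputs are hypothetical. It also reads a negative result at step SP5 as leading to a stepwise reduction at step SP6, and assumes that step SP6 is followed directly by the end check at step SP10.

```python
def rt1_pass(doc_displayed, cursor_within_timeout, same_user_kept_cursor):
    """Return the action steps taken in one pass of procedure RT1.

    doc_displayed         -- SP2: is the document image already displayed?
    cursor_within_timeout -- SP5: was a cursor operation seen in time?
    same_user_kept_cursor -- SP7: did the same user keep operating it?
    """
    steps = []

    # SP2 decides between opening the document view and updating it.
    if not doc_displayed:
        steps.append("SP3")  # enlarge document image, operator video beside it
    else:
        steps.append("SP4")  # switch document image, swap operator video

    # SP5: with no cursor activity, the document image shrinks stepwise.
    if not cursor_within_timeout:
        steps.append("SP6")  # assumed step number for stepwise reduction
        return steps

    # SP7: sustained cursor use by one user enlarges the document image.
    if same_user_kept_cursor:
        steps.append("SP8")  # enlarge document image stepwise
    else:
        steps.append("SP9")  # show the cursor user's video beside the document
    return steps
```

A caller would invoke `rt1_pass` each time a document operation is detected at step SP1 and repeat until the end operation of step SP10 is observed.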

(3-2) Display and Volume Control Processing Procedure for Voice Detection and the Like
Meanwhile, when the situation determination control unit 19 connects to the corresponding video conference terminals 2 in a communicable state, it starts the display and volume control processing procedure RT2 shown in FIG. 17 at step SP12, in parallel with the display control processing procedure RT1 described above. At the following step SP13, it determines whether voice from a user 7 has been detected within a predetermined time. If it obtains an affirmative result at step SP13, it proceeds to step SP15. If it obtains a negative result, it proceeds to step SP14 and displays the user videos 27 of the users 7 at equal size.

At step SP15, the situation determination control unit 19 then determines whether voice from the same user 7 has been detected continuously for a predetermined time or longer. If it obtains a negative result at step SP15, it proceeds to step SP17. If it obtains an affirmative result, it proceeds to step SP16, where it enlarges the user video 27 of the user 7 whose voice was detected and raises that user's volume, while reducing the user videos 27 of the other users 7 and lowering their volume.

At step SP17, the situation determination control unit 19 then determines whether the document image 31 is displayed. If it obtains a negative result at step SP17, it proceeds to step SP19. If it obtains an affirmative result, it proceeds to step SP18 and displays the user video 27 of the user 7 whose voice was detected beside the document image 31.

The situation determination control unit 19 then proceeds to step SP19 and determines whether the user 7 has performed the end operation. If it obtains a negative result, it returns to step SP13 and repeats the same processing for steps SP13 to SP19. When the user 7 eventually performs the end operation and an affirmative result is obtained at step SP19, the situation determination control unit 19 proceeds to step SP20 and ends the display and volume control processing procedure RT2.
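Procedure RT2 can be summarised in the same way. The sketch below is illustrative only: `rt2_pass` and its boolean inputs are hypothetical names, and the returned step labels stand in for the actual display and volume commands issued to the video composition unit 20 and the voice synthesis unit 37.

```python
def rt2_pass(voice_detected, same_speaker_sustained, doc_displayed):
    """Return the action steps taken in one pass of procedure RT2.

    voice_detected         -- SP13: was any voice detected in time?
    same_speaker_sustained -- SP15: did one user keep speaking long enough?
    doc_displayed          -- SP17: is the document image displayed?
    """
    steps = []

    # SP13: with no voice, all user videos return to equal size.
    if not voice_detected:
        steps.append("SP14")

    # SP15: a sustained speaker is enlarged and made louder,
    # while the other users are reduced and made quieter.
    if same_speaker_sustained:
        steps.append("SP16")

    # SP17: with a document on screen, the speaker appears beside it.
    if doc_displayed:
        steps.append("SP18")
    return steps
```

Because RT2 runs in parallel with RT1, a real controller would have to reconcile the layout requests from both procedures. The sketch leaves that arbitration out.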

In this way, the situation determination control unit 19 controls the video composition unit 20 and the voice synthesis unit 37 so as to change the display position and size of the user videos 27 of the users 7 and of the document image 31, as well as the volume.

(4) Operation and Effects of This Embodiment
In the above configuration, the video conference system 1 displays the document image 31 and the user video 27 of the user 7 enlarged or reduced according to the document operation of the user 7.

Accordingly, without requiring any extra operation from the user 7, the system can make the document image 31 the main focus at the time of a document operation and the like, while also making the user video 27 of the user 7 who performed the operation a main focus.

When the document image 31 is displayed in the composite video 32, the system monitors how long a user 7 has been performing a cursor operation on the document image 31. While the same user 7 continues the cursor operation, the document image 31 is enlarged stepwise at each predetermined time interval (for example, about 1 to 2 minutes). The document image 31 can therefore be made the main focus while a user 7 is performing a cursor operation on it, without requiring any extra operation from the user 7.

Furthermore, because the video conference system 1 displays the user video 27 of a user 7 enlarged or reduced according to whether that user 7 is speaking, the user video 27 of the user 7 who spoke can be made the main focus without requiring any extra operation from the user 7.

Furthermore, when one of the dedicated switches in the dedicated switch group 53 or 56 is pressed at any video conference terminal 2 or at the video conference server 3, or a character string to be displayed on the composite video 32 is entered via the keyboard 55 or 58, the video conference system 1 displays the speech image (support mark image 50 or text image 51) near the user 7 who performed the operation. A user can therefore convey his or her intention to speak, or the desired speech content, to the other users in advance and unobtrusively, without interrupting the user who is speaking at that time.

Furthermore, when the video conference system 1 connects in a communicable state in response to a user operation, it starts supplying a drive voltage to the light sources 59 and 60 and thereby turns the light sources 59 and 60 on. When it subsequently disconnects in response to a user operation, it stops supplying the drive voltage to the light sources 59 and 60 and thereby turns them off. People nearby can therefore easily recognize that a video conference is in progress, which effectively prevents the user from being spoken to during the video conference and allows the conference to proceed smoothly without interruption.

According to the above configuration, the document image 31 and the user video 27 of the user 7 are displayed enlarged or reduced according to the document operation of the user 7. Without requiring any extra operation from the user 7, the document image 31 can thus be made the main focus at the time of a document operation and the like, while the user video 27 of the user 7 who performed the operation is also made a main focus. A video conference system with good communication properties and a strong sense of presence can thereby be realized.

In addition, when the document image 31 is displayed in the composite video 32, the system monitors how long a user 7 has been performing a cursor operation on the document image 31. While the same user 7 continues the cursor operation, the document image 31 is enlarged stepwise at each predetermined time interval (for example, about 1 to 2 minutes). The document image 31 can therefore be made the main focus while a user 7 is performing a cursor operation on it, without requiring any extra operation from the user 7, and a video conference system with good communication properties and a strong sense of presence can thereby be realized.

Furthermore, because the user video 27 of a user 7 is displayed enlarged or reduced according to whether that user 7 is speaking, the user video 27 of the user 7 who spoke can be made the main focus without requiring any extra operation from the user 7, and a video conference system with good communication properties and a strong sense of presence can thereby be realized.

Furthermore, when one of the dedicated switches in the dedicated switch group 53 or 56 is pressed at any video conference terminal 2 or at the video conference server 3, or a character string to be displayed on the composite video 32 is entered via the keyboard 55 or 58, the speech image (support mark image 50 or text image 51) is displayed near the user 7 who performed the operation. A user can therefore convey his or her intention to speak, or the desired speech content, to the other users in advance and unobtrusively, without interrupting the user who is speaking at that time, and a video conference system with good communication properties and a strong sense of presence can thereby be realized.

Furthermore, when a connection in a communicable state is established in response to a user operation, supply of a drive voltage to the light sources 59 and 60 is started and the light sources 59 and 60 are thereby turned on. When the connection is subsequently disconnected in response to a user operation, the supply of the drive voltage to the light sources 59 and 60 is stopped and they are thereby turned off. People nearby can therefore easily recognize that a video conference is in progress, which effectively prevents the user from being spoken to during the video conference and allows the conference to proceed smoothly without interruption. A video conference system with good communication properties and a strong sense of presence can thereby be realized.

(5) Other Embodiments
The above embodiment describes a video conference system 1 consisting of the video conference server 3 and two video conference terminals 2 (2A, 2B). However, the present invention is not limited to this and can be applied to various other systems that transmit and receive video and audio in real time and that consist of one or more terminal devices and an information processing apparatus.

The above embodiment describes the case where, while the document image 31 is displayed, the user video 27 of a user 7 is displayed enlarged or reduced according to whether that user 7 is speaking. However, the present invention is not limited to this. Even when the document image 31 is not displayed, the user video 27 of a user 7 can be displayed enlarged or reduced according to whether that user 7 is speaking, in the same manner as described above.

The above embodiment describes the case where the video conference terminals 2 and the video conference server 3 are configured and perform processing as described above. However, the present invention is not limited to this. For example, the video conference server may perform only the composition processing and the situation determination control processing on the various data sent to it and then transmit the resulting data. Alternatively, each video conference terminal may itself perform the composition processing and the situation determination control processing on the various data sent to it and then transmit the resulting data. Various other configurations can also be widely applied.

The above embodiment describes the case where a speech image generation unit is provided in each of the video conference terminals 2 and in the video conference server 3. However, the present invention is not limited to this. For example, a speech image generation unit may be provided only in the video conference server, and various other configurations can also be widely applied.

The above embodiment describes the case where the user video 27 of a user 7 is displayed enlarged or reduced when a cursor operation that moves a cursor image 29 on the document for 1 to 2 seconds or more is performed, or when a user's speech is detected for 1 to 2 seconds or more. However, the present invention is not limited to this, and the time may be made longer or shorter in consideration of various other factors.

The above embodiment describes the case where, when cursor operations by users 7 and/or speech by users 7 occur at the same time, the document image 31 and the user videos 27 of the users 7 are displayed enlarged or reduced to corresponding sizes based on the operation time and/or the detection time. However, the present invention is not limited to this. For example, the document image 31 and the user videos 27 may be displayed enlarged or reduced to corresponding sizes based on the volume level, the amount of cursor movement, and/or the number of document operations or voice detections within a unit time, and various other criteria can also be used.

Further, in the above-described embodiment, the case has been described where, at any of the video conference terminals 2 or the video conference server 3, a speech image (support mark image 50 or text image 51) is displayed near the user 7 (7A or 7B) or 24 who performed the operation, in response to a pressing operation on any dedicated switch in the dedicated switch group 53 or 56, or to the input via the keyboard 55 or 58 of a character string to be displayed on the composite video 32. However, the present invention is not limited to this. For example, the speech image (support mark image 50 or text image 51) may be displayed near the user 7 who performed the operation while the video 27 of that user 7 is displayed large in the center, so that the user is given priority to speak.

Further, in the above-described embodiment, the case has been described where the volume of the voice of the speaking user 7 is raised step by step at each predetermined time and, in synchronization with this, the volume of the voices of the other users 7 is lowered step by step at each predetermined time. However, the present invention is not limited to this, and only the volume of the voices of the other users 7 may be lowered step by step at each predetermined time. In short, it suffices that the volume for the speaking user is made relatively larger than the volume for the other users.
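The stepwise volume control described above, including the variation in which only the other users are lowered, might look like the following sketch. The function name `step_gains`, the gain representation (a linear factor per user), and the step, floor, and ceiling values are assumptions for illustration and are not taken from the embodiment.

```python
def step_gains(
    gains: dict,
    speaker: str,
    step: float = 0.25,
    floor: float = 0.25,
    ceiling: float = 1.0,
    raise_speaker: bool = True,
) -> dict:
    """Apply one step of relative volume control (hypothetical sketch).

    Called once per predetermined time while `speaker` keeps talking.
    The other users are lowered by `step` down to `floor`. The speaker is
    raised by `step` up to `ceiling` unless `raise_speaker` is False,
    which models the "lower the others only" variation.
    """
    new_gains = {}
    for user, gain in gains.items():
        if user == speaker:
            new_gains[user] = min(ceiling, gain + step) if raise_speaker else gain
        else:
            new_gains[user] = max(floor, gain - step)
    return new_gains
```

In both modes the speaker's gain ends up larger than the others' after one step, which is the only property the text requires.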

The present invention can also be used in various devices other than video conference systems, for example where video and audio are transmitted, received, and displayed in real time.

A schematic diagram showing the configuration of the video conference system according to the present embodiment.
A block diagram showing the internal configuration of the video conference terminal according to the present embodiment.
A block diagram showing the internal configuration of the video conference server according to the present embodiment.
A conceptual diagram for explaining the display on the display.
A conceptual diagram for explaining the additional information management table.
A conceptual diagram for explaining the display on the display.
A conceptual diagram for explaining the display on the display.
A conceptual diagram for explaining the display on the display.
A conceptual diagram for explaining the display on the display.
A conceptual diagram for explaining the display on the display.
A conceptual diagram for explaining the display on the display.
A conceptual diagram for explaining the display on the display.
A conceptual diagram for explaining the display on the display.
A conceptual diagram for explaining the display on the display.
A conceptual diagram for explaining the distinction between a video conference and the user's own work.
A flowchart showing the display control processing procedure at the time of a document operation or the like.
A flowchart showing the display and volume control processing procedure at the time of voice detection or the like.

Explanation of symbols

1 ... Video conference system, 2 ... Video conference terminal, 3 ... Video conference server, 4, 21 ... Camera unit, 5, 22 ... Microphone, 6, 23 ... Pointing device, 7 ... User, 13 ... Situation determination unit, 19 ... Situation determination control unit, 20 ... Video synthesis unit, 27 ... User video, 29 ... Cursor image, 31 ... Document image, 32 ... Composite video, 35, 46 ... Display, 37 ... Audio synthesis unit, 49 ... Additional information management table, NT ... Network, RT1 ... Display control processing procedure at the time of a document operation or the like, RT2 ... Display and volume control processing procedure at the time of voice detection or the like.

Claims

A video conference system including one or more terminal devices installed at different points and an information processing apparatus connected to the terminal devices via a predetermined network, wherein
the terminal device comprises:
photographing means for photographing a user;
sound collecting means for collecting the user's voice;
document operation means for performing a predetermined document operation on a document;
transmitting means for transmitting video data output from the photographing means and audio data output from the sound collecting means to the information processing apparatus via the network; and
display means for displaying video based on composite video data transmitted from the information processing apparatus via the network, and
the information processing apparatus comprises:
video synthesizing means for generating a composite video by synthesizing the video of the user, based on the video data transmitted from the terminal device via the network, with an image of a required document;
control means for controlling the video synthesizing means so as to enlarge or reduce the image of the document and/or the corresponding video of the user in the composite video, based on the document operation using the document operation means and/or the presence or absence of the user's speech detected based on the audio data; and
transmission means for transmitting the composite video data, which is video data of the composite video generated by the video synthesizing means, to the terminal device via the network.

The video conference system according to claim 1, wherein
the information processing apparatus comprises audio synthesizing means for generating synthesized audio by synthesizing the audio based on the audio data transmitted from each of the terminal devices via the network,
the control means controls the audio synthesizing means, in accordance with the user's speech time, so that the volume for the corresponding user in the synthesized audio is relatively larger than the volume for the other users,
the transmission means transmits synthesized audio data, which is audio data of the synthesized audio generated by the audio synthesizing means, to the terminal device via the network, and
the terminal device comprises audio output means for outputting the synthesized audio based on the synthesized audio data transmitted from the information processing apparatus.

The video conference system according to claim 1, wherein
the terminal device comprises:
a light source that emits light; and
blinking control means for controlling blinking of the light source, and
the blinking control means turns the light source on when the terminal device is connected to the information processing apparatus, and turns the light source off when the connection with the information processing apparatus is disconnected.

An information processing apparatus connected via a predetermined network to one or more terminal devices installed at different points, comprising:
video synthesizing means for generating a composite video by synthesizing the video of a user, based on video data of the photographed user transmitted from the terminal device via the network, with an image of a required document;
control means for controlling the video synthesizing means so as to enlarge or reduce the image of the document and/or the corresponding video of the user in the composite video, based on a predetermined document operation on the document transmitted from the terminal device via the network and/or the presence or absence of the user's speech detected based on audio data of the collected voice of the user; and
transmission means for transmitting composite video data, which is video data of the composite video generated by the video synthesizing means, to the terminal device via the network.

The information processing apparatus according to claim 4, wherein
the document operation is an operation of displaying or changing the document, or of changing the page of the document, and
the control means controls the video synthesizing means so as to enlarge the image of the document in the composite video and/or the video of the user who performed the document operation when the document operation is performed.

The information processing apparatus according to claim 5, wherein
the control means controls the video synthesizing means so that, when the document operation is not performed for a predetermined time, the image of the document and the corresponding video of the user are reduced stepwise at each predetermined time.

The information processing apparatus according to claim 4, wherein
the document operation is an operation of moving a cursor displayed on the composite video over the image of the document, and
the control means controls the video synthesizing means so as to enlarge the image of the document in the composite video stepwise in accordance with the operation time of the document operation.

The information processing apparatus according to claim 7, wherein
the control means controls the video synthesizing means so that, when the document operation is not performed for a predetermined time, the image of the document and the corresponding video of the user are reduced stepwise at each predetermined time.

The information processing apparatus according to claim 4, wherein
the control means, when the document operation using the document operation means and/or the user's speech detected based on the audio data does not continue for a predetermined time or longer, controls the video synthesizing means on the assumption that the document operation or the user's speech has not been performed.

The information processing apparatus according to claim 4, wherein
the control means controls the video synthesizing means so as to enlarge the corresponding video of the user in the composite video stepwise in accordance with the speech time of the user.

The information processing apparatus according to claim 4, further comprising
audio synthesizing means for generating synthesized audio by synthesizing the audio based on the audio data transmitted from each of the terminal devices via the network, wherein
the control means controls the audio synthesizing means, in accordance with the user's speech time, so that the volume for the corresponding user in the synthesized audio is relatively larger than the volume for the other users, and
the transmission means transmits synthesized audio data, which is audio data of the synthesized audio generated by the audio synthesizing means, to the terminal device via the network.

The information processing apparatus according to claim 4, wherein
the control means controls the video synthesizing means so as to display, on the composite video, a predetermined image supporting the user's speech in response to a predetermined user operation.

The information processing apparatus according to claim 4, wherein
the control means controls the video synthesizing means so as to display, on the composite video, a text image representing a character string input by the user in response to the user operation.

An information processing method for an information processing apparatus connected via a predetermined network to one or more terminal devices installed at different points, the method comprising:
a first step of generating a composite video by synthesizing the video of a user, based on video data of the photographed user transmitted from the terminal device via the network, with an image of a required document;
a second step of enlarging or reducing the image of the document and/or the corresponding video of the user in the composite video, based on a predetermined document operation on the document transmitted from the terminal device via the network and/or the presence or absence of the user's speech detected based on audio data of the collected voice of the user; and
a third step of transmitting composite video data, which is video data of the composite video, to the terminal device via the network.

A program for causing an information processing apparatus connected via a predetermined network to one or more terminal devices installed at different points to execute:
a first step of generating a composite video by synthesizing the video of a user, based on video data of the photographed user transmitted from the terminal device via the network, with an image of a required document;
a second step of enlarging or reducing the image of the document and/or the corresponding video of the user in the composite video, based on a predetermined document operation on the document transmitted from the terminal device via the network and/or the presence or absence of the user's speech detected based on audio data of the collected voice of the user; and
a third step of transmitting composite video data, which is video data of the composite video, to the terminal device via the network.