JPH10336617A

JPH10336617A - Video conference system and video conference method

Info

Publication number: JPH10336617A
Application number: JP9154441A
Authority: JP
Inventors: Takaaki Tsurukame; 崇昭鶴亀; Ritsu Katayama; 立片山
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 1997-05-27
Filing date: 1997-05-27
Publication date: 1998-12-18

Abstract

PROBLEM TO BE SOLVED: To reduce the delay of images relative to voice and to avoid adversely affecting other communication systems. SOLUTION: The voice data and image data of each conference participant, together with standard image data used when the speaker cannot be identified, are registered in a data memory 22, and photographing conditions are set. When there is voice input in the local district, the voice data are transmitted to the other districts through a network 14. A speech recognition part 36 performs voice recognition. If the speaker is identified as one already registered in the data memory 22, the corresponding image data are loaded from the data memory 22. If the speaker cannot be identified, then depending on the image mode, either the standard image data are loaded or image data from an image preparation part 24 are used, and the image data are transmitted to the other districts. Conversely, when there is voice input from another district, the image data from that district are acquired. Then, in each TV conference unit 12, the voice data and image data are processed by a voice/image output device 40 and a data memory 42 for the proceedings, according to the recording mode.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a TV conference system and a TV conference method. More particularly, it relates to a TV conference system and a TV conference method in which a TV conference is held among a plurality of districts using, for example, a plurality of TV conference units connected to a network.

2. Description of the Related Art
In a conventional TV conference, a plurality of distant districts are connected via a network. When someone speaks during the conference, the speaker's voice data and moving-image data showing the speaker are transmitted to the other districts via the network.

[0002]

In this case, moving-image data are far larger than voice data, so processing the moving-image data takes longer than processing the voice data, and the image ends up lagging behind the voice. In addition to the load of controlling the TV conference system, the amount of traffic on the network also increases. This is a particular problem when the TV conference system is built on a network of general-purpose lines, such as telephone lines, rather than dedicated lines, because it degrades the performance of other communication systems that use the same network.

SUMMARY OF THE INVENTION
Therefore, a main object of the present invention is to provide a TV conference system and a TV conference method that keep the delay of the image relative to the voice small and do not adversely affect other communication systems.

[0003]

To achieve the above object, the TV conference system according to claim 1 holds a TV conference among a plurality of districts using a plurality of TV conference units connected to a network. Each TV conference unit comprises:
- storage means for storing first voice data and image data for each conference participant;
- speaker identifying means for identifying, when a voice is uttered in the local district, the speaker who uttered it, on the basis of second voice data corresponding to that voice and the first voice data stored in the storage means;
- data transmitting means for transmitting the image data of the identified speaker and the second voice data to the TV conference units of all the other districts;
- data receiving means for receiving, when a voice is uttered in another district, the image data and the second voice data from the TV conference unit of that district; and
- voice/image output means for outputting an image and a voice corresponding respectively to the speaker's image data and the second voice data.

[0004] In this TV conference system, when a voice is uttered in the conference, the speaker identifying means in the TV conference unit of the district where the voice was uttered identifies the speaker. It does so on the basis of the second voice data corresponding to that voice and the first voice data stored in the storage means. The data transmitting means transmits the identified speaker's image data and the second voice data to the TV conference units of all the other districts, which receive them through their data receiving means. Each TV conference unit then outputs the corresponding image and voice through its voice/image output means.
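The Python sketch below roughly illustrates the claim 1 data flow, modelling one unit per district. It is a minimal sketch under stated assumptions, not the patented implementation. `ConferenceUnit`, `speak`, `received` and the toy `exact_match` identifier are hypothetical names, and byte strings stand in for real voice and image data. The sketch shows that the sending unit identifies the speaker itself and then sends a registered still image together with the voice, rather than a moving-image stream.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Identifier = Callable[[bytes, Dict[str, bytes]], Optional[str]]

@dataclass
class ConferenceUnit:
    """Hypothetical model of one district's TV conference unit (claim 1)."""
    district: str
    voices: Dict[str, bytes]   # participant -> registered ("first") voice data
    images: Dict[str, bytes]   # participant -> registered still-image data
    identify: Identifier       # speaker identifying means
    received: List[dict] = field(default_factory=list)

    def speak(self, voice: bytes, peers: List["ConferenceUnit"]) -> Optional[str]:
        # Identify the speaker locally from the captured ("second") voice data.
        speaker = self.identify(voice, self.voices)
        image = self.images.get(speaker) if speaker else None
        # Send the speaker's image data with the voice data to every other district.
        for peer in peers:
            peer.received.append({"image": image, "voice": voice})
        return speaker

def exact_match(voice: bytes, refs: Dict[str, bytes]) -> Optional[str]:
    # Toy identifier: returns the participant whose stored voice is byte-identical.
    return next((name for name, ref in refs.items() if ref == voice), None)
```

In a real unit, `exact_match` would be replaced by a speaker-recognition routine such as the process outlined in [0023]. The message flow would stay the same.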

[0005] The TV conference system according to claim 2 holds a TV conference among a plurality of districts using a plurality of TV conference units connected to a network. Each TV conference unit comprises:
- storage means for storing first voice data, ID data, and image data for each conference participant;
- speaker identifying means for identifying, when a voice is uttered in the local district, the speaker who uttered it, on the basis of second voice data corresponding to that voice and the first voice data stored in the storage means;
- data transmitting means for transmitting the ID data of the identified speaker and the second voice data to the TV conference units of all the other districts;
- data receiving means for receiving, when a voice is uttered in another district, the ID data and the second voice data from the TV conference unit of that district;
- means for selecting, from the image data stored in the storage means, the image data corresponding to the ID data; and
- voice/image output means for outputting an image and a voice corresponding respectively to the speaker's image data and the second voice data.

[0006] In this TV conference system, when a voice is uttered in the conference, the speaker identifying means in the TV conference unit of the district where the voice was uttered identifies the speaker. It does so on the basis of the second voice data corresponding to that voice and the first voice data stored in the storage means. The data transmitting means transmits the identified speaker's ID data and the second voice data to the TV conference units of all the other districts, which receive them through their data receiving means. In each TV conference unit, the image data corresponding to the ID data are then selected from the image data stored in the storage means. The voice/image output means outputs an image and a voice corresponding respectively to the image data and the second voice data.
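Claim 2 replaces the image in each message with the speaker's ID. The sketch below shows why this shrinks the payload: the receiver looks the image up in its own storage instead of receiving it. The patent does not specify a wire format, so the sketch assumes a hypothetical one-line JSON header followed by raw voice bytes. `IMAGES_BY_ID`, `build_message` and `handle_message` are illustrative names only.

```python
import json
from typing import Dict, Optional, Tuple

# Hypothetical image registry held in the storage means of every unit:
# participant ID -> still-image data (dummy bytes here).
IMAGES_BY_ID: Dict[int, bytes] = {
    1: b"\xff\xd8" + bytes(4000),
    2: b"\xff\xd8" + bytes(5000),
}

def build_message(speaker_id: Optional[int], voice: bytes) -> bytes:
    """Sender side: send the speaker's ID with the voice data, not the image."""
    header = json.dumps({"id": speaker_id}).encode("utf-8")  # never contains a newline
    return header + b"\n" + voice

def handle_message(msg: bytes,
                   images: Dict[int, bytes] = IMAGES_BY_ID) -> Tuple[Optional[bytes], bytes]:
    """Receiver side: parse the ID and select the image from local storage."""
    header, voice = msg.split(b"\n", 1)  # only the first newline separates header and voice
    speaker_id = json.loads(header)["id"]
    return images.get(speaker_id), voice
```

With these dummy sizes, a message for speaker 2 is a few dozen bytes instead of roughly 5 KB of image data. This saving assumes that every unit was given the same ID-to-image registry beforehand.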

[0007] The TV conference system according to claim 3 holds a TV conference among a plurality of districts using a plurality of TV conference units connected to a network. Each TV conference unit comprises:
- storage means for storing first voice data and image data for each conference participant;
- data transmitting means for transmitting, when a voice is uttered in the local district, second voice data corresponding to that voice to the TV conference units of all the other districts;
- data receiving means for receiving, when a voice is uttered in another district, the second voice data from the TV conference unit of that district;
- speaker identifying means for identifying the speaker who uttered the voice, on the basis of the second voice data and the first voice data stored in the storage means; and
- voice/image output means for outputting an image and a voice corresponding respectively to the identified speaker's image data and the second voice data.

[0008] In this TV conference system, when a voice is uttered in the conference, the data transmitting means in the TV conference unit of the district where the voice was uttered transmits the second voice data corresponding to that voice to the TV conference units of all the other districts. Those units receive the second voice data through their data receiving means. In the TV conference unit of each district, the speaker identifying means identifies the speaker on the basis of the second voice data and the first voice data stored in the storage means. Each TV conference unit then outputs, through its voice/image output means, an image and a voice corresponding respectively to the identified speaker's image data and the second voice data.
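Under claim 3, only the voice data travel over the network, and every unit (including the sender) runs its own speaker identification. The following is a minimal sketch. It assumes each district is a plain dict with hypothetical `name`, `voices` and `images` keys, and it uses a toy exact-match identifier in place of real recognition.

```python
from typing import Callable, Dict, List, Optional

Identifier = Callable[[bytes, Dict[str, bytes]], Optional[str]]

def broadcast_voice(voice: bytes, sites: List[dict],
                    identify: Identifier) -> Dict[str, Optional[bytes]]:
    """Only the second voice data cross the network; each district's unit
    identifies the speaker against its own stored first voice data."""
    shown: Dict[str, Optional[bytes]] = {}
    for site in sites:
        speaker = identify(voice, site["voices"])
        shown[site["name"]] = site["images"].get(speaker) if speaker else None
    return shown

def exact_match(voice: bytes, refs: Dict[str, bytes]) -> Optional[str]:
    # Toy identifier for illustration only.
    return next((name for name, ref in refs.items() if ref == voice), None)
```

The trade-off is that each district must hold the first voice data of every participant and must carry the recognition load itself.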

[0009] The TV conference system according to claim 4 holds a TV conference among a plurality of districts using a control unit and a plurality of TV conference units connected to a network.

The control unit comprises:
- storage means for storing first voice data and ID data for each conference participant;
- voice data receiving means for receiving, when a voice is uttered in any district, second voice data corresponding to that voice from the TV conference unit of that district;
- speaker identifying means for identifying the speaker who uttered the voice, on the basis of the received second voice data and the first voice data stored in the storage means; and
- data transmitting means for transmitting the ID data of the identified speaker and the second voice data to the TV conference units of all the districts.

Each TV conference unit comprises:
- storage means for storing ID data and image data for each conference participant;
- voice data transmitting means for transmitting, when a voice is uttered in the local district, second voice data corresponding to that voice to the control unit;
- data receiving means for receiving the second voice data and the ID data from the control unit;
- means for selecting, from the image data stored in the storage means, the image data corresponding to the received ID data; and
- voice/image output means for outputting an image and a voice corresponding respectively to the speaker's image data and the second voice data.

[0010] In this TV conference system, when a voice is uttered in the conference, the voice data transmitting means in the TV conference unit of the district where the voice was uttered transmits the second voice data corresponding to that voice to the control unit. The control unit receives it through its voice data receiving means. In the control unit, the speaker identifying means identifies the speaker on the basis of the received second voice data and the first voice data stored in the storage means. The control unit then uses its data transmitting means to transmit the identified speaker's ID data and the second voice data to the TV conference units of all the districts. Each TV conference unit receives the ID data and the second voice data from the control unit through its data receiving means. It then selects the image data corresponding to the ID data from the image data stored in its storage means. Finally, its voice/image output means outputs an image and a voice corresponding respectively to the selected image data and the second voice data.
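Claim 4 centralizes speaker identification in a control unit. The sketch below shows the relay using hypothetical `ControlUnit` and `SiteUnit` classes and a toy identifier. A district sends only its voice data up. The control unit then sends the identified ID and the voice data back to every district, including the originating one. This is an in-memory illustration under those assumptions, not a network implementation.

```python
from typing import Callable, Dict, List, Optional, Tuple

IdIdentifier = Callable[[bytes, Dict[int, bytes]], Optional[int]]

class ControlUnit:
    """Holds first voice data keyed by participant ID and identifies speakers."""

    def __init__(self, refs: Dict[int, bytes], identify: IdIdentifier):
        self.refs = refs
        self.identify = identify
        self.units: List["SiteUnit"] = []

    def on_voice(self, voice: bytes) -> Optional[int]:
        speaker_id = self.identify(voice, self.refs)
        # Send the ID and the voice data to the units of all districts.
        for unit in self.units:
            unit.receive(speaker_id, voice)
        return speaker_id

class SiteUnit:
    """A district's unit: stores ID -> image data and outputs what it receives."""

    def __init__(self, images: Dict[int, bytes], control: ControlUnit):
        self.images = images
        self.control = control
        self.output: List[Tuple[Optional[bytes], bytes]] = []
        control.units.append(self)

    def speak(self, voice: bytes) -> None:
        # Only the voice data go up to the control unit.
        self.control.on_voice(voice)

    def receive(self, speaker_id: Optional[int], voice: bytes) -> None:
        # Select the locally stored image that matches the received ID.
        self.output.append((self.images.get(speaker_id), voice))

def match_id(voice: bytes, refs: Dict[int, bytes]) -> Optional[int]:
    # Toy identifier for illustration only.
    return next((pid for pid, ref in refs.items() if ref == voice), None)
```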

[0011] The TV conference system according to claim 5 is the TV conference system according to any one of claims 1 to 4 with the following additions. The storage means of each TV conference unit further stores standard image data, used as the image data when the speaker cannot be identified. Each TV conference unit further comprises:
- image creating means for creating image data relating to the conference;
- image mode setting means for setting which image data are used when the speaker cannot be identified;
- means for outputting, when the image mode is "use the standard image data when the speaker cannot be identified" and the speaker cannot be identified, an image and a voice corresponding respectively to the standard image data and the second voice data; and
- means for outputting, when the image mode is "use the image data created by the image creating means when the speaker cannot be identified" and the speaker cannot be identified, an image and a voice corresponding respectively to the image data created by the image creating means and the second voice data.

In this TV conference system, when the speaker cannot be identified, either the standard image data or the image data created by the image creating means are used, depending on the image mode.
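The claim 5 fallback amounts to a small selection rule. In this sketch, the `ImageMode` values and the `select_image` signature are assumptions. `capture` is a callable that stands in for the image creating means, so the camera is read only when the speaker cannot be identified.

```python
from enum import Enum
from typing import Callable, Dict, Optional

class ImageMode(Enum):
    STANDARD = 0  # use the stored standard image data
    CAMERA = 1    # use image data from the image creating means

def select_image(speaker: Optional[str], images: Dict[str, bytes], mode: ImageMode,
                 standard: bytes, capture: Callable[[], bytes]) -> bytes:
    """Return the image to output alongside the second voice data."""
    if speaker is not None and speaker in images:
        return images[speaker]
    # Speaker not identified: fall back according to the image mode.
    if mode is ImageMode.STANDARD:
        return standard
    return capture()
```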

[0012] The TV conference system according to claim 6 is the TV conference system according to any one of claims 1 to 5, in which each TV conference unit further comprises:
- recording means for recording the speaker's image data and the second voice data; and
- recording mode setting means for setting whether the recording means records the speaker's image data and second voice data.

In this TV conference system, the image and voice corresponding to the image data and the second voice data are output. In addition, whether the data are also recorded is set according to the recording mode.

[0013] The TV conference system according to claim 7 is the TV conference system according to claim 6, in which the recording mode setting means further comprises means for setting whether only the second voice data are recorded or both the second voice data and the image data are recorded. In this TV conference system, when data are recorded, the setting determines whether only the second voice data or both the image data and the second voice data are recorded.
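Claims 6 and 7 add a recording mode with three outcomes: record nothing, record the voice only, or record both the voice and the image. The following is a minimal sketch. It assumes an in-memory list serves as the recording means and uses hypothetical `RecordMode` and `record` names.

```python
from enum import Enum
from typing import List, Optional

class RecordMode(Enum):
    OFF = 0
    VOICE_ONLY = 1
    VOICE_AND_IMAGE = 2

def record(log: List[dict], mode: RecordMode, voice: bytes, image: Optional[bytes]) -> None:
    """Append one utterance to the conference record according to the mode."""
    if mode is RecordMode.OFF:
        return
    entry: dict = {"voice": voice}
    if mode is RecordMode.VOICE_AND_IMAGE:
        entry["image"] = image
    log.append(entry)
```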

[0014] The TV conference method according to claim 8 holds a TV conference among a plurality of districts using a plurality of TV conference units connected to a network. Each TV conference unit executes the following steps:
- storing first voice data and image data for each conference participant;
- when a voice is uttered in the local district, identifying the speaker who uttered it, on the basis of second voice data corresponding to that voice and the stored first voice data;
- transmitting the image data of the identified speaker and the second voice data to the TV conference units of all the other districts;
- when a voice is uttered in another district, receiving the image data and the second voice data from the TV conference unit of that district; and
- outputting an image and a voice corresponding respectively to the speaker's image data and the second voice data.

[0015] In this TV conference method, when a voice is uttered in the conference, the TV conference unit of the district where the voice was uttered identifies the speaker. It does so on the basis of the second voice data corresponding to that voice and the stored first voice data. The identified speaker's image data and the second voice data are transmitted to the TV conference units of all the other districts, which receive them. Each TV conference unit then outputs an image and a voice corresponding respectively to the speaker's image data and voice data.

[0016] The TV conference method according to claim 9 holds a TV conference among a plurality of districts using a plurality of TV conference units connected to a network. Each TV conference unit executes the following steps:
- storing first voice data, ID data, and image data for each conference participant;
- when a voice is uttered in the local district, identifying the speaker who uttered it, on the basis of second voice data corresponding to that voice and the stored first voice data;
- transmitting the ID data of the identified speaker and the second voice data to the TV conference units of all the other districts;
- when a voice is uttered in another district, receiving the ID data and the second voice data from the TV conference unit of that district;
- selecting, from the stored image data, the image data corresponding to the ID data; and
- outputting an image and a voice corresponding respectively to the speaker's image data and the second voice data.

[0017] In this TV conference method, when a voice is uttered in the conference, the TV conference unit of the district where the voice was uttered identifies the speaker. It does so on the basis of the second voice data corresponding to that voice and the stored first voice data. The identified speaker's ID data and the second voice data are transmitted to the TV conference units of all the other districts, which receive them. Each TV conference unit then selects the image data corresponding to the ID data from the stored image data. It outputs an image and a voice corresponding respectively to those image data and the second voice data.

[0018]

DESCRIPTION OF THE EMBODIMENTS
Embodiments of the present invention are described below with reference to the drawings. FIG. 1 shows a TV conference held among three districts: district α, district β and district γ. The TV conference system 10 shown in FIG. 1 has a TV conference unit 12 in each of these districts. The TV conference units 12 are connected via a network 14 that uses, for example, telephone lines or dedicated lines. Data are transmitted by, for example, packet communication.

[0019] The TV conference unit 12 includes a unit main body 16, a lens 18 and a microphone 20. The unit main body 16 consists of a computer such as a personal computer or a workstation, and it includes a data memory 22. A built-in memory or a PC card, for example, is used as the data memory 22.

The data memory 22 stores a program for controlling the operation of the TV conference unit 12. It also holds the following registered data:
- voice data and image data for each conference participant;
- standard image data, which are used when the speaker cannot be identified and do not show any particular participant; and
- optionally, ID data identifying each speaker.

The image data are information with a small data volume. Examples include still-image data of the face or upper body (see FIG. 2) of the person whose voice data were registered, and character information giving that person's name, profile and so on. Such image data may therefore consist of text data as well as picture data. The standard image data are also small. An example is image data captured with a composition that gives a rough overview of the whole conference room in one of the districts. The data memory 22 further stores an image mode flag and a recording mode flag, which are described later.

[0020] The unit main body 16 also includes an image creating unit 24. The lens 18 and the image creating unit 24 together constitute the image creating means. The image creating unit 24 includes a solid-state image sensor such as a CCD, an amplifier circuit, an A/D conversion circuit, a signal processing circuit and so on (not shown). It produces image data as follows:
1. The solid-state image sensor photoelectrically converts the brightness of the subject received through the lens 18 and outputs an analog image signal.
2. The amplifier circuit amplifies the image signal.
3. The A/D conversion circuit digitizes the amplified signal.
4. The signal processing circuit processes the digitized signal, performing color adjustment and the like, to obtain image data.

These image data are supplied to a data control unit 34 (described later). The image data from the image creating unit 24 are needed mainly when the person speaking in the conference cannot be matched to any speaker registered in the data memory 22, that is, when the speaker cannot be identified.

[0021] The unit main body 16 also has an image mode setting unit 26. Any input device, such as a keyboard or a mouse, can serve as the image mode setting unit 26. It sets one of two image modes: "use the standard image data when the speaker cannot be identified" or "use the image data created by the image creating means when the speaker cannot be identified". The image mode is represented by, for example, a 1-bit flag in the data memory 22.

The unit main body 16 further has a recording mode setting unit 28, for which any input device such as a keyboard or a mouse can likewise be used. The recording mode determines whether data are recorded and, if so, whether only the voice data or both the voice data and the image data are recorded. The recording mode is represented by, for example, a 2-bit flag in the data memory 22.
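Paragraph [0021] states only that the image mode fits in one bit and the recording mode in two bits. One possible way to pack both into a single flag byte is sketched below. The bit positions and value assignments are assumptions and are not taken from the patent.

```python
IMAGE_MODE_MASK = 0b001  # bit 0: image mode (0 = standard image, 1 = camera image)
RECORD_SHIFT = 1
RECORD_MASK = 0b110      # bits 1-2: recording mode (0 = off, 1 = voice, 2 = voice + image)

def pack_flags(image_mode: int, record_mode: int) -> int:
    """Pack both mode settings into one flag byte (assumed layout)."""
    if image_mode not in (0, 1) or record_mode not in (0, 1, 2):
        raise ValueError("mode value out of range")
    return image_mode | (record_mode << RECORD_SHIFT)

def unpack_flags(flags: int) -> tuple:
    """Return (image_mode, record_mode) from a packed flag byte."""
    return flags & IMAGE_MODE_MASK, (flags & RECORD_MASK) >> RECORD_SHIFT
```

Two bits give four codes for three recording modes, so one code is left unused and is rejected here.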

[0022] The unit main body 16 also has a voice input unit 30 that receives the voice from the microphone 20. The voice input unit 30 includes an A/D conversion circuit and a filter (neither shown). The filter blocks noise while passing the conference participants' voices. The input analog voice is first digitized by the A/D conversion circuit, and the filter then removes noise. The voice input unit 30 thus outputs the speaker's digitized voice data. These data are supplied to a data control unit 34 of a CPU 32 and then, via the data control unit 34, to a voice recognition unit 36.
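The voice input path in [0022] consists of A/D conversion followed by a noise filter, but the patent leaves the filter design open. The sketch below assumes 16-bit quantization and a simple band-pass filter built from a first-order high-pass stage and a first-order low-pass stage. The 300 Hz and 3.4 kHz corner frequencies are an assumption borrowed from the conventional telephone voice band. `quantize` and `band_pass` are hypothetical names.

```python
import math
from typing import List, Sequence

def quantize(analog: Sequence[float], bits: int = 16) -> List[int]:
    """A/D conversion stand-in: clip to [-1, 1] and map to signed integers."""
    full = 2 ** (bits - 1) - 1
    return [round(max(-1.0, min(1.0, x)) * full) for x in analog]

def band_pass(x: Sequence[float], rate: float,
              lo: float = 300.0, hi: float = 3400.0) -> List[float]:
    """First-order RC high-pass at `lo` followed by first-order RC low-pass at `hi`."""
    dt = 1.0 / rate
    rc_hp = 1.0 / (2 * math.pi * lo)
    rc_lp = 1.0 / (2 * math.pi * hi)
    a_hp = rc_hp / (rc_hp + dt)
    a_lp = dt / (rc_lp + dt)
    out: List[float] = []
    hp_prev = x_prev = lp = 0.0
    for s in x:
        hp = a_hp * (hp_prev + s - x_prev)  # y[n] = a * (y[n-1] + x[n] - x[n-1])
        hp_prev, x_prev = hp, s
        lp += a_lp * (hp - lp)              # y[n] = y[n-1] + a * (x[n] - y[n-1])
        out.append(lp)
    return out
```

A first-order filter rolls off gently, so this sketch only attenuates out-of-band noise. A production front end would normally use a higher-order design.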

[0023] The voice recognition unit 36 performs voice recognition using a known technique. The recognition process is briefly described below with reference to FIG. 3. Standard patterns are registered in the voice recognition unit 36 in advance. They are obtained by processing each conference participant's voice data in the same way as during recognition. Recognition then proceeds in four steps:
1. Acoustic analysis. The speaker's digitized voice data are converted into a sequence of parameters that express the speaker's characteristics more clearly. Usable parameters include voice energy, amplitude, formant frequencies, short-time spectrum, frequency spectrum, linear prediction parameters, pitch frequency, fundamental frequency and cepstrum.
2. Pattern conversion. The parameter sequence is converted into a fixed-length pattern vector by processing such as averaging along the time axis, segmentation, and time alignment with the standard patterns.
3. Comparison with the standard patterns. The distance between this pattern vector and each speaker's registered standard pattern is calculated.
4. Decision. The unit determines which of the registered speakers is speaking.
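The four recognition steps of FIG. 3 can be sketched in pure Python as follows. This is a deliberately simplified stand-in, not the known technique the patent relies on:
- The only parameters are per-frame log energy and zero-crossing rate.
- Pattern conversion is plain averaging along the time axis, with no segmentation or time alignment.
- The distance is Euclidean.
- The rejection `threshold`, which yields "speaker not identified", is an assumption.

All function names are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence

def acoustic_analysis(samples: Sequence[float], frame: int = 256) -> List[List[float]]:
    """Step 1: per-frame parameters (log energy, zero-crossing rate)."""
    feats = []
    for start in range(0, len(samples) - frame + 1, frame):
        x = samples[start:start + frame]
        energy = math.log(sum(s * s for s in x) / frame + 1e-12)
        zcr = sum(1 for a, b in zip(x, x[1:]) if (a < 0) != (b < 0)) / (frame - 1)
        feats.append([energy, zcr])
    return feats

def pattern_vector(feats: List[List[float]]) -> List[float]:
    """Step 2: average along the time axis to get a fixed-length vector."""
    if not feats:
        raise ValueError("utterance shorter than one frame")
    n = len(feats)
    return [sum(f[i] for f in feats) / n for i in range(len(feats[0]))]

def identify(samples: Sequence[float], templates: Dict[str, List[float]],
             threshold: float) -> Optional[str]:
    """Steps 3-4: distance to each standard pattern, then decide (None = unidentified)."""
    v = pattern_vector(acoustic_analysis(samples))
    best, best_d = None, math.inf
    for name, t in templates.items():
        d = math.dist(v, t)
        if d < best_d:
            best, best_d = name, d
    return best if best_d <= threshold else None
```

To enroll a participant, run `pattern_vector(acoustic_analysis(...))` on that participant's registered voice. The results form the standard patterns passed to `identify` as `templates`.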

[0024] The voice recognition result from the voice recognition unit 36 is supplied to the data control unit 34. The data control unit 34 selects the image data (or ID data) to be output according to the recognition result, which switches the displayed image. The selected image data (or ID data) are loaded from the data memory 22 into the data control unit 34. The data control unit 34 then supplies the selected image data (or ID data) and the voice data to a data transmitting/receiving unit 38.

[0025] The data transmitting/receiving unit 38 handles data in three ways:
- It supplies the image data (or ID data) and voice data from the data control unit 34 to a voice/image output device 40 and to a conference record data memory 42, according to the recording mode.
- It transmits the image data (or ID data) and voice data to the TV conference units 12 of the other districts via the network 14.
- It receives image data (or ID data) and voice data from the TV conference units 12 of the other districts and supplies them to the data control unit 34. According to the recording mode, it also supplies them to the voice/image output device 40 and the conference record data memory 42.

[0026] The audio/image output device 40 outputs the voice and image corresponding, respectively, to the voice data and image data of the speaker in the conference. The image is displayed, for example, as shown in FIG. 4, which shows the case where Mr. A in district α is speaking. Depending on the recording mode, the conference record data memory 42 records either the voice data of the speaker, or both the voice data of the speaker and the image data relating to that speaker.

[0027] A device having a speaker and a monitor (or display), such as a TV receiver or a personal computer, is used as the audio/image output device 40. A large-capacity recording medium such as a computer hard disk is preferable as the conference record data memory 42, but a built-in memory or a PC card may also be used. When a built-in memory or a PC card is used, the data memory 22 can double as the conference record data memory 42, in which case a separate conference record data memory 42 is unnecessary. The voice data of the speaker and the image data relating to the speaker may also be recorded on a video recorder.

[0028] The TV conference system 10 configured as described above can operate in three ways: (1) the district where a voice is uttered identifies the speaker in advance by voice recognition and transmits the identified speaker's voice data and image data to the other districts (FIG. 5); (2) the district where the voice is uttered identifies the speaker in advance by voice recognition and transmits the identified speaker's voice data and ID data to the other districts (FIG. 6); and (3) the district where the voice is uttered transmits only the voice data to the other districts, and each district performs voice recognition itself (FIG. 7).
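The three operating schemes differ mainly in what the speaking district puts on the network and in how each receiving district obtains the image to display. A minimal Python sketch of that difference follows; the `Scheme` enum, the message field names and the `recognize` callable are hypothetical, and the unidentified-speaker branches are omitted.

```python
from enum import Enum
from typing import Callable, Dict, Optional


class Scheme(Enum):
    IMAGE = "FIG. 5"       # voice data + image data are sent
    ID = "FIG. 6"          # voice data + ID data are sent
    VOICE_ONLY = "FIG. 7"  # only voice data is sent


def outgoing_message(scheme: Scheme, voice: bytes,
                     image: Optional[bytes],
                     speaker_id: Optional[str]) -> Dict[str, object]:
    """What the district where the voice was uttered sends to the others."""
    if scheme is Scheme.IMAGE:
        return {"voice": voice, "image": image}
    if scheme is Scheme.ID:
        return {"voice": voice, "id": speaker_id}
    return {"voice": voice}


def image_at_receiver(scheme: Scheme, message: Dict[str, object],
                      image_table: Dict[str, bytes],
                      recognize: Callable[[bytes], str]) -> bytes:
    """How a receiving district obtains the image to display."""
    if scheme is Scheme.IMAGE:
        return message["image"]                       # image arrives over the network
    if scheme is Scheme.ID:
        return image_table[message["id"]]             # local lookup by shared ID
    return image_table[recognize(message["voice"])]   # local voice recognition
```

Moving from FIG. 5 to FIG. 7 trades network traffic for local work: the less that is sent, the more each receiving district must look up or recognize on its own.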

[0029] First, with reference to FIG. 5, the operation of each TV conference unit 12 will be described for the case where the district in which a voice is uttered identifies the speaker in advance by voice recognition and transmits the speaker's voice data and image data to the other districts. When the power supply (not shown) of each TV conference unit 12 is turned on and the TV conference system 10 is started, the following are registered in the data memory 22: the voice data of the conference participants, the corresponding image data, the standard image data used when the speaker is not identified, the photographing conditions (for example, the photographing time) of the image creating means (the lens 18 and the image creating unit 24) used when the speaker is not identified, the image mode, and the recording mode. In addition, the photographing conditions of the image creating means used when the speaker is not identified (for example, the shooting position and composition) are set (step S1).
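The data registered and set in step S1 can be pictured as a single record, with step S3 checking that nothing is missing. The following Python sketch is only illustrative: the field names, the enum members and the completeness rule in `is_complete` are assumptions, not definitions from the patent.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class ImageMode(Enum):
    STANDARD_IMAGE = "use registered standard image"
    CREATED_IMAGE = "use image from image creating unit 24"


class RecordingMode(Enum):
    OUTPUT_ONLY = "output only"
    RECORD_VOICE = "output and record voice"
    RECORD_VOICE_AND_IMAGE = "output and record voice and image"


@dataclass
class Registration:
    voice_data: Dict[str, bytes] = field(default_factory=dict)  # participant -> voice sample
    image_data: Dict[str, bytes] = field(default_factory=dict)  # participant -> image
    standard_image: Optional[bytes] = None    # used when the speaker is not identified
    shooting_time_s: Optional[float] = None   # registered photographing condition
    shooting_position: Optional[str] = None   # set condition (position/composition)
    image_mode: Optional[ImageMode] = None
    recording_mode: Optional[RecordingMode] = None

    def is_complete(self) -> bool:
        """Rough analogue of the step S3 check (assumed rule): every
        participant has both voice and image data, and every setting is filled."""
        participants_ok = (bool(self.voice_data)
                           and self.voice_data.keys() == self.image_data.keys())
        settings = (self.standard_image, self.shooting_time_s,
                    self.shooting_position, self.image_mode, self.recording_mode)
        return participants_ok and all(s is not None for s in settings)
```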

[0030] These data may be registered separately for each district, or data common to all districts may be registered. To register common data in all districts, the data can first be registered in one district and then transmitted to the other districts via the network 14, or each district can hold a recording medium containing the common data and write the data from that medium into its own unit.

[0031] If the image data of each participant and the standard image data for an unidentified speaker are registered in the data memory 22 in advance, the voice corresponding to each image can be input on the spot (for example, in the conference room of each district) through the microphone 20, and the voice data corresponding to the image data can be registered in the data memory 22. In this case, at the same time as the voice data from the microphone 20 is registered in the data memory 22, the voice recognition unit 36 can apply predetermined processing to the voice data and hold the result as a standard pattern. Necessary voice data and image data can also be input on the spot using the image creating means and the microphone 20 and registered in the data memory 22. Further, for the case where the speaker is not identified, the image creating means can capture not only still image data but also moving image data for a predetermined time. When moving image data is captured, memory consumption can be reduced by shooting only a short clip and playing it back repeatedly whenever the speaker is not identified. That is, from the viewpoint of reducing the data amount and memory consumption, a short photographing time for the unidentified-speaker case is preferable.
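Why a short clip saves memory, and how it can be replayed for an unidentified-speaker period of any length, can be shown in a few lines. This Python sketch assumes uncompressed frames of a fixed size; the function names and figures are hypothetical.

```python
from typing import Sequence, TypeVar

Frame = TypeVar("Frame")


def clip_size_bytes(frame_bytes: int, fps: int, seconds: int) -> int:
    """Memory needed to store a clip of uncompressed frames (assumed format):
    it grows linearly with the photographing time."""
    return frame_bytes * fps * seconds


def looped_frame(clip: Sequence[Frame], elapsed_s: float, fps: float) -> Frame:
    """Frame to show `elapsed_s` seconds into an unidentified-speaker period,
    replaying the stored clip from the start whenever it runs out."""
    if not clip:
        raise ValueError("empty clip")
    return clip[int(elapsed_s * fps) % len(clip)]
```

Because playback wraps around with a modulo, the stored clip length, and therefore memory use, is independent of how long the speaker stays unidentified.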

[0032] The image mode and the processing mode can be registered in the same manner as the data described above, and these modes can also be switched while the TV conference system 10 is operating. If these data and modes have been registered in advance, the registration in step S1 is unnecessary. The same registration methods apply to each of the operations described later.

[0033] It is then determined whether the registration and setting in step S1 are complete (step S3); if not, the unit waits until they are. Once step S1 is complete, it is determined whether there is a voice input (step S5). At this time, the data control unit 34 adopts as the voice input whichever arrives first: the voice data from the voice input unit 30 or the voice data from another district received by the data transmitting/receiving unit 38. The same applies to the operation examples described later. If there is no voice input in step S5, the unit waits until there is one; if there is, it is determined whether the voice data is from another district (step S7). If it is not, that is, if it is voice data of the unit's own district, the voice data is transmitted to the other districts (step S9), and the voice recognition unit 36 performs voice recognition (step S11).
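The first-arrival rule of step S5 and the branch at step S7 can be sketched as follows. This is a hypothetical Python illustration: `VoiceInput`, its fields and the use of a single local clock for arrival times are assumptions, and the returned strings merely name the FIG. 5 steps that would follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoiceInput:
    from_other_district: bool
    arrival_time: float  # seconds on the unit's local clock (assumed)
    data: bytes


def select_voice_input(local: Optional[VoiceInput],
                       remote: Optional[VoiceInput]) -> Optional[VoiceInput]:
    """Step S5: adopt whichever voice input arrived first; None means keep waiting."""
    candidates = [v for v in (local, remote) if v is not None]
    if not candidates:
        return None
    return min(candidates, key=lambda v: v.arrival_time)


def next_step(selected: Optional[VoiceInput]) -> str:
    """Step S7 branch, labelled with FIG. 5 step numbers."""
    if selected is None:
        return "S5: wait"
    if selected.from_other_district:
        return "S35: acquire image data from the other district"
    return "S9/S11: send voice to other districts, then recognize"
```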

[0034] Using the result of the voice recognition, the data control unit 34 determines whether the speaker can be identified as one of the speakers registered in the data memory 22 (step S13). If so, the image data of that speaker is loaded from the data memory 22 into the data control unit 34 (step S15) and transmitted to the other districts (step S17).

[0035] The data control unit 34 then determines the recording mode (step S19). In the recording mode "output data only, do not record", the data control unit 34 supplies the voice data and the corresponding image data to the audio/image output device 40, which outputs the voice and image (step S21). In the recording mode "output and record data", the data control unit 34 determines whether only the voice data is to be recorded (step S23). In the recording mode "record voice data only", the voice data and the corresponding image data are supplied to the audio/image output device 40 and the voice data is supplied to the conference record data memory 42; the voice and image are output from the audio/image output device 40, and the voice data is recorded in the conference record data memory 42 (step S25). In the recording mode "record voice data and image data", the data control unit 34 supplies the voice data and the corresponding image data to both the audio/image output device 40 and the conference record data memory 42; the audio/image output device 40 outputs the voice and image, and the conference record data memory 42 records the voice data and the corresponding image data (step S27).
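Steps S19 to S27 amount to "always output, then record according to the mode". A minimal Python sketch, in which two plain lists are hypothetical stand-ins for the audio/image output device 40 and the conference record data memory 42:

```python
from enum import Enum
from typing import List, Optional, Tuple


class RecordingMode(Enum):
    OUTPUT_ONLY = "output data only, do not record"
    RECORD_VOICE = "record voice data only"
    RECORD_VOICE_AND_IMAGE = "record voice data and image data"


Record = Tuple[bytes, Optional[bytes]]


def dispatch(mode: RecordingMode, voice: bytes, image: bytes,
             output_device: List[Record], minutes_memory: List[Record]) -> None:
    """Steps S19-S27: output in every mode, then record according to the mode."""
    output_device.append((voice, image))          # S21 / S25 / S27
    if mode is RecordingMode.RECORD_VOICE:
        minutes_memory.append((voice, None))      # S25: voice only
    elif mode is RecordingMode.RECORD_VOICE_AND_IMAGE:
        minutes_memory.append((voice, image))     # S27: voice and image
```

The same dispatch applies unchanged to steps S71 to S79 of FIG. 6 and S117 to S125 of FIG. 7.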

[0036] On the other hand, if in step S13 the speaker cannot be identified as a registered speaker, it is determined whether to adopt the registered standard image data for an unidentified speaker, that is, the image mode is determined (step S29). In the image mode that adopts the standard image data, the standard image data is loaded from the data memory 22 into the data control unit 34 (step S31) and transmitted to the other districts (step S17), and the process proceeds to step S19. Then, as described above, the voice data and the standard image data are processed in steps S21 to S27 according to the recording mode.

[0037] If in step S29 the image mode is one that does not adopt the standard image data, that is, one that adopts image data created by the image creating means, the image data from the image creating unit 24 is input to the data control unit 34 as the image data for the unidentified speaker (step S33) and transmitted to the other districts (step S17), and the process proceeds to step S19. Then, as described above, the voice data and the image data from the image creating unit 24 are processed in steps S21 to S27 according to the recording mode.

[0038] If it is determined in step S7 that the voice data was input from another district, the image data from that district is acquired (step S35) and the process proceeds to step S19. Then, as described above, the voice data and the image data from the other district are processed in steps S21 to S27 according to the recording mode.
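Taken together, steps S7 to S35 decide which image accompanies a given voice input in the FIG. 5 scheme. A hypothetical Python sketch of that decision, where `capture` stands in for the image creating unit 24 and all parameter names are illustrative:

```python
from typing import Callable, Dict, Optional


def image_for_voice(from_other_district: bool,
                    received_image: Optional[bytes],
                    speaker_id: Optional[str],
                    image_table: Dict[str, bytes],
                    use_standard_image: bool,
                    standard_image: bytes,
                    capture: Callable[[], bytes]) -> bytes:
    """Choose the image to pair with a voice input in the FIG. 5 scheme."""
    if from_other_district:
        if received_image is None:
            raise ValueError("no image received from the other district")
        return received_image                  # S35
    if speaker_id is not None and speaker_id in image_table:
        return image_table[speaker_id]         # S13 -> S15
    if use_standard_image:
        return standard_image                  # S29 -> S31
    return capture()                           # S29 -> S33
```

For a local voice input, whichever image is returned is also the one transmitted to the other districts at step S17.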

[0039] With the TV conference system 10 operating in this way, when the conference is viewed through the audio/image output device 40 or recorded in the conference record data memory 42, moving image data obtained by continuously filming the speaker is not used. Instead, when the speaker is identified, appropriate image data relating to the speaker is selected from the image data registered in advance in the data memory 22; when the speaker is not identified, the standard image data registered in advance in the data memory 22, or predetermined image data from the image creating unit 24, is used.

[0040] Because the image data used in this way is smaller than moving image data obtained by continuously filming the speaker, the delay of the image relative to the voice can be reduced when each TV conference unit 12 outputs image and voice from its audio/image output device 40, so the conference can be viewed without a sense of incongruity. In addition, the amount of traffic on the network 14 is greatly reduced, so other communication systems are not adversely affected; and because inter-district traffic in the TV conference is greatly reduced, a TV conference system 10 with fast response can be realized.
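The scale of the saving can be illustrated with a back-of-the-envelope comparison between streaming video continuously and sending one registered still image per utterance. All figures below (384 kbit/s video, 20 kB still images, 30 utterances in a one-hour meeting) are hypothetical and chosen only to show the arithmetic.

```python
def video_bytes(bitrate_bps: int, seconds: int) -> int:
    """Data sent by continuously streaming video for `seconds`."""
    return bitrate_bps * seconds // 8


def still_image_bytes(image_bytes: int, utterances: int) -> int:
    """Data sent when one registered still image accompanies each utterance."""
    return image_bytes * utterances


# Hypothetical figures for a one-hour meeting (not from the patent).
video = video_bytes(384_000, 3600)       # 172,800,000 bytes
stills = still_image_bytes(20_000, 30)   # 600,000 bytes
ratio = video // stills                  # 288
```

Under these assumptions the still-image approach sends 288 times less image data; the actual ratio depends entirely on the bitrate, image size and how often speakers change.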

[0041] Further, when voice data and image data are recorded in the conference record data memory 42, the conference can be recorded with an even smaller amount of data, consuming less memory, and minutes can be prepared easily. Despite the use of such small image data, participants can know exactly who is speaking in the conference. Moreover, by setting the image mode, the user can select the image data to be used when the speaker is not identified. If the image data from the image creating unit 24 is used in that case, image data useful for understanding the state of the conference, such as a view of the conference room, can be shown.

[0042] By setting the recording mode, the user can also choose whether to simply view the data while holding the conference or to record the data as well, and, when recording, whether to record only the voice data or both the voice data and the image data. The system can thus be used according to the user's needs.

[0043] Next, with reference to FIG. 6, the operation of each TV conference unit 12 will be described for the case where the district in which a voice is uttered identifies the speaker in advance by voice recognition and transmits the speaker's voice data and ID data to the other districts. When the power supply (not shown) of each TV conference unit 12 is turned on and the TV conference system 10 is started, the following are registered in the data memory 22: the voice data of the conference participants, the corresponding image data, the standard image data for an unidentified speaker, the ID data, the photographing conditions (for example, the photographing time) of the image creating means for an unidentified speaker, the image mode, and the recording mode. In addition, the photographing conditions of the image creating means for an unidentified speaker (for example, the shooting position and composition) are set (step S51). These data are preferably common to all districts, and at least the ID data identifying the speakers must be common to all districts. Data common to all districts can be registered in the same way as described for step S1 of FIG. 5.

[0044] It is then determined whether the registration and setting in step S51 are complete (step S53); if not, the unit waits until they are. Once step S51 is complete, it is determined whether there is a voice input (step S55). If there is none, the unit waits until there is one; if there is, it is determined whether the voice data is from another district (step S57). If it is not, that is, if it is voice data of the unit's own district, the voice data is transmitted to the other districts (step S59), and the voice recognition unit 36 performs voice recognition (step S61).

[0045] Using the result of the voice recognition, the data control unit 34 determines whether the speaker can be identified as one of the speakers registered in the data memory 22 (step S63). If so, the ID data of that speaker is loaded from the data memory 22 into the data control unit 34 (step S65) and transmitted to the other districts (step S67). The image data corresponding to the ID data is then loaded from the data memory 22 (step S69).

[0046] The data control unit 34 then determines the recording mode (step S71). In the recording mode "output data only, do not record", the data control unit 34 supplies the voice data and the corresponding image data to the audio/image output device 40, which outputs the voice and image (step S73). In the recording mode "output and record data", the data control unit 34 determines whether only the voice data is to be recorded (step S75). In the recording mode "record voice data only", the voice data and the corresponding image data are supplied to the audio/image output device 40 and the voice data is supplied to the conference record data memory 42; the voice and image are output from the audio/image output device 40, and the voice data is recorded in the conference record data memory 42 (step S77). In the recording mode "record voice data and image data", the data control unit 34 supplies the voice data and the corresponding image data to both the audio/image output device 40 and the conference record data memory 42; the audio/image output device 40 outputs the voice and image, and the conference record data memory 42 records the voice data and the image data (step S79).

[0047] On the other hand, if in step S63 the speaker cannot be identified as a registered speaker, it is determined whether to adopt the registered standard image data for an unidentified speaker, that is, the image mode is determined (step S81). In the image mode that adopts the standard image data, the ID data corresponding to the standard image data is loaded from the data memory 22 into the data control unit 34 (step S83), and the process proceeds to step S67. Processing then continues through steps S67 to S79 as described above.

[0048] If in step S81 the image mode is one that does not adopt the standard image data, that is, one that adopts image data created by the image creating means, the ID data corresponding to the image data from the image creating unit 24, which serves as the image for the unidentified speaker, is input to the data control unit 34 (step S85), and the process proceeds to step S67. Processing then continues through steps S67 to S79 as described above. If it is determined in step S57 that the voice data was input from another district, the ID data from that district is acquired (step S87) and the process proceeds to step S69. Processing then continues through steps S69 to S79 as described above.
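The ID-based scheme of FIG. 6 works only because every district maps the same IDs to images locally. The sketch below is a hypothetical Python illustration: the reserved IDs `STANDARD_ID` and `CREATED_ID` are invented for the two unidentified-speaker cases (the patent does not say how IDs for those cases are encoded, nor how a receiving district obtains the image-creating-unit image), and `ids_consistent` expresses the requirement in [0043] that the ID data be common to all districts.

```python
from typing import Dict, Iterable, Optional

STANDARD_ID = "__standard__"  # hypothetical reserved ID for the standard image
CREATED_ID = "__created__"    # hypothetical reserved ID for the image-creating-unit image


def id_to_send(speaker_id: Optional[str], use_standard_image: bool) -> str:
    """Sending side of FIG. 6 (steps S63, S65, S83, S85); `speaker_id` is a
    registered speaker's ID, or None if recognition did not identify anyone."""
    if speaker_id is not None:
        return speaker_id                                      # S65
    return STANDARD_ID if use_standard_image else CREATED_ID  # S83 / S85


def image_for_id(received_id: str, image_table: Dict[str, bytes]) -> bytes:
    """Step S69, reached locally via S67 or from another district via S87."""
    return image_table[received_id]


def ids_consistent(tables: Iterable[Dict[str, bytes]]) -> bool:
    """Check that every district's table holds the same set of IDs."""
    id_sets = [set(t) for t in tables]
    return all(s == id_sets[0] for s in id_sets)
```

Note that the images themselves may differ between districts; only the ID set has to match.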

[0049] The TV conference system 10 operating in this way provides the same effects as the operation shown in FIG. 5. In addition, because ID data, which is smaller than image data, is transmitted from the district where the voice is uttered to the other districts, network traffic is reduced further, and the effect on other communication systems becomes smaller still.

[0050] Next, with reference to FIG. 7, the operation of each TV conference unit 12 will be described for the case where only the voice data is transmitted from the district where a voice is uttered to the other districts, and each district performs voice recognition itself. When the power supply (not shown) of each TV conference unit 12 is turned on and the TV conference system 10 is started, the following are registered in the data memory 22: the voice data of the conference participants, the corresponding image data, the standard image data for an unidentified speaker, the photographing conditions (for example, the photographing time) of the image creating means for an unidentified speaker, the image mode, and the recording mode. In addition, the photographing conditions of the image creating means for an unidentified speaker (for example, the shooting position and composition) are set (step S101). These data are preferably common to all districts. Data common to all districts can be registered in the same way as described for step S1 of FIG. 5.

[0051] It is then determined whether the registration and setting in step S101 are complete (step S103); if not, the unit waits until they are. Once step S101 is complete, it is determined whether there is a voice input (step S105). If there is none, the unit waits until there is one; if there is, it is determined whether the voice data is from another district (step S107). If it is not, that is, if it is voice data of the unit's own district, the voice data is transmitted to the other districts (step S109), and the voice recognition unit 36 performs voice recognition (step S111). If it is determined in step S107 that the voice data is from another district, the process proceeds directly to step S111.
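The distinctive part of FIG. 7 is that only local voice is forwarded, while recognition runs on every voice input, local or remote. A minimal Python sketch, where `send_to_others` and `recognize` are hypothetical callables standing in for the data transmitting/receiving unit 38 and the voice recognition unit 36:

```python
from typing import Callable, Optional


def handle_voice(voice: bytes, from_other_district: bool,
                 send_to_others: Callable[[bytes], None],
                 recognize: Callable[[bytes], Optional[str]]) -> Optional[str]:
    """FIG. 7, steps S107-S111: forward only local voice, recognize every voice."""
    if not from_other_district:
        send_to_others(voice)  # S109: only voice data goes on the network
    return recognize(voice)    # S111: runs in every district for every voice
```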

[0052] Using the result of the voice recognition, the data control unit 34 determines whether the speaker can be identified as one of the speakers registered in the data memory 22 (step S113). If so, the image data of that speaker is loaded from the data memory 22 into the data control unit 34 (step S115).

[0053] The data control unit 34 then determines the recording mode (step S117). In the recording mode "output data only, do not record", the data control unit 34 supplies the voice data and the corresponding image data to the audio/image output device 40, which outputs the voice and image (step S119). In the recording mode "output and record data", the data control unit 34 determines whether only the voice data is to be recorded (step S121). In the recording mode "record voice data only", the voice data and the corresponding image data are supplied to the audio/image output device 40 and the voice data is supplied to the conference record data memory 42; the voice and image are output from the audio/image output device 40, and the voice data is recorded in the conference record data memory 42 (step S123). In the recording mode "record voice data and image data", the data control unit 34 supplies the voice data and the corresponding image data to both the audio/image output device 40 and the conference record data memory 42; the audio/image output device 40 outputs the voice and image, and the conference record data memory 42 records the voice data and the image data (step S125).

[0054] On the other hand, if in step S113 the speaker cannot be identified as a registered speaker, it is determined whether to adopt the registered standard image data for an unidentified speaker, that is, the image mode is determined (step S127). In the image mode that adopts the standard image data, the standard image data is loaded from the data memory 22 into the data control unit 34 (step S129), and the process proceeds to step S117. Processing then continues through steps S117 to S125 as described above.

[0055] If in step S127 the image mode is one that does not adopt the standard image data, that is, one that adopts image data created by the image creating means, the image data from the image creating unit 24 is input to the data control unit 34 as the image data for the unidentified speaker (step S131), and the process proceeds to step S117. Processing then continues through steps S117 to S125 as described above. The TV conference system 10 operating in this way provides the same effects as the operation shown in FIG. 5; in addition, because only voice data is transmitted from the district where the voice is uttered to the other districts, network traffic can be reduced dramatically, and the effect on other communication systems becomes smaller still.
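The traffic ranking of the three schemes can be made concrete per utterance. This Python sketch assumes the speaking district sends a separate copy to each other district, and every size used in the usage note is hypothetical.

```python
def bytes_per_utterance(scheme: str, voice_bytes: int, image_bytes: int,
                        id_bytes: int, districts: int) -> int:
    """Bytes leaving the speaking district for one utterance, assuming a
    separate copy is sent to each of the other districts."""
    extra = {"fig5": image_bytes, "fig6": id_bytes, "fig7": 0}[scheme]
    return (voice_bytes + extra) * (districts - 1)
```

With 8,000 bytes of voice, a 20,000-byte image, a 4-byte ID and three districts, this gives 56,000 bytes for FIG. 5, 16,008 for FIG. 6 and 16,000 for FIG. 7. The small gap between FIG. 6 and FIG. 7 reflects that, once images are no longer sent, voice data dominates; FIG. 7 instead pays with voice recognition in every district.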

[0056] Next, FIG. 8 shows a case where a TV conference is held among three districts α, β and γ using a client/server configuration. The TV conference system 100 shown in FIG. 8 has a TV conference unit 102 serving as a client in each of districts α, β and γ and, separately, a control unit 104 serving as a server. Each TV conference unit 102 and the control unit 104 are connected via a network 106. Each TV conference unit 102 includes a unit main body 108, consisting of a computer such as a personal computer or a workstation, and a microphone 110 connected to the unit main body 108.

[0057] The unit main body 108 includes a data memory 112, for which a built-in memory or a PC card, for example, is used. In addition to a program for controlling the operation of the TV conference unit 102, the data memory 112 holds the same voice data, image data, ID data and recording mode as the data memory 22 shown in FIG. 1. The unit main body 108 also has a recording mode setting unit 114, which is configured in the same way as the recording mode setting unit 26 and is used to set the recording mode.

[0058] The unit main body 108 further has a voice input unit 116 that receives audio from the microphone 110. The voice input unit 116 is configured in the same way as the voice input unit 30 shown in FIG. 1 and outputs the speaker's digitized voice data. This voice data passes through a data control unit 118, implemented by a CPU, and a data transmitting/receiving unit 120. It is then transmitted onto the network 106 and delivered to the control unit 104. The data transmitting/receiving unit 120 also receives voice data and ID data from the control unit 104 via the network 106 and passes the ID data to the data control unit 118. The data control unit 118 loads the image data corresponding to that ID data from the data memory 112 and passes it to the data transmitting/receiving unit 120.

[0059] Depending on the recording mode, the data transmitting/receiving unit 120 supplies the received image data and audio data to an audio/image output device 122 and a conference record data memory 124. These are configured in the same way as the audio/image output device 40 and the conference record data memory 42, respectively. The control unit 104 is composed of a computer such as a personal computer or workstation and functions as a server.

[0060] The control unit 104 has a data memory 126, for which a built-in memory or a PC card, for example, may be used. The data memory 126 stores a program for controlling the operation of the control unit 104. It also registers the voice data of the conference participants and the corresponding ID data. The control unit 104 further has a voice/ID transmitting/receiving unit 128. This unit receives voice data via the network 106 from the district where the voice was uttered. It then supplies the voice data to a voice recognition unit 134 via a speaker identification control unit 132 of a CPU 130. The voice recognition unit 134 is configured in the same way as the voice recognition unit 36 shown in FIG. 1. According to the recognition result, the speaker identification control unit 132 loads the corresponding ID data from the data memory 126. That ID data is transmitted, together with the voice data, from the voice/ID transmitting/receiving unit 128 to each district via the network 106.

[0061] The operation of the TV conference system 100 configured in this way will be described with reference to FIGS. 9 and 10. First, the operation on the client side, that is, of the TV conference unit 102 in each district, will be described with reference to FIG. 9. The power supplies (not shown) of each TV conference unit 102 and the control unit 104 are turned on, and the TV conference system 100 is started. The following are then registered in the data memory 112 (step S201): the voice data of the conference participants, the corresponding image data, standard image data for an unidentified speaker, ID data, and the recording mode.

[0062] These data and the mode can be registered in the same manner as in step S1 of FIG. 5. At least the ID data must be common to the TV conference units 102 in all districts and the control unit 104. One way of registering the ID data is as follows. Common ID data is first determined and registered between the TV conference unit 102 of one district and the control unit 104. That ID data is then also registered in the TV conference units 102 of the other districts.

[0063] Next, it is determined whether the registration in step S201 has been completed (step S203). If it has not, the unit waits until it is. Once step S201 is complete, it is determined whether there is voice input (step S205). If there is none, the unit waits for voice input. When there is voice input, it is determined whether the voice data comes from the server, that is, the control unit 104 (step S207). If the voice data does not come from the server, it is voice data from the unit's own district. In that case, the voice data is transmitted to the server via the network 106 (step S209), and the process returns to step S205. In response, the server performs the operation of FIG. 10 described later.

[0064] If it is determined in step S207 that the data is voice data from the server, it is next determined whether ID data corresponding to the voice data has been received from the server (step S211). If the ID data has been received, the image data corresponding to that ID data is loaded from the data memory 112 into the data control unit 118 (step S213). If the ID data is the unidentified-speaker ID data, the standard image data is loaded instead.
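
The client-side branching of steps S207 to S213 can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the patent's implementation. The `Message` record, the `"UNKNOWN"` value chosen for the unidentified-speaker ID data, and the `send_to_server` callback are all hypothetical. The dictionary `image_table` stands in for the data memory 112.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical value for the ID data registered for an unidentified speaker;
# the patent says such ID data may exist but does not define its value.
UNIDENTIFIED_ID = "UNKNOWN"


@dataclass
class Message:
    voice: bytes
    from_server: bool           # True if relayed by the control unit 104
    speaker_id: Optional[str]   # ID data sent along with the voice, if any


def handle_voice(msg, image_table, standard_image, send_to_server):
    """Return (voice, image) for output, or None if the voice was forwarded.

    image_table maps registered ID data to stored image data (data memory 112).
    ID data is assumed to be common to all units, as required in [0062].
    """
    if not msg.from_server:
        # S209: voice from the unit's own district goes to the server.
        send_to_server(msg.voice)
        return None
    if msg.speaker_id is None:
        # S211, no ID data received: the voice is processed on its own.
        return msg.voice, None
    if msg.speaker_id == UNIDENTIFIED_ID:
        # S213: the unidentified-speaker ID selects the standard image.
        return msg.voice, standard_image
    # S213: a registered speaker's ID selects that speaker's stored image.
    return msg.voice, image_table[msg.speaker_id]
```

Only small stored images move in this flow. The server relays voice plus a short ID, and each client looks up the picture locally. This reflects the low-traffic aim of the system.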

[0065] The data control unit 118 then determines the recording mode (step S215).

In the "output data only, do not record" recording mode, the data control unit 118 supplies the voice data and the corresponding image data to the audio/image output device 122. That device outputs the audio and image (step S217).

In the "output and record data" recording mode, the data control unit 118 determines whether only the voice data is to be recorded (step S219):

In the "record voice data only" recording mode, the voice data and the corresponding image data are supplied to the audio/image output device 122, and the voice data is supplied to the conference record data memory 124. The audio and image are output from the audio/image output device 122, and the voice data is recorded in the conference record data memory 124 (step S221).

In the "record voice data and image data" recording mode, the data control unit 118 supplies the voice data and the corresponding image data to both the audio/image output device 122 and the conference record data memory 124. The audio and image are output from the audio/image output device 122. The voice data and image data are recorded in the conference record data memory 124 (step S223).

[0066] On the other hand, if the ID data cannot be received in step S211, the process proceeds to step S215. Only the voice data is then processed in steps S217 to S223, according to the recording mode.
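
The recording-mode branching of steps S215 to S223, including the voice-only case above, amounts to a small dispatch. In this hypothetical sketch, the two-stage decision of steps S215 and S219 is flattened into three enum values. Plain Python lists stand in for the audio/image output device 122 and the conference record data memory 124.

```python
from enum import Enum


class RecordingMode(Enum):
    OUTPUT_ONLY = "output data only, do not record"
    RECORD_VOICE = "record voice data only"
    RECORD_VOICE_AND_IMAGE = "record voice data and image data"


def dispatch(mode, voice, image, output, minutes):
    """Route one utterance according to the recording mode (S215 to S223).

    output:  list standing in for the audio/image output device.
    minutes: list standing in for the conference record data memory.
    image may be None when no ID data was received (S211).
    """
    output.append((voice, image))            # every mode outputs audio/image
    if mode is RecordingMode.OUTPUT_ONLY:    # S217: nothing is recorded
        return
    if mode is RecordingMode.RECORD_VOICE:   # S219 -> S221: voice only
        minutes.append((voice, None))
    else:                                    # S223: voice and image
        minutes.append((voice, image))
```

Every mode outputs the data, so the output step comes before any branching. The modes differ only in what, if anything, is written to the minutes.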

[0067] Next, the operation on the server side, that is, of the control unit 104, will be described with reference to FIG. 10. The power supplies (not shown) of each TV conference unit 102 and the control unit 104 are turned on, and the TV conference system 100 is started. The voice data of the conference participants and the corresponding ID data are then registered in the data memory 126 (step S251). These data can be registered in the same manner as in step S1 of FIG. 5.

[0068] Next, it is determined whether the registration in step S251 has been completed (step S253). If it has not, the server waits until it is. Once step S251 is complete, it is determined whether there is voice input (step S255). If there is none, the server waits for voice input. When there is voice input, the voice recognition unit 134 performs voice recognition (step S257). Using the recognition result, the speaker identification control unit 132 determines whether the speaker can be identified as one of the speakers registered in the data memory 126 (step S259). If so, that speaker's ID data is loaded from the data memory 126 into the speaker identification control unit 132 (step S261). The voice data and the ID data are then transmitted to each client via the network 106 (step S263), and the process returns to step S255.

[0069] On the other hand, if in step S259 the speaker cannot be identified as a registered speaker, it is determined whether unidentified-speaker ID data is registered in the data memory 126 (step S265). If it is, the unidentified-speaker ID data is loaded from the data memory 126 into the speaker identification control unit 132 (step S267), and the process proceeds to step S263. If no unidentified-speaker ID data is registered in step S265, only the voice data is transmitted to each client via the network 106 (step S269). The process then returns to step S255. Each client performs the operation of FIG. 9 described above, according to what the server transmits.
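
One pass of the server loop after voice arrives (steps S257 to S269) can be sketched as below. This is an illustrative Python sketch only. The `recognize` callable is a hypothetical stand-in for the voice recognition unit 134 and returns a speaker name or `None`. The `broadcast` callback represents transmission to every client over the network 106.

```python
def server_step(voice, recognize, registered, unidentified_id, broadcast):
    """Handle one utterance on the control unit 104 (S257 to S269).

    recognize:       hypothetical recognizer; returns a speaker name or None.
    registered:      speaker name -> ID data (data memory 126).
    unidentified_id: ID data registered for an unidentified speaker,
                     or None if no such ID data is registered.
    broadcast:       sends (voice, id_or_None) to every client.
    """
    name = recognize(voice)                   # S257
    if name in registered:                    # S259: registered speaker?
        broadcast(voice, registered[name])    # S261, S263
    elif unidentified_id is not None:         # S265
        broadcast(voice, unidentified_id)     # S267, S263
    else:
        broadcast(voice, None)                # S269: voice data only
```

The three `broadcast` outcomes correspond to the three client cases handled in FIG. 9: a registered speaker's image, the standard image, or voice with no image.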

[0070] In the TV conference system 100, voice recognition is performed only by the server, that is, the control unit 104, and not by the clients, that is, the TV conference units 102. The TV conference system 100 can therefore be built from TV conference units 102 of simple configuration. As with the TV conference system 10 described above, the TV conference system 100 also obtains three effects: those of using image data with a small data amount, of using ID data, and of setting a recording mode.

[0071] A lens 18, an image creating unit 24, and an image mode setting unit 26, similar to those of the TV conference system 10, may be added to the TV conference system 100. Image data created by the image creating means can then also be used when the speaker is not identified. As with the TV conference system 10, the effect of setting an image mode is then obtained. In the TV conference system 100 described above, the control unit 104 transmits ID data to each TV conference unit 102 based on the voice recognition result. It may instead transmit image data based on the voice recognition result. In that case, the image data must be registered in the data memory 126 of the control unit 104.

[0072] Furthermore, the TV conference system 10 shown in FIG. 1 may be used as follows to hold a TV conference among the three districts α, β, and γ. One of the TV conference units 12 in districts α, β, and γ is designated to perform voice recognition, and the other TV conference units 12 do not perform it. That is, one TV conference unit 12 serves as both a client and a server, and the other TV conference units 12 function only as clients. The TV conference unit 12 that serves as both client and server is hereinafter called the server/TV conference unit 12a.

[0073] Accordingly, when a voice is uttered in a district, its voice data is transmitted from that district's TV conference unit 12 to the server/TV conference unit 12a. The server/TV conference unit 12a then performs voice recognition. Based on the result, it transmits data to the TV conference units 12 of all the other districts.
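
The routing rule for voice uttered in a unit's own district can be illustrated with a short, hypothetical sketch. It assumes each unit knows which district hosts the server/TV conference unit 12a, as set in steps S301 and S351. The callback names are illustrative and do not come from the patent.

```python
def route_local_voice(my_district, server_district, voice,
                      recognize_and_broadcast, send):
    """Decide where voice uttered in this unit's own district goes.

    server_district: district whose unit is the server/TV conference
    unit 12a. The callbacks are hypothetical placeholders.
    """
    if my_district == server_district:
        # Unit 12a recognizes the voice itself and sends data to all
        # the other districts.
        recognize_and_broadcast(voice)
    else:
        # Ordinary client units never run recognition; they forward
        # the voice to unit 12a (step S309).
        send(server_district, voice)
```

Because the server role is only a designation, no separate control unit is needed. Any unit could in principle be designated as unit 12a.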

[0074] A TV conference operated in this way can be held using the TV conference system 10 shown in FIG. 1, so description of its configuration is omitted. The operation in this case will be described with reference to FIGS. 11 and 12. First, the operation of a TV conference unit 12 functioning as a client will be described with reference to FIG. 11. The power supplies (not shown) of each TV conference unit 12 and the server/TV conference unit 12a are turned on, and the TV conference system 10 is started. The following are then registered in the data memory 22: the voice data of the conference participants, the corresponding image data, standard image data for an unidentified speaker, ID data, shooting conditions for the image creating means when the speaker is not identified (for example, shooting time), the image mode, and the recording mode. In addition, further shooting conditions for the image creating means when the speaker is not identified (for example, shooting position and composition) are set, as is the server/TV conference unit 12a (step S301).

[0075] These data, modes, and other settings can be registered and set in the same manner as in step S1 of FIG. 5. It is then determined whether the registration and setting in step S301 have been completed (step S303). If they have not, the unit waits until they are. Once step S301 is complete, it is determined whether there is voice input (step S305). If there is none, the unit waits for voice input. When there is voice input, it is determined whether it is voice data from the server/TV conference unit 12a (step S307). If it is not, it is voice data from the unit's own district. In that case, the voice data is transmitted to the server/TV conference unit 12a via the network 106 (step S309), and the process returns to step S305. In response, the server/TV conference unit 12a performs the operation of FIG. 12 described later.

[0076] If in step S307 the voice data is from the server/TV conference unit 12a, it is determined whether ID data corresponding to the voice data has been received from that unit (step S311). If the ID data has been received, the corresponding image data is loaded from the data memory 22 into the data control unit 34 (step S313). At this time, if the speaker has been identified as a speaker registered in the data memory 22, that speaker's image data is loaded from the data memory 22.

[0077] If, on the other hand, the speaker could not be identified as a registered speaker, the image mode is checked to decide whether to use the registered standard image data for an unidentified speaker. In the image mode that uses the standard image data, that data is loaded from the data memory 22 into the data control unit 34. The other image mode does not use the standard image data but uses image data created by the image creating means. In that mode, the image data from the image creating unit 24 is input to the data control unit 34 as the image data for the unidentified speaker.
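
The image choice of [0076] and [0077] reduces to a three-way decision, sketched below in hypothetical Python. The `capture` callable stands in for the image creating unit 24, for example one shot taken under the registered shooting conditions. The sketch assumes the unidentified-speaker ID data has no entry in `image_table`.

```python
from enum import Enum


class ImageMode(Enum):
    STANDARD = "use registered standard image"
    CREATED = "use image from image creating unit 24"


def image_for(speaker_id, image_table, mode, standard_image, capture):
    """Pick the image data to show for the received ID data.

    capture: hypothetical callable standing in for the image creating
    unit 24; it is invoked only when the image mode calls for it.
    """
    if speaker_id in image_table:      # registered speaker identified
        return image_table[speaker_id]
    if mode is ImageMode.STANDARD:     # standard image, unidentified speaker
        return standard_image
    return capture()                   # image created on the spot
```

Deferring `capture()` to the last branch means the image creating means runs only when the speaker is unidentified and the image mode selects it.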

[0078] The data control unit 34 then determines the recording mode (step S315).

In the "output data only, do not record" recording mode, the data control unit 34 supplies the voice data and the corresponding image data to the audio/image output device 40. That device outputs the audio and image (step S317).

In the "output and record data" recording mode, the data control unit 34 determines whether only the voice data is to be recorded (step S319):

In the "record voice data only" recording mode, the voice data and the corresponding image data are supplied to the audio/image output device 40, and the voice data is supplied to the conference record data memory 42. The audio and image are output from the audio/image output device 40, and the voice data is recorded in the conference record data memory 42 (step S321).

In the "record voice data and image data" recording mode, the data control unit 34 supplies the voice data and the corresponding image data to both the audio/image output device 40 and the conference record data memory 42. The audio and image are output from the audio/image output device 40. The voice data and image data are recorded in the conference record data memory 42 (step S323).

[0079] On the other hand, if the ID data cannot be received in step S311, the process proceeds to step S315. Only the voice data is then processed in steps S317 to S323, according to the recording mode.

[0080] Next, the operation of the server/TV conference unit 12a will be described with reference to FIG. 12. The power supplies (not shown) of each TV conference unit 12 and the server/TV conference unit 12a are turned on, and the TV conference system 10 is started. The following are then registered in the data memory 22: the voice data of the conference participants, the corresponding image data, standard image data for an unidentified speaker, ID data, shooting conditions for the image creating means when the speaker is not identified (for example, shooting time), the image mode, and the recording mode. In addition, further shooting conditions for the image creating means when the speaker is not identified (for example, shooting position and composition) are set, as is the server/TV conference unit 12a (step S351). These data and modes can be registered and set in the same manner as in step S1 of FIG. 6.

[0081] It is then determined whether the registration and setting in step S351 have been completed (step S353). If they have not, the unit waits until they are. Once step S351 is complete, it is determined whether there is voice input (step S355). If there is none, the unit waits for voice input. When there is voice input, the voice recognition unit 36 performs voice recognition (step S357). Using the recognition result, the data control unit 34 determines whether the speaker can be identified as a speaker registered in the data memory 22 (step S359). If so, that speaker's ID data is loaded from the data memory 22 into the data control unit 34 (step S361). The voice data and the ID data are then transmitted to the TV conference units 12 in the other districts via the network 14 (step S363).

[0082] The image data corresponding to the ID data is then loaded from the data memory 22 (step S365). If the ID data identifies a specific speaker, the image data of that speaker is loaded from the data memory 22.

[0083] The data control unit 34 then determines the recording mode (step S367).

In the "output data only, do not record" recording mode, the data control unit 34 supplies the voice data and the corresponding image data to the audio/image output device 40. That device outputs the audio and image (step S369).

In the "output and record data" recording mode, the data control unit 34 determines whether only the voice data is to be recorded (step S371):

In the "record voice data only" recording mode, the voice data and the corresponding image data are supplied to the audio/image output device 40, and the voice data is supplied to the conference record data memory 42. The audio and image are output from the audio/image output device 40, and the voice data is recorded in the conference record data memory 42 (step S373).

In the "record voice data and image data" recording mode, the data control unit 34 supplies the voice data and the corresponding image data to both the audio/image output device 40 and the conference record data memory 42. The audio and image are output from the audio/image output device 40. The voice data and image data are recorded in the conference record data memory 42 (step S375).

[0084] On the other hand, if in step S359 the speaker cannot be identified as a registered speaker, it is determined whether unidentified-speaker ID data is registered in the data memory 22 (step S377). If it is, the unidentified-speaker ID data is loaded from the data memory 22 into the data control unit 34 (step S379). The process then proceeds to step S363, and processing is performed in steps S363 to S375 as described above.

[0085] In this case, the corresponding image data is loaded from the data memory 22 in step S365. If the ID data is the unidentified-speaker ID data, the image mode is checked to decide whether to use the registered standard image data. In the image mode that uses the standard image data, the standard image data for an unidentified speaker is loaded from the data memory 22 into the data control unit 34. The other image mode does not use the standard image data but uses image data created by the image creating means. In that mode, the image data from the image creating unit 24 is input to the data control unit 34 as the image data for the unidentified speaker.

[0086] If no unidentified-speaker ID data is registered in step S377, only the voice data is transmitted to the TV conference units 12 in the other districts via the network 106 (step S381). The process then proceeds to step S367, and processing is performed in steps S367 to S375 as described above.

[0087] In the TV conference system 10 operating in this way, a particular TV conference unit 12 is designated as the server/TV conference unit 12a. This eliminates the need to provide a separate control unit to act as the server. The TV conference system 10 operating in this way also obtains the same effects as when the operation of FIG. 5 is performed. In the TV conference system 10 performing the operations of FIGS. 11 and 12, the server/TV conference unit 12a transmits ID data based on the voice recognition result to each TV conference unit 12. It may instead transmit image data based on the voice recognition result.

[0088] As shown in FIG. 13, the audio/image output device may display images in a particular way. An area is provided on the screen for each district, and a speaker image is output only in the area for the district where the speaker is. FIG. 13 shows a situation in which Mr. A is speaking in district α and no one is speaking in the other districts. The "standard image for no voice" is output in the screen areas for districts β and γ. The embodiments described above are TV conference systems that include both an audio/image output device and a conference record data memory. Both need not be provided, however; it is sufficient to provide at least the audio/image output device.
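
The per-district screen layout of FIG. 13 can be sketched as a simple mapping from district to displayed image. This is a hypothetical Python illustration. The argument names and the idea of tracking the speaking districts as a set are assumptions, not details from the patent.

```python
def compose_screen(districts, speaking, speaker_images, silent_image):
    """Return the image shown in each district's screen area.

    speaking:       set of districts where someone is currently speaking.
    speaker_images: district -> current speaker's image data.
    silent_image:   the "standard image for no voice".
    """
    return {
        d: speaker_images[d] if d in speaking else silent_image
        for d in districts
    }
```

For the FIG. 13 situation, `compose_screen(["alpha", "beta", "gamma"], {"alpha"}, {"alpha": "Mr. A"}, "no-voice")` would map alpha to Mr. A's image and both beta and gamma to the no-voice image.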

[0089] Further, in each of the embodiments described above, voice data and the corresponding image data may be associated by time data obtained by measuring the speaking time. An image of a given speaker is then output while that speaker is speaking. When the speaker changes, the output image switches accordingly. The time data thus allows the voice and the image to be matched accurately, so the image switches with each change of speaker.
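
Associating images with voice by speaking-time data can be illustrated with a sorted list of utterance start times and a binary search. This is one possible realization, assumed for illustration. The patent does not say how the time data is stored, so the `(start_seconds, image)` segment format here is hypothetical.

```python
from bisect import bisect_right


def image_at(t, segments):
    """Image to show at time t, using speaking-time data.

    segments: list of (start_seconds, image), sorted by start time, one per
    utterance (the measured start of each speaker's speaking time).
    Returns None before the first utterance.
    """
    starts = [s for s, _ in segments]
    # Index of the last utterance that started at or before t.
    i = bisect_right(starts, t) - 1
    return segments[i][1] if i >= 0 else None
```

With segments `[(0.0, "A"), (12.5, "B"), (30.0, "A")]`, the image switches from A to B at exactly 12.5 s. It then switches back to A at 30 s, following each change of speaker.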

【００９０】[0090]

To associate the audio data with the image data, the same number may, for example, be assigned to corresponding audio data and image data. This number is then referenced when the audio data and image data are output and recorded, so that they are output and recorded in exact correspondence. Alternatively, the audio data and image data may simply be arranged in input order without any special associating information, and the image data may be switched each time the audio data switches (that is, each time the speaker changes). Audio data and image data can thus be associated by various means.
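The two association methods described above can be sketched as follows. This is a hypothetical illustration: voice data is modelled as `bytes`, image data as file names, and the function names are invented for the example.

```python
from __future__ import annotations


def pair_by_number(voice: dict[int, bytes],
                   images: dict[int, str]) -> list[tuple[int, bytes, str]]:
    """Shared-number method: join voice and image records on their number."""
    return [(n, voice[n], images[n]) for n in sorted(voice) if n in images]


def pair_by_input_order(voice: list[bytes],
                        images: list[str]) -> list[tuple[bytes, str]]:
    """Input-order method: no tag at all. Each new voice item (a change of
    speaker) advances to the next image, so the k-th voice item is output
    with the k-th image."""
    if len(voice) != len(images):
        raise ValueError("voice and image sequences are out of step")
    return list(zip(voice, images))


print(pair_by_number({2: b"v2", 1: b"v1"}, {1: "mr_a.jpg", 2: "mr_b.jpg"}))
# [(1, b'v1', 'mr_a.jpg'), (2, b'v2', 'mr_b.jpg')]
```

The input-order method needs no extra data, but every later pair shifts if a single item is lost. The sketch therefore refuses sequences of unequal length rather than pairing them silently.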

【００９１】[0091]

Further, the image creating means is not limited to creating image data by photographing; it may create image data by any method. In addition, during the conference, each speaker may be photographed in the district where the voice was uttered to create image data of that speaker. That image data is then associated with the speaker's voice data and transmitted to each district, where the audio/image output device outputs the voice and image and the conference record data memory records the voice data and image data. To keep the data volume down, this image data is preferably still image data or the like. The standard image data may also be stored in a memory separate from the data memory 22 (112).

【００９２】[0092]

[Effects of the Invention]

According to the invention of claim 1, the image data used is small image data stored in advance in the storage means, so the image lags only slightly behind the voice. Network traffic is also reduced. As a result, even when the network is built on ordinary telephone lines rather than dedicated lines, the effect on other communication systems is kept small. The invention of claim 2 provides the same effects as claim 1. In addition, ID data rather than the image data corresponding to the voice data is transmitted to the other districts, so network traffic, and with it the effect on other communication systems, is reduced further.

【００９３】[0093]

The invention of claim 3 provides the same effects as claim 1. In addition, only voice data is transmitted, with neither image data nor ID data, so network traffic and the effect on other communication systems are reduced still further. The invention of claim 4 also provides the same effects as claim 1. In addition, the TV conference units need not perform voice recognition, so the TV conference units can be simplified.
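The traffic ranking of claims 1 to 3 can be made concrete with a back-of-the-envelope calculation. This is a rough sketch with assumed sizes: 64 kbit/s voice (8,000 bytes/s), a 20 kB pre-stored still image and a 2-byte speaker ID. The patent gives no figures, so only the ordering of the results is meaningful.

```python
# Illustrative sizes only; the patent gives no figures.
VOICE_BYTES_PER_SEC = 8_000  # assumed 64 kbit/s telephone-quality voice
STILL_IMAGE_BYTES = 20_000   # assumed small pre-stored still image
ID_BYTES = 2                 # assumed speaker ID number


def bytes_per_utterance(claim: int, seconds: float, n_districts: int) -> int:
    """Bytes the speaking district's unit sends for one utterance under the
    arrangements of claims 1 to 3 (one copy to each other district)."""
    voice = int(VOICE_BYTES_PER_SEC * seconds)
    per_copy = {
        1: voice + STILL_IMAGE_BYTES,  # claim 1: image data + voice data
        2: voice + ID_BYTES,           # claim 2: ID data + voice data
        3: voice,                      # claim 3: voice data only
    }[claim]
    return per_copy * (n_districts - 1)


for claim in (1, 2, 3):
    print(claim, bytes_per_utterance(claim, seconds=10, n_districts=3))
# 1 200000
# 2 160004
# 3 160000
```

Under these assumptions, the step from claim 2 to claim 3 saves only a few bytes per utterance, because an ID is tiny. The larger saving in every case comes from sending a pre-stored still image or an ID instead of continuous video.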

【００９４】[0094]

According to the invention of claim 5, even when the speaker is unspecified, the user's image mode setting selects either the "standard image data" or "suitable image data from the image creating means" as the image data. Even when the speaker is unspecified, therefore, an image matching the user's wishes can be output. According to the invention of claim 6, the user's recording mode setting determines whether the voice and images are only output or the data is also recorded. Further, according to the invention of claim 7, the user can set whether to record only the voice data or both the voice data and the image data. Data can thus be processed as the user requires.
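The image mode of claim 5 and the recording modes of claims 6 and 7 can be sketched as two small decision functions. This is a hypothetical sketch. `ImageMode`, `RecordMode`, `choose_image` and `process` are invented names, images are modelled as file names, and `create_image` stands in for the image creating means (for example, a camera still).

```python
from __future__ import annotations

from enum import Enum
from typing import Callable, Optional


class ImageMode(Enum):
    """Image mode (claim 5): what to show when the speaker is unspecified."""
    STANDARD = 1  # use the pre-stored standard image data
    CREATED = 2   # use image data from the image creating means


class RecordMode(Enum):
    """Recording mode (claims 6 and 7)."""
    OUTPUT_ONLY = 1             # output voice and image, record nothing
    RECORD_VOICE = 2            # also record the voice data only
    RECORD_VOICE_AND_IMAGE = 3  # also record both voice and image data


def choose_image(speaker_image: Optional[str], mode: ImageMode,
                 standard_image: str, create_image: Callable[[], str]) -> str:
    """Pick the image data to accompany the current voice data."""
    if speaker_image is not None:  # speaker was identified
        return speaker_image
    if mode is ImageMode.STANDARD:
        return standard_image
    return create_image()  # e.g. a still frame from the camera


def process(voice: bytes, image: str, mode: RecordMode,
            record: list[tuple[bytes, Optional[str]]]) -> tuple[bytes, str]:
    """Append to the conference record as the mode requires, then return
    the pair to be sent to the audio/image output device."""
    if mode is RecordMode.RECORD_VOICE:
        record.append((voice, None))
    elif mode is RecordMode.RECORD_VOICE_AND_IMAGE:
        record.append((voice, image))
    return voice, image
```

Output always happens. The recording mode only decides what is appended to the conference record, which mirrors the statement that at least the audio/image output device must be present.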

【００９５】[0095]

According to the invention of claim 8, the image data used is small image data stored in advance, so the image lags only slightly behind the voice. Network traffic is also reduced. As a result, even when the network is built on ordinary telephone lines rather than dedicated lines, the effect on other communication systems is kept small. The invention of claim 9 provides the same effects as claim 8. In addition, ID data rather than the image data corresponding to the voice data is transmitted to the other districts, so network traffic, and with it the effect on other communication systems, is reduced further.

[Brief description of the drawings]

FIG. 1 is a block diagram showing a TV conference system according to an embodiment of the present invention.

FIG. 2 is an illustrative view showing an example of the image data registered in the data memory.

FIG. 3 is a block diagram for explaining the voice recognition process in the voice recognition unit.

FIG. 4 is an illustrative view showing an example of an image output from the audio/image output device.

FIG. 5 is a flowchart showing an example of the operation of the TV conference system of FIG. 1.

FIG. 6 is a flowchart showing an example of the operation of the TV conference system of FIG. 1.

FIG. 7 is a flowchart showing an example of the operation of the TV conference system of FIG. 1.

FIG. 8 is a block diagram showing a TV conference system according to another embodiment of the present invention.

FIG. 9 is a flowchart showing an example of the operation of a TV conference unit of the TV conference system of FIG. 8.

FIG. 10 is a flowchart showing an example of the operation of the control unit of the TV conference system of FIG. 8.

FIG. 11 is a flowchart showing an example of the operation of a TV conference unit of the TV conference system of FIG. 1.

FIG. 12 is a flowchart showing an example of the operation of the TV conference unit that also serves as a server in the TV conference system of FIG. 1.

FIG. 13 is an illustrative view showing another example of an image output from the audio/image output device.

[Explanation of symbols]

10, 100 TV conference system
12, 102 TV conference unit
12a TV conference unit also serving as a server
14, 106 Network
18 Lens
20, 110 Microphone
22, 112, 126 Data memory
24 Image creation unit
26 Image mode setting unit
28, 114 Recording mode setting unit
30, 116 Voice input unit
34, 118 Data control unit
36, 134 Voice recognition unit
38, 120 Data transmission/reception unit
40, 122 Audio/image output device
42, 124 Conference record data memory
104 Control unit
128 Voice/ID transmission/reception unit
132 Speaker identification control unit

Claims

1. A TV conference system for performing a TV conference between a plurality of districts using a plurality of TV conference units connected to a network, wherein each of the TV conference units comprises: storage means for storing first voice data and image data for each participant of the conference; speaker identifying means for identifying, when a voice is emitted in its own district, the speaker who emitted the voice based on second voice data corresponding to the voice and the first voice data stored in the storage means; data transmitting means for transmitting the image data and the second voice data of the identified speaker to the TV conference units in all other districts; data receiving means for receiving, when a voice is emitted in another district, the image data and the second voice data from the TV conference unit in that district; and audio/image output means for outputting an image and a voice corresponding respectively to the image data and the second voice data of the speaker.

2. A TV conference system for performing a TV conference between a plurality of districts using a plurality of TV conference units connected to a network, wherein each of the TV conference units comprises: storage means for storing first voice data, ID data and image data for each participant of the conference; speaker identifying means for identifying, when a voice is emitted in its own district, the speaker who emitted the voice based on second voice data corresponding to the voice and the first voice data stored in the storage means; data transmitting means for transmitting the ID data and the second voice data of the identified speaker to the TV conference units in all other districts; data receiving means for receiving, when a voice is emitted in another district, the ID data and the second voice data from the TV conference unit in that district; means for selecting the image data corresponding to the ID data from the image data stored in the storage means; and audio/image output means for outputting an image and a voice corresponding respectively to the image data and the second voice data of the speaker.

3. A TV conference system for performing a TV conference between a plurality of districts using a plurality of TV conference units connected to a network, wherein each of the TV conference units comprises: storage means for storing first voice data and image data for each participant of the conference; data transmitting means for transmitting, when a voice is emitted in its own district, second voice data corresponding to the voice to the TV conference units in all other districts; data receiving means for receiving, when a voice is emitted in another district, the second voice data from the TV conference unit in that district; speaker identifying means for identifying the speaker who emitted the voice based on the second voice data and the first voice data stored in the storage means; and audio/image output means for outputting an image and a voice corresponding respectively to the image data and the second voice data of the identified speaker.

4. A TV conference system for performing a TV conference between a plurality of districts using a control unit and a plurality of TV conference units connected to a network, wherein the control unit comprises: storage means for storing first voice data and ID data for each participant of the conference; voice data receiving means for receiving, from a TV conference unit, second voice data corresponding to a voice; speaker identifying means for identifying the speaker who emitted the voice based on the received second voice data and the first voice data stored in the storage means; and data transmitting means for transmitting the ID data and the second voice data of the identified speaker to the TV conference units in all districts; and each of the TV conference units comprises: storage means for storing ID data and image data for each participant of the conference; voice data transmitting means for transmitting, when a voice is emitted in its own district, the second voice data corresponding to the voice to the control unit; data receiving means for receiving the second voice data and the ID data from the control unit; means for selecting the image data corresponding to the received ID data from the image data stored in the storage means; and audio/image output means for outputting an image and a voice corresponding respectively to the image data and the second voice data of the speaker.

5. The TV conference system according to claim 1, wherein the storage means of each TV conference unit further stores standard image data to be used as the image data when the speaker is unspecified, and each TV conference unit further comprises: image creating means for creating image data relating to the conference; image mode setting means for setting which image data is adopted when the speaker is unspecified; means for outputting, when the speaker is unspecified and the image mode is the "mode adopting the standard image data when the speaker is unspecified", an image and a voice corresponding respectively to the standard image data and the second voice data; and means for outputting, when the speaker is unspecified and the image mode is the "mode adopting the image data created by the image creating means when the speaker is unspecified", an image and a voice corresponding respectively to the image data created by the image creating means and the second voice data.

6. The TV conference system according to any one of claims 1 to 5, wherein each TV conference unit further comprises: recording means for recording the image data and the second voice data of the speaker; and recording mode setting means for setting whether or not the recording means records the image data and the second voice data of the speaker.

7. The TV conference system according to claim 6, wherein the recording mode setting means further comprises means for setting whether to record only the second voice data or to record both the second voice data and the image data.

8. A TV conference method for performing a TV conference between a plurality of districts using a plurality of TV conference units connected to a network, wherein each TV conference unit executes: a step of storing first voice data and image data for each participant of the conference; a step of identifying, when a voice is uttered in its own district, the speaker who uttered the voice based on second voice data corresponding to the voice and the stored first voice data; a step of transmitting the image data and the second voice data of the identified speaker to the TV conference units in all other districts; a step of receiving, when a voice is uttered in another district, the image data and the second voice data from the TV conference unit in that district; and a step of outputting an image and a voice corresponding respectively to the image data and the second voice data of the speaker.

9. A TV conference method for performing a TV conference between a plurality of districts using a plurality of TV conference units connected to a network, wherein each TV conference unit executes: a step of storing first voice data, ID data and image data for each participant of the conference; a step of identifying, when a voice is uttered in its own district, the speaker who uttered the voice based on second voice data corresponding to the voice and the stored first voice data; a step of transmitting the ID data and the second voice data of the identified speaker to the TV conference units in all other districts; a step of receiving, when a voice is uttered in another district, the ID data and the second voice data from the TV conference unit in that district; a step of selecting the image data corresponding to the ID data from the stored image data; and a step of outputting an image and a voice corresponding respectively to the image data and the second voice data of the speaker.
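In the claims, the speaker identifying means compares the second voice data of an utterance against each participant's registered first voice data. The recognition procedure itself is described with FIG. 3 and is not reproduced in this section, so the sketch below substitutes a simple stand-in. It assumes each voice has already been reduced to a fixed-length feature vector. The closest registered template by cosine similarity wins, provided it reaches a threshold; otherwise the speaker is treated as unspecified. The sketch also shows the claim 2 receiver, which gets only an ID and looks the image up locally. All names, the feature representation and the 0.9 threshold are hypothetical.

```python
from __future__ import annotations

import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_speaker(utterance: list[float],
                     registered: dict[str, list[float]],
                     threshold: float = 0.9) -> Optional[str]:
    """Return the ID of the registered participant whose first-voice template
    best matches the utterance (second voice data), or None (speaker
    unspecified) if no template reaches the threshold."""
    best_id: Optional[str] = None
    best_score = threshold
    for speaker_id, template in registered.items():
        score = cosine(utterance, template)
        if score >= best_score:
            best_id, best_score = speaker_id, score
    return best_id


def on_receive(speaker_id: Optional[str], voice: bytes,
               images: dict[str, str], standard_image: str) -> tuple[bytes, str]:
    """Claim-2 style receiver: only the ID travels over the network; the
    image is looked up in the local storage means."""
    image = images.get(speaker_id, standard_image) if speaker_id else standard_image
    return voice, image
```

Returning `None` rather than the nearest match when nothing reaches the threshold is what lets the "speaker unspecified" path of claim 5 (standard or created image) take over.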