JP2012054693A

JP2012054693A - Video audio transmission/reception device, video audio transmission/reception system, computer program, and recording medium

Info

Publication number: JP2012054693A
Application number: JP2010194458A
Authority: JP
Inventors: Yasuyuki Noda; 靖之野田
Original assignee: Pioneer Electronic Corp
Current assignee: Pioneer Corp
Priority date: 2010-08-31
Filing date: 2010-08-31
Publication date: 2012-03-15

Abstract

PROBLEM TO BE SOLVED: To realize smooth conversation between users. SOLUTION: A video/audio transmission/reception device, used by a user and accommodated in a network at each of a plurality of sites, comprises: local-site user video/audio acquisition means for acquiring video and audio of a local-site user; transmission means for transmitting the acquired video and audio of the local-site user via the network to a video/audio transmission/reception device at another site; other-site user video/audio acquisition means for acquiring video and audio of an other-site user via the network; and control means for controlling predetermined video output means and audio output means so that the acquired video and audio of the other-site user are output together with that portion of local-site user information, generated based on at least one of the acquired video and audio of the local-site user, which is synchronized with the acquired video and audio of the other-site user on the time axis of the other site.

Description

The present invention relates to the technical field of video/audio transmission/reception devices and video/audio transmission/reception systems that allow, for example, a conversation accompanied by video to proceed smoothly between users at remote locations, and of computer programs and recording media for realizing such devices.

One device of this type is the videophone device disclosed in Patent Document 1. That device uses the encoded-and-decoded self-image signal stored in an image memory, which has a variable delay function for motion compensation, inside the self-image encoding circuit. By displaying the encoded-and-decoded self-image on the monitor of the own terminal simultaneously with the received image during communication, the device is said to be able to display the encoded-and-decoded self-image with a simple configuration.

Patent Document 1: JP-A-6-334997

In the videophone device disclosed in Patent Document 1, the video inside the encoder is inserted into the received video as a picture-in-picture (PinP), so multiple decoders are not required. However, the self-image displayed on the monitor of the own terminal is not temporally synchronized with the image of the communication partner.

In data transmission over a network, some transmission delay is usually present. Video and audio at one site are therefore delivered to the other site after a fixed or variable time delay. Likewise, when a user at the other site speaks in response to the delivered video and audio, the video and audio corresponding to that utterance are delivered to the first site only after a corresponding time delay.

A user at one site, however, tends to assume that the content of an utterance has reached the other party the moment the utterance ends. Even a user who understands that this kind of transmission delay exists cannot, in practice, easily tell at what point the utterance reached the other party. In other words, conventional conversation systems that operate over a network have the technical problem that it is difficult for users to grasp how their own utterances line up with those of the other party.

Because of this technical problem, a conversation between users has to continue while the users remain offset from each other on the time axis. This can cause problems in conversation quality that are hard to overlook: for example, the conversation may fail to mesh at all, or may not hold together as a conversation.

The present invention has been made in view of this technical problem. Its object is to provide a video/audio transmission/reception device and a video/audio transmission/reception system that allow a conversation between users over a network to proceed smoothly, and a computer program and a recording medium capable of realizing the device.

To solve the problem described above, the video/audio transmission/reception device of claim 1 is one video/audio transmission/reception device in a video/audio transmission/reception system that includes a plurality of video/audio transmission/reception devices, each accommodated in a network and used by a user at one of a plurality of sites. The device comprises: local-site user video/audio acquisition means for acquiring video and audio of a local-site user; transmission means for transmitting the acquired video and audio of the local-site user via the network to the video/audio transmission/reception device at another site; other-site user video/audio acquisition means for acquiring video and audio of an other-site user via the network; and control means for controlling predetermined video output means and audio output means so that the acquired video and audio of the other-site user are output together with that portion of local-site user information, generated based on at least one of the acquired video and audio of the local-site user, which is synchronized with the acquired video and audio of the other-site user on the time axis of the other site.

To solve the problem described above, the video/audio transmission/reception system of claim 11 includes a plurality of video/audio transmission/reception devices, each accommodated in a network and used by a user at one of a plurality of sites. Each of the devices comprises: local-site user video/audio acquisition means for acquiring video and audio of a local-site user; transmission means for transmitting the acquired video and audio of the local-site user via the network to the video/audio transmission/reception device at another site; other-site user video/audio acquisition means for acquiring video and audio of an other-site user via the network; and control means for controlling predetermined video output means and audio output means so that the acquired video and audio of the other-site user are output together with that portion of local-site user information, generated based on at least one of the acquired video and audio of the local-site user, which is synchronized with the acquired video and audio of the other-site user on the time axis of the other site.

To solve the problem described above, the computer program of claim 12 causes a computer system to function as the video/audio transmission/reception device according to any one of claims 1 to 10.

To solve the problem described above, the recording medium of claim 13 has the computer program of claim 12 recorded on it.

FIG. 1 is a schematic diagram conceptually showing the configuration of a conversation system according to a first example of the present invention.
FIG. 2 is a timing chart illustrating how a conversation between users progresses in a face-to-face conversation that does not go through a network.
FIG. 3 is a timing chart illustrating how a conversation between users progresses in a conversation over a network.
FIG. 4 is a timing chart illustrating how a conversation between users progresses in the first and second examples.
FIG. 5 is a block diagram showing the configuration of the video/audio transmission/reception device in the conversation system of FIG. 1.
FIG. 6 is a block diagram showing the configuration of the transmission/reception device in the video/audio transmission/reception device of FIG. 5.
FIG. 7 is a flowchart of the transmission control executed in the transmission/reception device of FIG. 6.
FIG. 8 is a flowchart of the reception control executed in the transmission/reception device of FIG. 6.
FIG. 9 is a diagram, relating to the effect of the first example, illustrating the video displayed at site A while the reception control of FIG. 8 is executed.
FIG. 10 is a block diagram showing the configuration of a transmission/reception device according to a second example.
FIG. 11 is a conceptual diagram of the utterance data held in the transmission/reception device of FIG. 10.
FIG. 12 is a flowchart of the utterance data generation processing executed in the transmission/reception device of FIG. 10.
FIG. 13 is a flowchart of the video/audio output processing executed in the transmission/reception device of FIG. 10.
FIG. 14 is a diagram, relating to the effect of the second example, illustrating various forms of the utterance state video frame.
FIG. 15 is a block diagram showing the configuration of a transmission/reception device according to a third example.
FIG. 16 is a flowchart of the transmission control executed in the transmission/reception device of FIG. 15.

<Embodiment of the Video/Audio Transmission/Reception Device>

An embodiment of the video/audio transmission/reception device of the present invention is one video/audio transmission/reception device in a video/audio transmission/reception system that includes a plurality of video/audio transmission/reception devices, each accommodated in a network and used by a user at one of a plurality of sites. The device comprises: local-site user video/audio acquisition means for acquiring video and audio of a local-site user; transmission means for transmitting the acquired video and audio of the local-site user via the network to the video/audio transmission/reception device at another site; other-site user video/audio acquisition means for acquiring video and audio of an other-site user via the network; and control means for controlling predetermined video output means and audio output means so that the acquired video and audio of the other-site user are output together with that portion of local-site user information, generated based on at least one of the acquired video and audio of the local-site user, which is synchronized with the acquired video and audio of the other-site user on the time axis of the other site.

The embodiment of the video/audio transmission/reception device of the present invention is one video/audio transmission/reception device that forms part of a video/audio transmission/reception system, that is, a system capable of establishing a conversation among a plurality of users. At the site where it is installed, each video/audio transmission/reception device according to the embodiment is accommodated in the network either at all times or only while some condition is satisfied (the condition is not limited in any way).

The "network" according to the embodiment is a concept that encompasses various data communication networks, for example a WAN (Wide Area Network), a LAN (Local Area Network), or the Internet connected as appropriate via such a WAN or LAN, or via a telephone line, ADSL (Asymmetric Digital Subscriber Line), an optical fiber cable, or the like.

In practical terms, "accommodated" means that the device has communication interfaces conforming to the communication standards of the network and is in a state where, at all times or under limited conditions, it can exchange various information, data, data files, and the like with the other video/audio transmission/reception devices through those interfaces. Data may be exchanged over the network via an appropriate server device, or without a server device, as in P2P (peer-to-peer) communication. It follows that the components of the video/audio transmission/reception device may likewise take various forms.

In the embodiment of the video/audio transmission/reception device of the present invention, the local-site user video/audio acquisition means acquires the video and audio of the local-site user, that is, the user who takes part in the conversation at the local site (the one site, among the plurality of sites, at which the video/audio transmission/reception device according to the embodiment is installed). The acquired video and audio of the local-site user are synchronized with each other on the time axis of the local site.

The transmission means transmits the acquired video and audio of the local-site user via the network to the video/audio transmission/reception device at another site (a site, among the plurality of sites, other than the local site). As one preferred form, the video and audio may be transmitted after compression encoding based on a predetermined standard.

Meanwhile, in the embodiment, the other-site user video/audio acquisition means acquires, via the network, the video and audio of the other-site user, that is, the user who takes part in the conversation at the other site (the conversation partner of the local-site user).

As one preferred form, the individual video/audio transmission/reception devices that make up the system are equivalent with respect to the matters specifying the invention, and the video and audio of the other-site user acquired via the network are those acquired by the video/audio transmission/reception device at the other site.

In the embodiment, video output means such as display devices and audio output means such as speakers are basically controlled based on the acquired video and audio of the other-site user. (These output means may be completely independent hardware, or may be at least partly integrated.) As a result, the video of the other-site user is output at the local site together with audio synchronized with that video on the time axis, and is perceived by the local-site user. That is, while the other-site user is speaking, the content of the utterance is conveyed to the local-site user together with video.

In data transmission over a network, it is well known that a network transmission delay arises in addition to the processing delay of the video/audio transmission/reception devices themselves. For example, an utterance by the local-site user is perceived by the other-site user later than the time at which it occurred at the local site, by an amount corresponding to, for example, the one-way transmission delay between the sites (from the local site to the other site) and the processing delay of the device at the other site. Likewise, an utterance by the other-site user is perceived by the local-site user after a delay corresponding to, for example, the one-way transmission delay (from the other site to the local site) and the processing delay of the device at the local site. Consequently, a time lag arises when users actually carry on a conversation.
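
The delay arithmetic described above can be illustrated with a minimal Python sketch. The names `PathDelay`, `perceived_time_ms`, and `reply_arrival_ms`, the millisecond units, and the `think_ms` parameter are assumptions introduced here for illustration; the publication does not define them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathDelay:
    """One-way delay from a speaker to a listener (hypothetical model)."""
    network_ms: float     # one-way transmission delay of the network
    processing_ms: float  # processing delay of the receiving device

    @property
    def total_ms(self) -> float:
        return self.network_ms + self.processing_ms


def perceived_time_ms(utterance_time_ms: float, path: PathDelay) -> float:
    """Time at which the listener perceives an utterance."""
    return utterance_time_ms + path.total_ms


def reply_arrival_ms(utterance_time_ms: float, forward: PathDelay,
                     backward: PathDelay, think_ms: float = 0.0) -> float:
    """Time at which the original speaker perceives the listener's reply.

    think_ms is the (assumed) pause before the listener starts replying.
    """
    heard_at = perceived_time_ms(utterance_time_ms, forward)
    return perceived_time_ms(heard_at + think_ms, backward)
```

Under these assumptions, the speaker perceives a reply later than in a face-to-face conversation by the sum of both one-way delays, which is the time lag the paragraph describes.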

Whether or not the users are aware in advance that such delays occur, this time lag causes the two sides of a conversation to talk past each other, and it degrades conversation quality by making the conversation less comfortable. This is because it is nearly impossible for a user to keep track, continuously, of exactly when the other party perceived each of the user's own utterances. At the same time, the time lag itself cannot be eliminated: it is unavoidable as long as data is transmitted over a network.

In view of these points, making a conversation that involves data transmission over a network proceed smoothly requires a measure that keeps each user from perceiving this time lag. The embodiment of the video/audio transmission/reception device of the present invention has a configuration for realizing such a measure.

That is, the embodiment of the video/audio transmission/reception device of the present invention includes control means. The control means controls the predetermined video output means and audio output means so that the video and audio of the other-site user acquired by the other-site user video/audio acquisition means are output together with that portion of the local-site user information, generated based on at least one of the acquired video and audio of the local-site user, which is synchronized with the acquired video and audio of the other-site user on the time axis of the other site.

The local-site user information is information associated with the state, motion, appearance, or actions of the local-site user, and is generated based on at least one of the acquired video and audio of the local-site user (that is, the video, the audio, or both).

For example, the local-site user information may be the video of the local-site user itself, or a derived or similar video obtained by applying image processing or the like to that video. Such a derived or similar video may be, for example, a reduced version of the original video, or an indicator, icon, mark, or the like that represents, according to some criterion, the local-site user's speaking activity (or simply the user's state without speech). When the video is reduced, its resolution may be lowered. Alternatively, the local-site user information may be the audio of the local-site user itself, or a signal tone, alarm, or the like that is output according to the user's audio level (simply put, the sound pressure level) or in step with the timing of utterances.
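
One of the forms mentioned above, an indicator driven by the user's audio level, might be sketched as follows. This is only an illustration under stated assumptions: the frame is a sequence of samples normalized to [-1.0, 1.0], the level is measured as digital RMS in dB relative to full scale rather than as acoustic sound pressure, and the function names and the -40 dB threshold are hypothetical.

```python
import math
from typing import Sequence


def rms_level_dbfs(samples: Sequence[float]) -> float:
    """RMS level of one audio frame, in dB relative to full scale.

    Returns -inf for an empty or completely silent frame.
    """
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    # 10*log10 of the mean square equals 20*log10 of the RMS amplitude.
    return 10.0 * math.log10(mean_square)


def speaking_indicator(samples: Sequence[float],
                       threshold_dbfs: float = -40.0) -> bool:
    """True if the frame is loud enough to show a 'speaking' indicator."""
    return rms_level_dbfs(samples) >= threshold_dbfs
```

A boolean per frame is far smaller than a video frame, which suggests why an indicator or icon can stand in for the user's own video when bandwidth or screen space is limited.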

The local-site user information may be generated at the local site, at the other site, or at an appropriate server device or the like interposed between them in the network. Time synchronization between the generated local-site user information and the video and audio of the other-site user acquired by the other-site user video/audio acquisition means may be performed at the other site before the video and audio of the other-site user are transmitted, or at the local site after they have been acquired. In other words, the essential technical idea of the embodiment is that the video and audio of the other-site user are output at the local site together with local-site user information that is time-synchronized with them.
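
For the option in which synchronization is performed at the local site after the other-site video and audio arrive, one conceivable approach is to keep a short history of local frames and look up the frame captured one round trip earlier. The sketch below is an assumption-laden illustration, not the method of the publication: the class `LocalHistory`, the helper `synced_local_frame`, and the availability of a round-trip delay estimate are all hypothetical.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class LocalHistory:
    """Bounded history of locally captured frames (hypothetical helper)."""

    def __init__(self, max_age_ms: int) -> None:
        self._max_age_ms = max_age_ms
        self._frames: Deque[Tuple[int, str]] = deque()

    def push(self, captured_at_ms: int, frame: str) -> None:
        """Store a frame and evict frames older than max_age_ms."""
        self._frames.append((captured_at_ms, frame))
        while captured_at_ms - self._frames[0][0] > self._max_age_ms:
            self._frames.popleft()

    def frame_at(self, target_ms: int) -> Optional[str]:
        """Latest stored frame captured at or before target_ms."""
        best: Optional[str] = None
        for captured_at_ms, frame in self._frames:
            if captured_at_ms > target_ms:
                break
            best = frame
        return best


def synced_local_frame(history: LocalHistory, now_ms: int,
                       round_trip_ms: int) -> Optional[str]:
    """Local frame the other-site user was perceiving when the other-site
    frame shown now was captured (one round trip before now)."""
    return history.frame_at(now_ms - round_trip_ms)
```

This variant needs both a buffer and a delay estimate at the local site, which is the cost that generating the information at the other site avoids.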

With the video/audio transmission/reception device according to the embodiment, the local-site user, when perceiving the video and audio of the other-site user, can simultaneously perceive video, audio, and the like relating to the local-site user that are synchronized with the time axis of the other site (put simply, video and audio of the user's own past self). The local-site user can therefore carry on the conversation while recognizing in real time which of his or her own utterances the other-site user is responding to, and at what timing. That is, the conversation can proceed smoothly.

At least some of the means provided in the video/audio transmission/reception device according to the first embodiment may be configured, individually or as a whole, as arithmetic processing units such as a CPU (Central Processing Unit) or MPU (Micro Processing Unit), as various processors or controllers, or as various functional modules.

In one aspect of the embodiment of the video/audio transmission/reception device of the present invention, the local-site user information is generated by the video/audio transmission/reception device at the other site based on at least one of the transmitted video and audio of the local-site user. The local-site user information generated there is transmitted via the network together with the video and audio of the other-site user acquired at the other site. The other-site user video/audio acquisition means acquires the transmitted video and audio of the other-site user and the local-site user information, and the control means controls the video output means and the audio output means so that the acquired video and audio of the other-site user and the local-site user information are output.

According to this aspect, the local-site user information is generated at the other site based on the video and audio of the local-site user transmitted from the local site. The local-site user information generated at the other site is transmitted to the video/audio transmission/reception device according to the embodiment together with the video and audio of the other-site user acquired by the device at the other site, and is used by the control means to control the output means.

That is, according to this aspect, the control means only has to output the video and audio of the other-site user and the local-site user information sent from the video/audio transmission/reception device at the other site, which reduces the control load on the device according to the embodiment.

According to this aspect, deliberately having the video/audio transmission/reception device at the other site generate the local-site user information, based on the video, the audio, or both of the local-site user, makes it substantially unnecessary to perform synchronization processing that aligns the generated local-site user information with the video and audio of the other-site user on the time axis of the other site. That is, counting the data from which it is derived, the local-site user information travels from the local site to the other site and back to the local site, and so is synchronized more or less automatically, on the time axis of the other site, with the video and audio of the other-site user acquired at the other site.
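
The round-trip behavior described above can be checked with a small discrete-time simulation. It is a sketch under simplifying assumptions: time advances in integer ticks on a shared clock, delays are constant, and a `Frame` stands in for both the video/audio and the user information derived from it. The names `Frame`, `Bundle`, and `simulate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Frame:
    site: str
    captured_at: int  # capture time in ticks


@dataclass(frozen=True)
class Bundle:
    """What site B sends back: its own frame plus info derived from the
    site A frame that was arriving at B at the same instant."""
    remote_frame: Frame
    local_user_info: Optional[Frame]


def simulate(ticks: int, delay_a_to_b: int,
             delay_b_to_a: int) -> Dict[int, Bundle]:
    """Return the bundle displayed at site A at each tick."""
    shown: Dict[int, Bundle] = {}
    for t in range(ticks):  # t is the current time at site B
        source_tick = t - delay_a_to_b
        # Site A frame arriving at B right now, if any has arrived yet.
        echoed = Frame("A", source_tick) if source_tick >= 0 else None
        bundle = Bundle(Frame("B", t), echoed)
        arrival = t + delay_b_to_a
        if arrival < ticks:
            shown[arrival] = bundle
    return shown
```

In every bundle the two frames are paired at site B without any timestamp comparison, and site A needs no buffer: the user's own past frame simply returns alongside the other-site frame it accompanied.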

To elaborate, the time lag described above arises from the transmission delay of the network, the processing delay of each video/audio transmission/reception device, and so on. Generating the local-site user information from at least one of the video and audio of the local-site user that have traveled the regular data transmission path, which includes the network and the devices, therefore makes it possible to cancel out this time lag without any special time synchronization processing in the device at the other site. This reduces the processing load on the video/audio transmission/reception system as a whole, which is advantageous.

In particular, according to this aspect, the local-site user information is generated each time from video, audio, or both of the local-site user that have actually been transmitted over the network, and is returned each time to the local site over the network together with the video and audio of the other-site user. Therefore, even if the transmission state or communication bandwidth of the network changes for some reason, the local-site user information and the video and audio of the other-site user acquired at the local site always remain temporally synchronized. That is, the conversation quality is robust to the state of the network.

Furthermore, the time synchronization between the own-site user information and the other-site user's video and audio is maintained at all times by the very elements that cause the transmission delay, namely the network itself and the video/audio transmission/reception device at the other site. The device according to the embodiment therefore does not need to hold the acquired video and audio of the own-site user, or the generated own-site user information, for any fixed or indefinite period, and needs no buffer for that purpose. That is, the device configuration can be kept from growing.

In this aspect, the device may further comprise other-site user information generation means for generating other-site user information based on at least one of the acquired video and audio of the other-site user, and the transmission means may transmit the generated other-site user information together with the acquired video and audio of the own-site user.

When the own-site device and the other-site device thus generate, respectively, the other-site user information and the own-site user information to be output by the other-site device and the own-site device, a smooth conversation is achieved for both the own-site user and the other-site user, and the practical benefit of the video/audio transmission/reception system is all the greater.

Conversely, in cases where, for example, the own-site user clearly holds a dominant or superior position relative to the other-site user, the benefit of having the counterpart device generate user information may be provided to the own-site user alone. That is, depending on the environment in which the video/audio transmission/reception system is used, the quality of the conversation between users may be ensured even when the device according to the embodiment has no other-site user information generation means.

In another aspect of the embodiment of the video/audio transmission/reception device of the present invention, the device comprises: specifying means for specifying a delay time required for data transmission over the network to and from the video/audio transmission/reception device at the other site; own-site user information generation means for generating the own-site user information based on at least one of the acquired video and audio of the own-site user; and selection means for selecting, based on the specified delay time, the information that is synchronized on the time axis of the other site with the acquired video and audio of the other-site user, from among the own-site user information generated by the own-site user information generation means. The control means controls the video output means and the audio output means so that the acquired video and audio of the other-site user are output together with the selected information.

In the aspect described above, the time synchronization, on the time axis of the other site, between the other-site user's video and audio acquired at the other site and the own-site user information based on the own-site user's video, audio, or both was achieved simply, in the course of ordinary data transmission processing in the device at the other site. In the present aspect, instead of or in addition to that configuration, this time synchronization is achieved by the operation of the own site (that is, by the operation of the video/audio transmission/reception device according to the embodiment).

According to this aspect, the specifying means specifies the delay time required for data transmission between the sites over the network. This delay time is the information needed to define the time axis of the other site. In a preferred form, it is determined predominantly by the transmission delay on the network, such as the RTT (Round Trip Time), and by the processing delay of each video/audio transmission/reception device.

Here, "specifying" is a broad concept covering any way of ultimately fixing the target as a reference value, and its practical form is not limited. For example, the specifying means may measure the RTT using the SR (Sender Report) or RR (Receiver Report) of RTCP (Real-time Transport Control Protocol). The processing delay may also be estimated from device information (such as processing speed or busy state) that the video/audio transmission/reception device holds as its own performance values.
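As one possible illustration of the RTCP-based measurement mentioned above, the following minimal sketch computes a round-trip time from the fields of a receiver report, using the relation RTT = A − LSR − DLSR defined for RTCP in RFC 3550, where all three values are 32-bit quantities in units of 1/65536 second. The function names and the simple addition of processing delays are assumptions made for illustration only; the patent does not prescribe them.

```python
def rtt_from_receiver_report(arrival: int, lsr: int, dlsr: int) -> float:
    """Return the round-trip time in seconds.

    arrival: middle 32 bits of the NTP time at which the report arrived (A)
    lsr:     "last SR timestamp" field echoed in the receiver report
    dlsr:    "delay since last SR" field of the receiver report
    All values are in units of 1/65536 s, so the result is (A - LSR - DLSR) / 65536.
    """
    units = (arrival - lsr - dlsr) & 0xFFFFFFFF  # wrap like 32-bit arithmetic
    return units / 65536.0


def total_delay(rtt_s: float, own_processing_s: float, other_processing_s: float) -> float:
    """Hypothetical helper: delay time as RTT plus both devices' processing delays."""
    return rtt_s + own_processing_s + other_processing_s


# Example values taken from the worked example in RFC 3550, section 6.4.1.
rtt = rtt_from_receiver_report(0xB7108000, 0xB7052000, 0x00054000)
print(rtt)
```

How the processing delays are estimated (for example, from processing speed or busy state) is left open here, as it is in the text.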

According to this aspect, the own-site user information, which can take the various forms described above, is generated at the own site by the own-site user information generation means.

When the own-site user information is generated at the own site, unlike when it is generated at the other site, time synchronization with the acquired video and audio of the other-site user is required. This is a process of appropriately selecting the own-site user information that corresponds to the own-site user's utterance which gave rise to the other-site user's video and audio that will shortly be acquired or have already been acquired. According to this aspect, this selection is performed by the selection means.

The selection means achieves this time synchronization based on the delay time specified by the specifying means. As noted above, this delay time is information that defines the time axis of the other site, so the selection means can always select the appropriate own-site user information. In a preferred form, the selection exploits the time margin afforded by the delay: an appropriate amount of own-site user information is generated before it is needed and stored in suitable storage means. The selection is not limited to this form, however. It may also include controlling, based on the specified delay time, the timing at which the own-site user information generation means generates the own-site user information, so that the generated information is always synchronized in time with the acquired video and audio of the other-site user.
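The preferred form described above, in which own-site user information is stored ahead of time and then picked out according to the delay time, might be sketched as follows. This is only an illustrative assumption of how such a selection could work: the class name `OwnSiteInfoSelector`, the time-stamped deque, and the "latest entry not newer than now minus delay" rule are hypothetical and do not come from the patent.

```python
from collections import deque
from typing import Any, Deque, Optional, Tuple


class OwnSiteInfoSelector:
    """Hypothetical selection means backed by a small time-stamped buffer."""

    def __init__(self, max_age_s: float = 10.0) -> None:
        self._buf: Deque[Tuple[float, Any]] = deque()
        self._max_age_s = max_age_s

    def store(self, t: float, info: Any) -> None:
        """Store own-site user information generated at own-site time t."""
        self._buf.append((t, info))
        # Drop entries that are older than any delay we expect to handle.
        while self._buf and t - self._buf[0][0] > self._max_age_s:
            self._buf.popleft()

    def select(self, now: float, delay_s: float) -> Optional[Any]:
        """Return the newest stored entry generated at or before now - delay_s."""
        target = now - delay_s
        chosen = None
        for t, info in self._buf:
            if t <= target:
                chosen = info
            else:
                break
        return chosen


selector = OwnSiteInfoSelector()
for t, label in [(0.0, "A1"), (1.0, "A2"), (2.0, "A3")]:
    selector.store(t, label)
print(selector.select(now=2.0, delay_s=1.5))
```

With a specified delay of 1.5 seconds, the information shown alongside the other-site user's current video and audio is the one the own-site user produced 1.5 seconds earlier.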

As a result, according to this aspect, when recognizing the other-site user's video and audio, the own-site user can simultaneously recognize various video, audio, and the like concerning the own-site user that are synchronized with the time axis of the other site (put simply, video and audio of the user's own past self). The user at the own site (or the other site) can therefore carry on the conversation while recognizing, in real time, which of the user's own utterances the user at the other site (or the own site) responded to, and at what timing. That is, the conversation can proceed smoothly.

In another aspect of the embodiment of the video/audio transmission/reception device of the present invention, the own-site user information is video corresponding to the acquired video of the own-site user.

According to this aspect, since the own-site user information is generated as video corresponding to the own-site user's video, the user at the own site (and preferably also the user at the other site) can grasp visually and intuitively which of the user's own utterances the user at the other site (and preferably also at the own site) responded to, and at what timing.

In this aspect, the video corresponding to the acquired video of the own-site user may be the acquired video of the own-site user with a reduced display area.

Reducing the display area in this way keeps the own-site user information from obstructing the display of the other-site user's video, so conversation quality is suitably maintained.
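As a rough illustration of a reduced display area, the sketch below shrinks a frame by keeping every n-th pixel in each direction. It assumes, purely for illustration, that a frame is a list of rows of pixel values and that plain subsampling is acceptable; the function name `shrink_frame` and the integer `factor` parameter are hypothetical, and a real device would more likely use a proper scaler.

```python
from typing import List, Sequence


def shrink_frame(frame: Sequence[Sequence[int]], factor: int) -> List[List[int]]:
    """Return a smaller copy of frame, keeping every factor-th row and column."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return [list(row[::factor]) for row in frame[::factor]]


# A 4x4 test frame whose pixel values are simply their index.
frame = [[r * 4 + c for c in range(4)] for r in range(4)]
small = shrink_frame(frame, 2)
print(small)
```

The reduced frame could then be overlaid in a corner of the other-site user's video.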

In another aspect of the embodiment of the video/audio transmission/reception device of the present invention, the own-site user information is video corresponding to the acquired audio of the own-site user.

Here, video corresponding to audio can take various forms, such as an audio level, an icon or mark indicating whether the user is speaking, an icon or mark indicating whether the utterance has ended, or a numerical value or indicator showing the remaining time of the utterance. In any case, the amount of information is markedly smaller than when the own-site user information is generated from the user's video, so the load of the output control performed by the control means is comparatively light.

In addition, compared with generating the own-site user information by processing the user's video in some way, this aspect leaves the actual form of the displayed video open, which gives flexibility to the screen layout. Customization by the user is also comparatively easy.

In the aspect in which the own-site user information is generated at the own site, unlike when it is generated at the other site, it may be necessary to hold the acquired video or audio, or the own-site user information, for at least the delay time specified by the specifying means (in the aspect in which the information is generated at the other site, the network in effect plays this role). Various known storage means can serve as this kind of buffer, such as high-speed memory with a volatile storage area or an HDD with a nonvolatile storage area, but the smaller the load of holding the data, the better. In that respect, the present aspect, in which the amount of own-site user information is small, is particularly effective when applied to a configuration in which the own-site user information is generated at the own site.

In this aspect, the video corresponding to the acquired audio of the own-site user may be video representing the speech state of the own-site user.

When the video corresponding to the own-site user's audio represents the own-site user's speech state, it becomes easier to judge which of one's own utterances the other-site user's video and audio respond to, which further improves conversation quality. In some cases, moreover, no special processing is needed during periods when the own-site user is not speaking, which can reduce the processing load of the device as a whole.

The speech state includes qualitative states, such as whether the user is speaking, and quantitative states, such as the loudness of the voice or the extent of gestures during the utterance. Video representing the speech state means, for example, an icon, indicator, or mark showing whether the user is speaking, the remaining time of the utterance, or the audio level.
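A speech state of this kind could, for instance, be derived from a short block of audio samples as sketched below. This is an illustrative assumption rather than the method of the patent: the function name `speech_state`, the RMS level in decibels relative to full scale, and the −40 dB threshold are all hypothetical choices, and samples are assumed to be floats in the range −1.0 to 1.0.

```python
import math
from typing import Dict, Sequence, Union


def speech_state(samples: Sequence[float],
                 threshold_db: float = -40.0) -> Dict[str, Union[bool, float]]:
    """Return a qualitative flag (speaking or not) and a quantitative level in dB."""
    if not samples:
        return {"speaking": False, "level_db": float("-inf")}
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    level_db = 20.0 * math.log10(rms) if rms > 0.0 else float("-inf")
    return {"speaking": level_db >= threshold_db, "level_db": level_db}


silence = [0.0] * 160
voice = [0.5, -0.5] * 80
print(speech_state(silence)["speaking"], speech_state(voice)["speaking"])
```

The returned flag could drive a "speaking" icon and the level could drive a level indicator, both of which need far less data than a video frame.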

In this aspect, the video corresponding to the acquired audio of the own-site user may be video representing the time remaining until the own-site user completes the utterance.

The timing at which the own-site user finishes speaking can be closely related to the timing at which the other-site user starts speaking. The time remaining until the utterance is complete, which lets the own-site user perceive the end-of-utterance timing effectively, is therefore well suited as video corresponding to the own-site user's audio.

In another aspect of the embodiment of the video/audio transmission/reception device of the present invention, the own-site user information is a notification sound announcing that the own-site user's utterance has been completed.

The timing at which the own-site user finishes speaking can be closely related to the timing at which the other-site user starts speaking. A notification sound announcing the end of the utterance is therefore well suited as own-site user information. In particular, such a notification sound requires little information relative to its effect, so it helps maintain conversation quality while keeping the device configuration from growing.

<Embodiment of the Video/Audio Transmission/Reception System>

An embodiment of the video/audio transmission/reception system of the present invention comprises a plurality of video/audio transmission/reception devices, each accommodated in a network at one of a plurality of sites and used by a user. Each of the devices comprises: own-site user video/audio acquisition means for acquiring video and audio of an own-site user; transmission means for transmitting the acquired video and audio of the own-site user over the network to the video/audio transmission/reception device at another site; other-site user video/audio acquisition means for acquiring video and audio of an other-site user over the network; and control means for controlling predetermined video output means and audio output means so that the acquired video and audio of the other-site user are output together with that part of the own-site user information, generated based on at least one of the acquired video and audio of the own-site user, which is synchronized with the acquired video and audio of the other-site user on the time axis of the other site.

Because the embodiment of the video/audio transmission/reception system comprises a plurality of the video/audio transmission/reception devices according to the embodiment, it provides the same practical benefits as the embodiment of the device, and smooth conversation between users over the network is achieved.

<Embodiment of the Computer Program>

An embodiment of the computer program of the present invention causes a computer system to function as any one of the video/audio transmission/reception devices described above.

The embodiment of the video/audio transmission/reception device of the present invention described above can be realized relatively easily by loading the computer program into a computer system and executing it, either from a recording medium that stores the program, such as a ROM, CD-ROM, DVD-ROM, or hard disk, or from a solid-state storage device removable from the computer system, such as a USB (Universal Serial Bus) memory. The same applies if the program is downloaded to the computer system via, for example, communication means and then executed.

The embodiment of the computer program of the present invention can also take various aspects corresponding to the various aspects of the embodiment of the video/audio transmission/reception device of the present invention described above.

<Embodiment of the Recording Medium>

An embodiment of the recording medium of the present invention has the embodiment of the computer program recorded on it.

According to the embodiment of the recording medium of the present invention, the recorded embodiment of the computer program of the present invention can be read into and executed by a computer system by mounting the medium on or connecting it to the computer system, or by inserting it into a suitable reading device provided in or connected to the computer system. Each embodiment of the video/audio transmission/reception device of the present invention described above can thus be realized relatively easily.

As described above, the first embodiment of the video/audio transmission/reception device of the present invention comprises the own-site user video/audio acquisition means, the transmission means, the other-site user video/audio acquisition means, and the control means, and therefore achieves smooth conversation between users over the network.

As described above, the embodiment of the video/audio transmission/reception system of the present invention comprises a plurality of the video/audio transmission/reception devices according to the embodiment of the present invention, and therefore achieves smooth conversation between users over the network.

As described above, the embodiment of the computer program of the present invention allows a computer system to function as the embodiment of the video/audio transmission/reception device of the present invention.

As described above, the embodiment of the recording medium of the present invention records the embodiment of the computer program of the present invention, and thereby allows a computer system to function as the embodiment of the video/audio transmission/reception device of the present invention.

These effects and other advantages of the present invention will become apparent from the embodiments described next.

Various preferred embodiments of the present invention are described below with reference to the drawings as appropriate.
<First Embodiment>
<Configuration of the Embodiment>
First, the configuration of the conversation system 1 according to the first embodiment of the present invention is described with reference to FIG. 1. FIG. 1 is a schematic configuration diagram conceptually showing the configuration of the conversation system 1.

In FIG. 1, the conversation system 1 is an example of the "video/audio transmission/reception system" according to the present invention. It is accommodated in a wide area network 10 (an IP (Internet Protocol) network) that connects site A (an example of the "own site" according to the present invention) and site B (an example of the "other site" according to the present invention), which are remote from each other.

The conversation system 1 consists of a video/audio transmission/reception device 30A (an example of the "video/audio transmission/reception device" according to the present invention), which is installed at site A and used by a user 20A located at site A (an example of the "own-site user" according to the present invention), and a video/audio transmission/reception device 30B (an example of the "video/audio transmission/reception device at the other site" according to the present invention), which is installed at site B and used by a user 20B located at site B (an example of the "other-site user" according to the present invention). Using the conversation system 1, the user 20A and the user 20B can carry on a conversation while seeing and hearing each other's video and audio.

To make the detailed configuration of the conversation system 1 easier to understand, an outline of its operation is first described with reference to FIGS. 2 and 3. FIG. 2 is a timing chart illustrating the progress of a face-to-face conversation between users that does not pass through a network. FIG. 3 is a timing chart illustrating the progress of a conversation between users over a network. In these figures, parts that also appear in FIG. 1 or in each other are given the same reference numerals, and their description is omitted as appropriate.

In FIG. 2, the user 20A (site A) is shown on the left and the user 20B (site B) on the right. The vertical axis is time, and a time axis is set for each of site A and site B.

Suppose that site A and site B are close enough for the users to talk to each other without going through the network 10 (for example, in the same conference room). In this case, the user 20A and the user 20B can converse face to face, that is, exchange utterances while facing each other and observing each other's expressions and gestures.

That is, when utterances A1, A2, and A3 are made in sequence along the time axis at site A, the user 20B at site B can perceive the content of the user 20A's utterances immediately and can make the corresponding utterances B1, B2, and B3 in sequence. The utterances of both users are synchronized on the time axis, and the users can converse smoothly with each other.

In contrast, FIG. 3 shows a configuration in which site A and site B are connected via the network 10 (that is, the configuration according to the present embodiment).

In the conversation environment shown in FIG. 3, utterance A1 of the user 20A at site A is output at site B as utterance A1B (meaning utterance A1 as output at site B). It is delayed from the time A1 occurred at site A by a delay time DLY_AB, which includes the transmission delay of the network 10, the processing delay of the video/audio transmission/reception device 30B, and so on.

Suppose the user 20B responds to utterance A1B with utterance B1. Utterance B1 of the user 20B at site B is then output at site A as utterance B1A (meaning utterance B1 as output at site A), delayed from the time B1 occurred at site B by a delay time DLY_BA, which includes the transmission delay of the network 10, the processing delay of the video/audio transmission/reception device 30A, and so on. Compared with the face-to-face conversation illustrated in FIG. 2, an additional delay of DLY_AB + DLY_BA (an example of the "delay time" according to the present invention) therefore elapses between the user 20A making utterance A1 and the user 20A perceiving utterance B1.

Meanwhile, on the time axis of site A, utterances A2 and A3 follow utterance A1 in sequence, and utterance B1A is output at a timing that overlaps utterance A3. However, utterance B1, on which B1A is based, was originally made in response to the user 20A's utterance A1, so from the user 20A's point of view it is out of sync on the time axis. The same phenomenon can occur at site B. That is, in a conversation over the network 10, unless some countermeasure is taken, a time lag arises in the conversation between the user 20A and the user 20B, and smooth conversation becomes difficult.

The conversation systems according to the first embodiment and the second embodiment described later are intended to solve this problem. Before each embodiment is described in detail, the principle by which these embodiments solve the problem is explained with reference to FIG. 4. FIG. 4 is a timing chart illustrating the progress of a conversation between users in the first and second embodiments. In the figure, parts that also appear in FIG. 3 are given the same reference numerals, and their description is omitted as appropriate.

In FIG. 4, according to the conversation systems of the first and second embodiments, information corresponding to an utterance of the user 20A at site A as output at site B (for example, A1B; in FIG. 4, utterance A1B itself) is output at site A while keeping its time relationship with the utterance of the user 20B at site B (for example, B1), that is, while remaining synchronized on the time axis of site B (see utterance A1BA in the figure). In the conversation between the user 20A and the user 20B, the other party's utterances are therefore output late at both sites, but the relationship between the utterances is never disturbed at either site. Regardless of how large the delay is, the user at each site can always confirm which of the user's own utterances the other party was responding to, and at what timing, so a smooth conversation becomes possible.
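The principle can be checked with simple arithmetic. The sketch below is a hypothetical model, not part of the patent: it assumes fixed one-way delays DLY_AB and DLY_BA in milliseconds and a fixed reaction time for the user 20B, and the function name `timeline` and the dictionary keys are chosen only for illustration. It computes when each utterance is output at each site and shows that the interval between A1BA and B1A at site A equals the interval between A1B and B1 at site B.

```python
from typing import Dict, List


def timeline(a_utterances_ms: List[int], reaction_ms: int,
             dly_ab_ms: int, dly_ba_ms: int) -> List[Dict[str, int]]:
    """Compute output times (ms) for each utterance of user 20A and the reply to it."""
    rows = []
    for t_a in a_utterances_ms:
        a_at_b = t_a + dly_ab_ms          # A1B: A's utterance output at site B
        b_reply = a_at_b + reaction_ms    # B1: B's reply, on site B's time axis
        rows.append({
            "A": t_a,
            "AB": a_at_b,
            "B": b_reply,
            "ABA": a_at_b + dly_ba_ms,    # A1BA: A's utterance returned to site A
            "BA": b_reply + dly_ba_ms,    # B1A: B's reply output at site A
        })
    return rows


for row in timeline([0, 2000], reaction_ms=1000, dly_ab_ms=300, dly_ba_ms=400):
    # The same interval separates the utterance and the reply at both sites.
    print(row["B"] - row["AB"], row["BA"] - row["ABA"])
```

Because both A1BA and B1A are shifted by the same DLY_BA, the interval is preserved whatever the delay values are; only the absolute output times move.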

FIG. 4 also shows that the time relationship observed at site A, between the utterances B1A, B2A, and B3A of the user 20B made at site B and the utterances A1BA, A2BA, and A3BA of the user 20A as output at site B, matches the time relationship observed at site B, between the utterances B1, B2, and B3 of the user 20B made at site B and the utterances A1B, A2B, and A3B of the user 20A made at site A.
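The preservation of the time relationship can be checked with simple arithmetic. The following Python sketch is illustrative only: the delay values, the utterance times, and the function names are assumptions that do not appear in the specification, and all times are taken to be measured against a common reference clock.

```python
# Hypothetical one-way delays in milliseconds (not taken from the specification).
DELAY_A_TO_B_MS = 400
DELAY_B_TO_A_MS = 300


def timeline_at_b(a_utterances_ms, b_utterances_ms):
    """Output times at site B: A's utterances arrive late, B's are local."""
    timeline = {f"A{i}B": t + DELAY_A_TO_B_MS
                for i, t in enumerate(a_utterances_ms, 1)}
    timeline.update({f"B{i}": t for i, t in enumerate(b_utterances_ms, 1)})
    return timeline


def timeline_at_a(site_b_timeline):
    """Output times at site A: the whole site-B timeline shifted by one delay."""
    return {name + "A": t + DELAY_B_TO_A_MS
            for name, t in site_b_timeline.items()}


def gap(timeline, first, second):
    """Signed interval from utterance `first` to utterance `second`."""
    return timeline[second] - timeline[first]


site_b = timeline_at_b([0, 2000], [1000, 3000])
site_a = timeline_at_a(site_b)
print(gap(site_b, "A1B", "B1"), gap(site_a, "A1BA", "B1A"))  # prints: 600 600
```

Because every entry of the site-B timeline is shifted by the same B-to-A delay, the interval between A1BA and B1A at site A equals the interval between A1B and B1 at site B, whatever the delay values are.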

With these points in mind, the detailed configuration and operation of each embodiment are described below.

First, the configuration of the video/audio transmission/reception device 30A (which has the same hardware configuration as the video/audio transmission/reception device 30B) is described with reference to FIG. 5. FIG. 5 is a block diagram of the video/audio transmission/reception device 30A. In FIG. 5, portions that are the same as in FIG. 1 are given the same reference numerals, and their description is omitted as appropriate.

In FIG. 5, the video/audio transmission/reception device 30A (or 30B) includes a camera 31, a microphone 32, a monitor 33, a speaker 34, and a transmission/reception device 100.

The camera 31 is an imaging device configured to capture images of the user 20A. Its imaging lens (not shown) is arranged facing the upper body of the user 20A, including the face, so that the upper body of the user 20A can be captured. The video of the user 20A obtained by this imaging is sent to the transmission/reception device 100 as a frame-by-frame video signal.

The microphone 32 is a sound collecting device configured to pick up the voice of the user 20A. Its sound collecting part is directed toward the face of the user 20A so that the voice of the user 20A can be picked up. The voice of the user 20A obtained in this way is sent to the transmission/reception device 100 as an audio signal.

The monitor 33 is a video display device such as a liquid crystal display, and is configured to display video in a predetermined display area in accordance with a predetermined video signal. The monitor 33 is electrically connected to the transmission/reception device 100, which controls its video output.

The speaker 34 is an audio output device configured to output audio in accordance with a predetermined audio signal. The speaker 34 is electrically connected to the transmission/reception device 100, which controls its audio output.

The transmission/reception device 100 is a computer system configured to control the overall operation of the video/audio transmission/reception device 30A and, in the narrow sense, functions as an example of the "video/audio transmission/reception device" according to the present invention.

The transmission/reception device 100 includes an electronic control unit containing a CPU, a ROM, a RAM, and the like (none of which are shown). The ROM stores various control programs executable by the CPU, and when the CPU executes these control programs, the transmission/reception device 100 functions as an example of the "video/audio transmission/reception device" according to the present invention. These control programs are an example of the "computer program" according to the present invention.

The control program can also be recorded on various recording media (examples of the "recording medium" according to the present invention) such as an HD (Hard Disk), CD-ROM, DVD-ROM, or BD (Blu-ray Disc). When such a recording medium is installed, stored, or inserted in a suitable recording/reproducing device built into or attached to a computer system, and the program is read out by that computer system as appropriate, the computer system can likewise be made to function as an example of the "video/audio transmission/reception device" according to the present invention.

Next, the detailed configuration of the transmission/reception device 100 is described with reference to FIG. 6. FIG. 6 is a block diagram of the transmission/reception device 100. In FIG. 6, portions that are the same as in FIG. 5 are given the same reference numerals, and their description is omitted as appropriate.

In FIG. 6, the transmission/reception device 100 includes an input unit 110, a transmission unit 120, a reception unit 130, a reception buffer 140, an output unit 150, a video processing unit 160, and a video synthesis unit 170.

The input unit 110 is a video and audio input interface unit, and includes a video input unit 111, an audio input unit 112, and a video/audio encoding unit 113.

The video input unit 111 is a video input interface whose input terminal (reference numeral omitted) is connected to the camera 31 described above. It is configured to acquire the frame-by-frame video signal of the user 20A sequentially sent from the camera 31 (that is, an example of the "video of the own-site user" according to the present invention), and is an example of the "own-site user video/audio acquisition means" according to the present invention. The video input unit 111 is electrically connected to the video synthesis unit 170 described later, and the acquired video signal is sent to the video synthesis unit 170.

The audio input unit 112 is an audio input interface whose input terminal (reference numeral omitted) is connected to the microphone 32 described above. It is configured to acquire, via the microphone 32, the audio signal defining the voice of the user 20A (that is, an example of the "audio of the own-site user" according to the present invention), and is another example of the "own-site user video/audio acquisition means" according to the present invention.

The video/audio encoding unit 113 is a data generation unit that includes an encoder (not shown). It compression-encodes, each at a predetermined bit rate, the video signal and audio signal obtained via the video input unit 111 and the audio input unit 112 (or video and audio signals corresponding to them), multiplexes them, and generates video/audio data conforming to a predetermined standard (for example, MPEG2-TS (Transport Stream)). The video/audio encoding unit 113 also converts the generated video/audio data into data packets conforming to a predetermined transmission protocol (in this embodiment, RTP (Real-time Transport Protocol) or the like).

The transmission unit 120 is a transmission interface configured to transmit the data packets carrying the video/audio data generated by the video/audio encoding unit 113 to the transmission/reception device 100 at site B via the network 10, and is an example of the "transmission means" according to the present invention.

The reception unit 130 is a reception interface that acquires, via the network 10, the data packets carrying the video/audio data generated by the video/audio encoding unit 113 of the transmission/reception device 100 at site B, and is an example of the "other-site user video/audio acquisition means" according to the present invention.

The reception buffer 140 is a storage means having a temporary storage area. It is provided to suppress variation in the transmission delay time of the data packets, to restore the order of the data packets (their arrival times vary from packet to packet), and to absorb problems caused by packet loss. Using the RTP sequence numbers and timestamps, the reception buffer 140 is configured to send out the acquired data packets sequentially while preserving the transmission order and time intervals used by the transmission/reception device 100 at site B.
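The reordering role of such a buffer can be sketched as follows. This Python fragment is a simplified illustration under stated assumptions, not the disclosed implementation: the class name `ReceiveBuffer`, the 90 kHz clock rate, and the decision to ignore sequence-number wrap-around and packet loss handling are all assumptions made for brevity.

```python
import heapq

CLOCK_RATE_HZ = 90_000  # assumed RTP timestamp clock rate


class ReceiveBuffer:
    """Toy reordering buffer keyed on the RTP sequence number."""

    def __init__(self):
        self._heap = []      # min-heap of (seq, timestamp, payload)
        self._seen = set()   # sequence numbers already buffered

    def push(self, seq, timestamp, payload):
        """Store a packet in arrival order; duplicates are dropped."""
        if seq in self._seen:
            return
        self._seen.add(seq)
        heapq.heappush(self._heap, (seq, timestamp, payload))

    def drain(self):
        """Return (seconds after first packet, seq, payload) in sending order."""
        released = []
        first_ts = None
        while self._heap:
            seq, ts, payload = heapq.heappop(self._heap)
            if first_ts is None:
                first_ts = ts
            released.append(((ts - first_ts) / CLOCK_RATE_HZ, seq, payload))
        return released
```

The sequence number restores the sending order, while the timestamp difference gives the interval at which each packet should be handed to the decoder relative to the first one.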

The output unit 150 is a video and audio output interface unit that includes a video/audio decoding unit 151, a video output unit 152, and an audio output unit 153, and is an example of the "control means" according to the present invention.

The video/audio decoding unit 151 converts the RTP data packets into video/audio data, which is compression-encoded data in MPEG2-TS format, and then decodes and demultiplexes this video/audio data to extract a video signal (corresponding to one video frame) and an audio signal. The extracted video signal is sent to the video output unit 152 and to the video processing unit 160 described later, and the audio signal is sent to the audio output unit 153.

The video output unit 152 is a video output interface whose output terminal (reference numeral omitted) is connected to the monitor 33 described above, and is configured to supply the monitor 33 with the video signal sent from the video/audio decoding unit 151.

The audio output unit 153 is an audio output interface whose output terminal (reference numeral omitted) is connected to the speaker 34 described above, and is configured to supply the speaker 34 with the audio signal sent from the video/audio decoding unit 151.

The video processing unit 160 is an image processing device that, based on the video signal supplied from the video/audio decoding unit 151 (the video signal corresponding to one video frame of the user 20B at site B), generates a video frame for composition to be combined with one video frame of the user 20A (that is, an example of the "other-site user information" according to the present invention).

The video frame for composition in this embodiment is a video whose vertical and horizontal display sizes are reduced by a predetermined ratio relative to the original video frame, and whose display position is defined so that it appears in the lower left corner of the original video frame. The remaining area of the frame is processed, for example by making the background color transparent. The generated video frame for composition is sent to the video composition buffer 171 of the video synthesis unit 170.

The video synthesis unit 170 is an image processing device that includes a video composition buffer 171 and is configured to combine one video frame of the user 20A at site A with the video frame for composition relating to the user 20B.

The video composition buffer 171 is a storage device that temporarily stores the video frame for composition sent from the video processing unit 160.

The video synthesis unit 170 is configured to generate a composite video frame by combining the video frame for composition temporarily stored in the video composition buffer 171 with one video frame of the user 20A generated from the video signal sent from the video input unit 111. The video signal of the generated composite video frame is sent sequentially to the video/audio encoding unit 113.
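The two image operations described above (reducing the received frame and placing it in the lower left corner of a transparent frame, then combining it with the locally captured frame) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: frames are modeled as lists of pixel rows, `None` stands for a transparent pixel, the reduction uses nearest-neighbor sampling with an integer ratio, and the function names are hypothetical.

```python
def make_overlay(frame, ratio):
    """Shrink `frame` by an integer `ratio` and place the result in the
    lower left corner of a transparent frame of the original size."""
    height, width = len(frame), len(frame[0])
    small = [row[::ratio] for row in frame[::ratio]]  # nearest-neighbor
    overlay = [[None] * width for _ in range(height)]
    top = height - len(small)  # first row occupied by the reduced image
    for y, row in enumerate(small):
        overlay[top + y][:len(row)] = row
    return overlay


def composite(base, overlay):
    """Keep the base pixel wherever the overlay pixel is transparent."""
    return [[b if o is None else o for b, o in zip(base_row, over_row)]
            for base_row, over_row in zip(base, overlay)]
```

For example, with a 4x4 local frame filled with "A" and a 4x4 received frame filled with "B", `composite(local, make_overlay(remote, 2))` leaves the top two rows untouched and replaces the first two pixels of the bottom two rows with "B".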

<Operation of the Embodiment>
Hereinafter, as the operation according to the first embodiment of the present invention, the transmission control and reception control realized by the transmission/reception device 100 executing the various control programs described above are described in detail.

First, the transmission control is described in detail with reference to FIG. 7. FIG. 7 is a flowchart of the transmission control.

In FIG. 7, imaging and sound collection are executed (step S101). In this processing, the imaging of the user 20A by the camera 31 and the sound collection of the user 20A by the microphone 32 are executed in parallel while maintaining time synchronization. The resulting video signal is sent to the video synthesis unit 170 via the video input unit 111, and the audio signal is sent to the video/audio encoding unit 113 via the audio input unit 112.

Next, the video synthesis unit 170 executes the video composition processing (step S102). In this processing, as described above, one video frame of the user 20A is combined with the video frame for composition relating to the user 20B to generate a composite video frame. The video signal of the composite video frame is sent to the video/audio encoding unit 113 while preserving the time synchronization between the original video signal and the audio signal.

Next, the video/audio encoding unit 113 encodes and multiplexes the video signal of the composite video frame and the audio signal to generate the video/audio data described above (step S103). The generated video/audio data is sent to the transmission unit 120 as the data packets described above.

The transmission unit 120 transmits these data packets to the transmission/reception device 100 at site B via the network 10 (step S104). When the transmission is complete, the processing returns to step S101 and the series of steps is repeated. The transmission control is executed as described above.
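One pass through steps S101 to S104 can be summarized by the following Python sketch. It is only an outline of the control flow: the function `transmit_once`, its callable parameters, and the dictionary key `"latest"` used for the composition buffer are hypothetical stand-ins for the units described above.

```python
def transmit_once(capture, overlay_buffer, composite, encode, send):
    """Run steps S101 to S104 once and return the packet that was sent."""
    # S101: capture video and audio together so they stay time-aligned.
    video_frame, audio = capture()
    # S102: combine with the latest frame for composition, if one is stored.
    overlay = overlay_buffer.get("latest")
    if overlay is not None:
        video_frame = composite(video_frame, overlay)
    # S103: encode and multiplex video and audio into one packet.
    packet = encode(video_frame, audio)
    # S104: hand the packet to the network.
    send(packet)
    return packet
```

In an actual device this function would be called in a loop, matching the return from step S104 to step S101.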

Next, the reception control is described in detail with reference to FIG. 8. FIG. 8 is a flowchart of the reception control.

In FIG. 8, the reception unit 130 receives the data packets transmitted from the transmission/reception device 100 at site B and stores the received data packets sequentially in the reception buffer 140 (step S201).

Data packets conforming to RTP contain a timestamp that defines the time relationship of the original data, so the reception buffer 140 can rearrange data packets that arrive in arbitrary order via the network 10 into their original order. The reception buffer 140 manages the data packets it holds by timestamp and sends them sequentially to the video/audio decoding unit 151 in accordance with the timestamps (step S202).

The video/audio decoding unit 151 decodes and demultiplexes the data packets as described above, and sends the video signal to the video output unit 152 and the video processing unit 160, and the audio signal to the audio output unit 153 (step S203). After the processing by the video/audio decoding unit 151, the processing splits into two branches.

On one branch, the video processing unit 160 generates the video frame for composition (step S204), which is stored in the video composition buffer 171 (step S205). The stored video frame for composition is used in the composition processing of step S102 in the transmission control described above.

On the other branch, the video output unit 152 and the audio output unit 153 send the video signal and the audio signal obtained via the video/audio decoding unit 151 to the monitor 33 and the speaker 34, respectively. As a result, the monitor 33 displays the composite video frame produced by the composition processing at site B, and the speaker 34 outputs the voice of the user 20B corresponding to this composite video frame.

When these steps are complete, the processing returns to step S201. The reception control is executed as described above.
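The two branches that follow decoding can be outlined in Python as below. The sketch starts from a packet already released by the reception buffer (after steps S201 and S202); the function `receive_once`, its callable parameters, and the dictionary key `"latest"` are hypothetical names introduced for illustration.

```python
def receive_once(packet, decode, make_overlay, overlay_buffer, show, play):
    """Process one buffered packet from decoding onward."""
    # S203: decode and demultiplex into one video frame and its audio.
    video_frame, audio = decode(packet)
    # Branch 1 (S204, S205): build the frame for composition and store it
    # so that the transmission control can use it in step S102.
    overlay_buffer["latest"] = make_overlay(video_frame)
    # Branch 2: output the received video and audio locally.
    show(video_frame)
    play(audio)
```

The buffer written here is the one read by the transmission control, which is how the received frame comes to be sent back to the site it came from.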

<Effects of the Embodiment>
Next, the effects of the first embodiment are described with reference to FIG. 9. FIG. 9 is a diagram illustrating the video displayed at site A during execution of the reception control.

In FIG. 9, one video frame of the user 20A captured by the camera 31 at site A is denoted PICT_A, and one video frame of the user 20B captured by the camera 31 at site B is denoted PICT_B.

While the reception control described above is executed at site B, a video frame for composition PICT_A' (here, a reduced version of PICT_A) is generated based on PICT_A. The video frame for composition PICT_A' is an example of the "own-site user information" according to the present invention. While the transmission control described above is executed at site B, PICT_A' is combined with PICT_B to generate the composite video frame PICT_BA to be displayed at site A.

In the reception control at site A, this composite video frame PICT_BA is displayed on the monitor 33 as the video from site B.

Here, the composite video frame PICT_BA differs from a mere composite of the video of the user A and the video of the user B in that the video for composition PICT_A' (and likewise the video for composition PICT_B' generated at site A) carries a time element.

That is, the video frame for composition PICT_A' is based on the video frame PICT_A originally obtained at site A. If, however, the transmission/reception device 100 at site A generated PICT_A' itself and combined it with the video frame PICT_B obtained from site B, the time relationship between these frames would be shifted by the delay time, as described above with reference to FIG. 3. In that case, displaying the video frame for composition PICT_A' of the user A at site A would have very little technical significance.

In contrast, this embodiment adopts a configuration in which the video frame for composition PICT_A' is generated at site B. That is, in the course of the data packets travelling over the network 10 toward site B, the one-way delay from site A to site B is cancelled. In other words, the video frame for composition PICT_A' is a video frame synchronized with the time axis at site B.

Meanwhile, at site B, this video frame for composition PICT_A' is combined with the video frame PICT_B of the user B, whose time axis is synchronized with it, and is returned to site A as data packets. Because the composition is performed at site B, the one-way delay from site B to site A incurred during transmission over the network 10 never breaks this temporal synchronization.

As a result, at site A, the state of the user A and the user B synchronized on the time axis of site B (in other words, the state of the conversation at site B at a certain point in the past) is displayed as the composite video frame PICT_BA. That is, by means of the video frame for composition PICT_A' synchronized with the time axis of site B, the user A can clearly recognize which of his or her own utterances prompted the utterance of the user B, defined by the voice of the user B output in synchronization with the composite video frame PICT_BA (that is, the voice synchronized with the video frame PICT_B), and at what timing it was made. Naturally, the same benefit is also obtained at site B.

Therefore, according to the conversation system 1 of this embodiment, the transmission/reception devices 100 installed at site A and site B allow a plurality of users to carry on a conversation smoothly.

<Second Embodiment>
Next, a conversation system according to the second embodiment of the present invention is described. The conversation system according to the second embodiment is basically equivalent to that of the first embodiment, but the configuration of the video/audio transmission/reception device installed at each site, in particular of the transmission/reception device, differs from that of the transmission/reception device 100 of the first embodiment. More specifically, it differs from the first embodiment in that a transmission/reception device 200 is provided in place of the transmission/reception device 100.

First, the configuration of the transmission/reception device according to the second embodiment is described with reference to FIG. 10. FIG. 10 is a block diagram showing the configuration of the transmission/reception device 200. In FIG. 10, portions that are the same as in FIG. 6 are given the same reference numerals, and their description is omitted as appropriate. As in the first embodiment, the transmission/reception device 200 is assumed to be the transmission/reception device at site A.

In FIG. 10, the transmission/reception device 200 includes an input unit 110, a transmission unit 120, a reception unit 130, a reception buffer 140, an output unit 210, a delay measurement unit 220, and a synthesis processing unit 230.

The output unit 210 includes a video processing unit 211 and an audio processing unit 212 in addition to the video/audio decoding unit 151, the video output unit 152, and the audio output unit 153.

The video processing unit 211 is an image processing device that, under the control of the utterance data processing unit 233 described later, combines an utterance state video frame described later (that is, an example of the "own-site user information" according to the present invention) at a predetermined timing with one video frame of the user B based on the video signal sent from the video/audio decoding unit 151. The video signal of the video frame into which the utterance state video frame has been combined is sent to the video output unit 152.

The audio processing unit 212 is an audio processing device that, under the control of the utterance data processing unit 233 described later, combines an utterance end sound described later (that is, another example of the "own-site user information" according to the present invention) at a predetermined timing with the voice of the user B based on the audio signal sent from the video/audio decoding unit 151. The audio signal of the voice into which the utterance operation sound has been combined is sent to the audio output unit 153.

The delay measurement unit 220 includes a device delay information communication unit 221, a network delay measurement unit 222, and a total delay amount calculation unit 223.

The device delay information communication unit 221 is configured to measure the processing delay time of the transmission/reception device 200 (mainly the sum of the delay time required for input processing in the input unit 110 and the delay time required for output processing in the output unit 210), to transmit it as processing delay time information to the transmission/reception device 200 at site B, and to acquire similar processing delay time information from the transmission/reception device 200 at site B. This processing delay time information may be exchanged frequently at a fixed period, or at a specific timing (for example, at initial startup of the conversation system).

The network delay measurement unit 222 is a unit that measures the transmission delay time of the network 10 (here, the round-trip time, RTT). The network delay measurement unit 222 is configured to measure the RTT using the SR (Sender Report) and RR (Receiver Report) of RTCP (Real-time Transport Control Protocol).
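The specification does not spell out the calculation. As a reference point, the RTP specification (RFC 3550) derives the round-trip time at the sender as the arrival time of a Receiver Report minus the LSR (last SR timestamp) and DLSR (delay since last SR) fields carried in that report, all expressed in units of 1/65536 second. The Python sketch below follows that formula; the function name is hypothetical.

```python
def rtt_seconds(arrival, lsr, dlsr):
    """Round-trip time from RTCP Receiver Report fields.

    arrival: time the RR was received, as the middle 32 bits of an
             NTP timestamp (units of 1/65536 s)
    lsr:     "last SR timestamp" field of the RR (same units)
    dlsr:    "delay since last SR" field of the RR (same units)
    """
    rtt_units = (arrival - lsr - dlsr) & 0xFFFFFFFF  # 32-bit wrap-around
    return rtt_units / 65536.0


# Worked example from RFC 3550: A = 0xb710:8000, LSR = 0xb705:2000,
# DLSR = 0x0005:4000.
print(rtt_seconds(0xB7108000, 0xB7052000, 0x00054000))  # prints: 6.125
```

Subtracting DLSR removes the time the receiver held the report before replying, so only the network round trip remains.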

The total delay amount calculation unit 223 is an arithmetic processing device that adds the processing delay time measured by the device delay information communication unit 221 and the transmission delay time measured by the network delay measurement unit 222 to calculate the total delay time required for data transmission between the sites (an example of the "delay time" according to the present invention).

The synthesis processing unit 230 includes an utterance determination unit 231, an utterance data holding unit 232, and an utterance data processing unit 233.

発話判定部２３１は、音声入力部１１２から音声信号を取得し、ユーザＡの音声レベル（音圧レベル）を検出すると共に、ユーザＡが発話状態にあるか否かを判定可能に構成されている。 The utterance determination unit 231 is configured to acquire an audio signal from the audio input unit 112, detect the user A's audio level (sound pressure level), and determine whether the user A is in an utterance state. .

The utterance data holding unit 232 is a unit that, based on the determination result of the utterance determination unit 231, generates and temporarily stores utterance data representing user A's utterance state. The utterance data is described here with reference to FIG. 11, which is a conceptual diagram of the utterance data held by the utterance data holding unit 232.

In FIG. 11, the utterance data consists of information on user A's utterance start time and information on the utterance end time. The utterance data holding unit 232 holds this utterance data by storing it temporarily. Utterance data older than the total delay time calculated by the total delay amount calculation unit 223 is discarded from the utterance data holding unit 232 in order.
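
As an illustration of the utterance data and its discard rule, the following minimal sketch models one record as a start time plus an optional end time and drops records that ended earlier than "now minus total delay". The names `Utterance` and `prune`, and the choice to keep a record whose end time equals the cut-off exactly, are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Utterance:
    start: float                 # utterance start time (seconds)
    end: Optional[float] = None  # None while the user is still speaking


def prune(utterances: List[Utterance], now: float,
          total_delay: float) -> List[Utterance]:
    """Keep only records that can still matter at the synchronization time."""
    horizon = now - total_delay  # anything that ended before this is stale
    return [u for u in utterances if u.end is None or u.end >= horizon]
```

Because only two timestamps are kept per utterance, the held data stays small regardless of how long the delay is.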

Returning to FIG. 10, the utterance data processing unit 233 is a control unit with three functions: an image processing function that generates, based on the utterance data held in the utterance data holding unit 232, a predetermined utterance state video frame containing various utterance marks mk associated with user A's utterance state; a function that sends the generated utterance state video frame to the video processing unit 211 at the appropriate timing; and a function that controls the operation of the video processing unit 211. The utterance data processing unit 233 is an example of each of the "own site user information generating means" and the "selection means" according to the present invention.

ここで、第２実施例に係る送受信装置２００では、ＣＰＵがＲＯＭに格納された制御プログラムに従って、発話データ生成処理と映像音声出力処理とを実行する構成となっている。 Here, in the transmission / reception apparatus 200 according to the second embodiment, the CPU executes a speech data generation process and a video / audio output process in accordance with a control program stored in the ROM.

First, the utterance data generation process is described with reference to FIG. 12, which is a flowchart of the utterance data generation process.

図１２において、発話判定部２３１は、検出される音声レベルが、予め発話状態にあるか否かを判別可能に設定された閾値以上であるか否かを判別する（ステップＳ３０１）。音声レベルが閾値未満である場合（ステップＳ３０１：ＮＯ）、ステップＳ３０１が繰り返し実行され、処理は実質的に待機状態となる。 In FIG. 12, the utterance determination unit 231 determines whether or not the detected voice level is equal to or higher than a threshold set in advance so as to determine whether or not the utterance state is present (step S301). When the sound level is less than the threshold (step S301: NO), step S301 is repeatedly executed, and the process is substantially in a standby state.

When the voice level is equal to or higher than the threshold (step S301: YES), the utterance data holding unit 232 records the utterance start time (that is, the time at which the voice level exceeded the threshold) (step S302), and it is then determined whether the voice level is below the threshold (step S303). While the voice level remains at or above the threshold (step S303: NO), the process waits. When the voice level falls below the threshold (step S303: YES), the utterance data holding unit 232 records the utterance end time, which completes one item of utterance data (step S304). Once one item of utterance data is complete, the process returns to step S301. The utterance data generation process is executed as described above.
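
The loop of steps S301 to S304 amounts to a two-state threshold detector. The sketch below is one possible reading under simplifying assumptions: the voice level arrives as time-ordered `(time, level)` samples, and the function name `generate_utterances` is hypothetical. An utterance still in progress at the end of the input is reported with an end time of `None`.

```python
from typing import Iterable, List, Optional, Tuple

Record = Tuple[float, Optional[float]]  # (start time, end time or None)


def generate_utterances(samples: Iterable[Tuple[float, float]],
                        threshold: float) -> List[Record]:
    """Turn (time, level) samples into utterance start/end records."""
    records: List[Record] = []
    start: Optional[float] = None        # None means "waiting in S301"
    for t, level in samples:
        if start is None:
            if level >= threshold:       # S301: YES -> S302 record start
                start = t
        elif level < threshold:          # S303: YES -> S304 record end
            records.append((start, t))
            start = None                 # back to S301
    if start is not None:                # still speaking at end of input
        records.append((start, None))
    return records
```

A practical detector would probably add hysteresis or a minimum duration so that short pauses do not split one utterance; the text does not describe such handling, so it is omitted here.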

Next, the video/audio output process is described with reference to FIG. 13, which is a flowchart of the video/audio output process.

First, the utterance data processing unit 233 checks, via the utterance data holding unit 232, user A's voice state at the "synchronization time" (step S401). The synchronization time is the time that is synchronized, on the time axis of site B, with the time at which the video/audio decoding unit 151 decoded one item of video/audio data (hereinafter the "decoding time"). It is earlier than the decoding time by the total delay time calculated by the total delay amount calculation unit 223.

Based on this check, the utterance data processing unit 233 determines whether user A was in an utterance state (step S402). If user A was in an utterance state (step S402: YES), the utterance data processing unit 233 further determines whether the utterance has been completed (step S403). Here, "the utterance has been completed" means that the utterance is no longer continuing at the present time; specifically, it means that an end time has been written to the corresponding utterance data in the utterance data holding unit 232.

If the utterance is still continuing (step S403: NO), the utterance data processing unit 233 generates an utterance state video frame and, via the video processing unit 211, composites it onto the video frame corresponding to the video signal from the video/audio decoding unit 151 (step S409). Because the utterance is still in progress, the remaining utterance time described later is not displayed.

一方、発話が既に完了している場合（ステップＳ４０３：ＹＥＳ）、発話データ処理部２３３は、発話の残り時間を算出する（ステップＳ４０４）。発話の残り時間とは、上述した同期時刻と、発話終了時刻（発話データによって規定されている）との差分である。 On the other hand, when the utterance has already been completed (step S403: YES), the utterance data processing unit 233 calculates the remaining time of the utterance (step S404). The remaining time of the utterance is a difference between the above-described synchronization time and the utterance end time (specified by the utterance data).

発話データ処理部２３３は、算出された残り時間が予め設定された閾値（例えば、５秒）未満であるか否かを判別する（ステップＳ４０５）。残り時間が閾値以上である場合（ステップＳ４０５：ＮＯ）、即ち、発話終了まで十分な時間がある場合、処理は先述のステップＳ４０９に移行される。 The utterance data processing unit 233 determines whether or not the calculated remaining time is less than a preset threshold value (for example, 5 seconds) (step S405). If the remaining time is equal to or greater than the threshold (step S405: NO), that is, if there is sufficient time until the end of the utterance, the process proceeds to step S409 described above.

On the other hand, if the remaining time is less than the threshold (step S405: YES), that is, if the utterance will end shortly, the utterance data processing unit 233 generates an utterance state video frame containing a remaining time display mark and, via the video processing unit 211, composites it onto the video frame sent from the video/audio decoding unit 151 (step S406). The remaining time display mark is a mark (including icons, indicators, and the like) that visually represents the remaining time of the utterance.

If, in step S402, user A was not in an utterance state at the synchronization time (step S402: NO), the utterance data processing unit 233 determines whether the synchronization time coincides with an utterance end time, that is, whether this is exactly the time at which an utterance ends (step S407). If the two do not coincide (step S407: NO), the utterance data processing unit 233 controls the video processing unit 211 so that the video frame sent from the video/audio decoding unit 151 is passed unchanged to the video output unit 152.

If the synchronization time and the utterance end time coincide, the utterance data processing unit 233 generates an audio signal for an utterance end sound and, via the video processing unit 161, mixes it into the audio signal sent from the video/audio decoding unit 151 (step S408). The utterance end sound is a notification sound that tells user A that his or her own past utterance, synchronized with the time axis of site B, has ended; its form is not limited in any way.

ステップＳ４０８が実行されるか、ステップＳ４０７が「ＮＯ」側に分岐するか、或いは発話状態映像フレームが合成された映像フレームが表示される場合、処理はステップＳ４０１に戻される。映像音声出力処理は以上のように実行される。 If step S408 is executed, step S407 branches to the “NO” side, or if a video frame in which the speech state video frame is synthesized is displayed, the process returns to step S401. The video / audio output processing is executed as described above.
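
The branching of steps S401 to S409 can be summarized as a single decision function. The following is a minimal sketch under stated assumptions: utterance data is a list of `(start, end)` pairs with `end` set to `None` while speaking, the returned action strings are invented labels, and "synchronization time coincides with the end time" is approximated by a one-frame window (`FRAME_PERIOD`), since exact equality of timestamps is unlikely in practice. The 5-second threshold is the example value from the text.

```python
from typing import List, Optional, Tuple

Utterance = Tuple[float, Optional[float]]  # (start, end); end None = speaking

REMAINING_THRESHOLD = 5.0   # seconds; example value used for step S405
FRAME_PERIOD = 1.0 / 30.0   # assumed window for "sync time equals end time"


def decide_output(decode_time: float, total_delay: float,
                  utterances: List[Utterance]) -> Tuple[str, Optional[float]]:
    """Return (action, remaining_seconds) for one decoded frame."""
    sync = decode_time - total_delay                        # S401
    for start, end in utterances:
        if start <= sync and (end is None or sync < end):   # S402: YES
            if end is None:                                 # S403: NO
                return ("mark", None)                       # S409
            remaining = end - sync                          # S404
            if remaining < REMAINING_THRESHOLD:             # S405: YES
                return ("mark_with_remaining", remaining)   # S406
            return ("mark", None)                           # S405: NO -> S409
    for _, end in utterances:                               # S402: NO -> S407
        if end is not None and end <= sync < end + FRAME_PERIOD:
            return ("end_sound", None)                      # S408
    return ("passthrough", None)                            # S407: NO
```

Calling this once per decoded frame mirrors the return to step S401 at the end of each pass through the flowchart.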

Next, the utterance state video frame is described. When the appropriate conditions are satisfied, it is generated by the utterance data processing unit 233 and composited as appropriate onto the video frame at that point in time (the synchronization time). FIG. 14 illustrates various forms of the utterance state video frame. In FIG. 14, parts that also appear in FIG. 9 carry the same reference signs, and their description is omitted as appropriate.

図１４において、発話状態映像フレームは、拠点Ａにおいて表示されるユーザＢの映像フレームＰＩＣＴ＿ＢＡに合成される合成用表示フレームであり、フレーム左端部分に各種態様を伴う発話マークｍｋが配置された映像フレームである。 In FIG. 14, the utterance state video frame is a composite display frame synthesized with the video frame PICT_BA of the user B displayed at the site A, and the video frame in which the utterance mark mk with various aspects is arranged at the left end portion of the frame. It is.

図１４（ａ）には、発話マークｍｋとして残時間表示マークｍｋ１を含む発話状態映像フレームが合成された状態が例示される。 FIG. 14A illustrates a state in which an utterance state video frame including the remaining time display mark mk1 as the utterance mark mk is synthesized.

The remaining time display mark mk1 illustrated in FIG. 14(a) is a kind of indicator in which one display unit mark is assigned to each unit time (here, 1 second as an example). FIG. 14(a) shows that the remaining time of user A's utterance is five display unit marks, that is, five unit times (5 seconds), which corresponds to the threshold described above (here, 5 seconds as an example). From this point on, the number of display unit marks decreases by one each time a unit time elapses, and the marks disappear when user A's utterance ends (that is, a state corresponding to step S409 in FIG. 13).
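
The mapping from remaining time to display unit marks is simple arithmetic. The sketch below assumes rounding up, so that a partially elapsed unit still shows its mark, and caps the count at five to match the example threshold; both choices, and the name `unit_marks`, are assumptions for illustration only.

```python
import math


def unit_marks(remaining: float, unit: float = 1.0, max_marks: int = 5) -> int:
    """Number of display unit marks to draw for the remaining time."""
    if remaining <= 0:
        return 0                      # utterance has ended: nothing to draw
    # Round up so a partially elapsed unit still shows one mark.
    return min(max_marks, math.ceil(remaining / unit))
```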

図１４（ｂ）には、発話マークｍｋとして発話表示マークｍｋ２を含む発話状態映像フレームが合成された状態が例示される。図１４（ｂ）に例示される発話表示マークｍｋ２は、ユーザＡが発話状態にあるか否かを表すアイコンであり、ユーザＡが発話中である場合に表示され、発話中でなければ表示されない構成となっている。或いは、発話表示マークｍｋ２は、よりユーザＡの注意を喚起し得るように、訴求性の高い点滅表示がなされてもよい。 FIG. 14B illustrates a state where an utterance state video frame including the utterance display mark mk2 as the utterance mark mk is synthesized. The utterance display mark mk2 illustrated in FIG. 14B is an icon indicating whether or not the user A is in an utterance state, and is displayed when the user A is speaking, and is not displayed unless the user A is speaking. It has a configuration. Alternatively, the utterance display mark mk2 may be displayed in a highly appealing blinking manner so that the user A's attention can be drawn more.

図１４（ｃ）には、発話マークｍｋとして音声レベルマークｍｋ３を含む発話状態映像フレームが合成された状態が例示される。図１４（ｃ）に例示される音声レベルマークｍｋ３は、ユーザＡの音声レベル（音圧レベル）を複数段階に区分して表すインジケータであり、図中では、ユーザＡの音声レベルが、レベル３（ハッチング表示参照）である旨が示される。音声レベルマークｍｋ３は、ユーザＡの音声レベルと共にリアルタイムに変化する発話マークである。 FIG. 14C illustrates a state in which an utterance state video frame including the voice level mark mk3 as the utterance mark mk is synthesized. The voice level mark mk3 illustrated in FIG. 14C is an indicator that represents the voice level (sound pressure level) of the user A in a plurality of stages, and in the figure, the voice level of the user A is level 3. (See hatching display). The voice level mark mk3 is an utterance mark that changes in real time with the voice level of the user A.

尚、音声レベル（音圧レベル）を視覚表示するには、当然ながら音声レベルの記憶（音声レベルそのものの記憶でも、複数段階に区分された後の音声レベルの記憶でもよい）が必要となる。 In order to visually display the sound level (sound pressure level), naturally, it is necessary to store the sound level (either the sound level itself or the sound level after being divided into a plurality of stages).

従って、図１４（ｃ）に例示される音声レベルマークｍｋ３の表示を実現する場合、図１０に例示される装置構成に加えて、この種の記憶を可能とする記憶装置が設けられるのが好適である。 Therefore, when realizing the display of the audio level mark mk3 illustrated in FIG. 14C, it is preferable to provide a storage device that enables this type of storage in addition to the device configuration illustrated in FIG. It is.

或いは、図１０に例示される装置構成において、発話データ保持部２３２が、図１１に例示される発話データに加えて、又は発話データの一部として、音声レベルに相当する音声レベルデータを保持してもよい。 Alternatively, in the apparatus configuration illustrated in FIG. 10, the utterance data holding unit 232 holds voice level data corresponding to the voice level in addition to the utterance data illustrated in FIG. 11 or as part of the utterance data. May be.

この場合、例えば、発話判定部２３１が、発話判定を遂行するにあたって発話に相当する音声信号である旨の判定を行った場合に、発話開始時刻と発話終了時刻との間の一発話期間（一発話データに対応する期間）において、所定周期で音声レベル値を発話データ保持部２３２に記憶させてもよい。この際、記憶された音声レベル値を音声レベルマークｍｋ３に相当する複数段階に区分する作業は、例えば、発話データ処理部２３３が行ってもよい。 In this case, for example, when the utterance determination unit 231 determines that the speech signal corresponds to the utterance when performing the utterance determination, the utterance period (one utterance period between the utterance start time and the utterance end time (one In the period corresponding to the utterance data), the voice level value may be stored in the utterance data holding unit 232 in a predetermined cycle. At this time, for example, the speech data processing unit 233 may perform the work of dividing the stored voice level value into a plurality of stages corresponding to the voice level mark mk3.
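
Dividing stored voice level values into the stages shown by mk3 can be sketched as a lookup against sorted boundaries. The boundary values in decibels and the name `quantize_level` are hypothetical; the text does not specify the number of stages or their limits.

```python
from bisect import bisect_right
from typing import List, Sequence

# Assumed stage boundaries in dB, sorted ascending (illustrative values only).
BOUNDARIES: Sequence[float] = (-40.0, -30.0, -20.0, -10.0)


def quantize_level(level_db: float,
                   boundaries: Sequence[float] = BOUNDARIES) -> int:
    """Stage 0..len(boundaries): count of boundaries at or below the level."""
    return bisect_right(boundaries, level_db)


def quantize_all(levels_db: Sequence[float]) -> List[int]:
    """Quantize every stored level value of one utterance period."""
    return [quantize_level(v) for v in levels_db]
```

Storing the already quantized stage instead of the raw level would reduce the held data further, which matches the option mentioned earlier of storing levels after they have been divided into stages.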

A single utterance state video frame may contain several utterance marks, including the marks described above. For example, when the utterance marks mk1 and mk2 are both included, both may be displayed while there is still a relatively ample amount of time remaining until the end of the utterance, and only the utterance mark mk2 may be displayed just before the utterance ends.

Thus, according to the conversation system of the second embodiment, an utterance state video frame is composited as appropriate onto the video frame of user 20B of site B displayed at site A. This utterance state video frame contains an utterance mark mk that defines user A's utterance state synchronized with that video frame on the time axis of site B. Consequently, user 20A at site A can visually recognize which of his or her own utterances user 20B at site B responded to, and at what timing. The same effect is obtained at site B. Conversation on both sides can therefore proceed smoothly.

尚、第２実施例に係る会話システムによれば、第１実施例とは異なり、発話データ処理部２３３が、遅延測定部２２０により測定された総遅延時間に基づいて、自拠点の発話データを管理することによって、拠点Ｂの時間軸に同期したタイミングでユーザ２０Ａに関する映像を表示させる旨の措置を、全て自拠点内で行うことが可能となっている。従って、例えば、会話システム内の送受信装置のうち、一の拠点に設置された一の送受信装置のみが本実施例に係る構成を有しているとしても、少なくとも当該一の拠点におけるユーザは、会話を快適に進行させることができる。このような効果は、例えば、ユーザ相互間で発言権や発言量に偏りがある場合等において、顕著に奏される。 In the conversation system according to the second embodiment, unlike the first embodiment, the utterance data processing unit 233 uses the total delay time measured by the delay measurement unit 220 to convert the utterance data of its own base. By managing, it is possible to perform all measures within the own site to display the video related to the user 20A at the timing synchronized with the time axis of the site B. Therefore, for example, even if only one transmission / reception device installed in one base among the transmission / reception devices in the conversation system has the configuration according to the present embodiment, at least the user in the one base Can proceed comfortably. Such an effect is remarkably exerted, for example, when the right to speak and the amount of speech are uneven among users.

Further, the conversation system according to the second embodiment focuses on the audio that actually carries the conversation: the utterance determination is based on audio, and the displayed image (the utterance mark mk) is also based on audio. Users at each site can therefore reliably compare their own utterances with those of the other party, which can further improve the smoothness of the conversation. In other words, when an image based on the user's own video (for example, the synthesis video frame according to the first embodiment) is merely composited onto the other party's video, synchronization on the time axis is ensured, but it can still be difficult for users to confirm what they themselves were saying.

尚、第２実施例に係る送受信装置２００では、合成処理部２３０によって発話データを管理する必要があり、また遅延測定部２２０によりデータ伝送に要する遅延時間を測定する必要があるため、第１実施例に係る送受信装置１００と較べると、装置の処理負荷が大きくなり易い。 In the transmission / reception apparatus 200 according to the second embodiment, it is necessary to manage speech data by the synthesis processing unit 230, and it is necessary to measure the delay time required for data transmission by the delay measurement unit 220. Compared with the transmission / reception apparatus 100 according to the example, the processing load of the apparatus tends to increase.

For this reason, if a synthesis video frame based on the video of the own site user were used as in the first embodiment, a storage area would be required to hold relatively large amounts of data for at least the time corresponding to the delay time: the video signal obtained by the video input unit 111, video generated from that video signal, or a synthesis video frame generated from that video. In view of this, it is highly effective to use, as in the second embodiment, a video frame containing an audio-related image (here, the utterance mark mk), which, although it is video, requires a relatively small amount of data.

<Third embodiment>

The technical features of the first embodiment and the second embodiment are not mutually exclusive and can also be made to work together. A third embodiment of the present invention that does so is described below.

The conversation system according to the third embodiment is basically the same as the conversation system according to the second embodiment, but the configuration of the video/audio transmission/reception apparatus installed at each site, in particular the transmission/reception device, differs from the transmission/reception device 200 of the second embodiment. More specifically, it differs from the second embodiment in that a transmission/reception device 300 is provided in place of the transmission/reception device 200.

First, the configuration of the transmission/reception device according to the third embodiment is described with reference to FIG. 15, which is a block diagram showing the configuration of the transmission/reception device 300. In FIG. 15, parts that also appear in FIG. 10 carry the same reference signs, and their description is omitted as appropriate. As in the second embodiment, the transmission/reception device 300 is assumed to be the device at site A.

図１５において、第３実施例に係る送受信装置３００は、入力部１１０に替えて入力部３１０を備える点において、第２実施例に係る送受信装置２００と異なっている。入力部３１０は、映像入力部１１１と映像音声符号化部１１３との間の信号経路に、音声レベル表示合成部３１１を備える点において、入力部１１０と異なっている。 In FIG. 15, the transmission / reception device 300 according to the third embodiment is different from the transmission / reception device 200 according to the second embodiment in that an input unit 310 is provided instead of the input unit 110. The input unit 310 is different from the input unit 110 in that an audio level display synthesis unit 311 is provided in a signal path between the video input unit 111 and the video / audio encoding unit 113.

The voice level display synthesis unit 311 takes in the audio signal of user B sent from the video/audio decoding unit 151, detects its voice level (sound pressure level) (this function may be equivalent to that of the utterance determination unit 231), and generates a synthesis video frame representing the voice level (for example, a video frame corresponding to FIG. 14(c)). It then composites the generated synthesis video frame onto the video frame of user 20A (that is, it functions as a further example of each of the synthesis video generating means and the synthesized video generating means). The video signal of this composited video frame is converted into video/audio data by the video/audio encoding unit 113 and transmitted to site B via the transmission unit 120.

また、この音声レベルに関する映像フレームの合成処理は、拠点Ｂにおいても同様に実行される。即ち、ユーザ２０Ａの音声は、拠点Ｂにおける映像音声復号化部１５１の下流から取り出され、その音声レベルに関する映像フレームが、ユーザＢの映像フレームと合成されて映像音声データが生成される。この生成された映像音声データに係るデータパケットは、拠点Ａに送信される。 In addition, the video frame synthesis processing related to the audio level is executed in the same way at the site B as well. That is, the audio of the user 20A is extracted from the downstream of the video / audio decoding unit 151 at the base B, and the video frame relating to the audio level is synthesized with the video frame of the user B to generate video / audio data. The data packet related to the generated video / audio data is transmitted to the site A.

即ち、第３実施例によれば、音声レベルに関する映像については、第１実施例と同様に、他拠点側の時間軸との時間同期が他拠点側で行われる。従って、拠点Ａにおける映像音声復号化部１５１から送出される映像信号に対応する映像フレームには、既に拠点Ｂの時間軸に同期したユーザ２０Ａの音声レベルに関する映像フレームが合成されている。 That is, according to the third embodiment, for the video related to the audio level, time synchronization with the time base on the other base side is performed on the other base side, as in the first embodiment. Therefore, the video frame corresponding to the video signal transmitted from the video / audio decoding unit 151 at the site A has already been synthesized with the video frame related to the audio level of the user 20A synchronized with the time axis of the site B.

一方、発話データ処理部２３３は、残時間表示マークｍｋ１或いは発話表示マークｍｋ２等といった、比較的処理負荷の軽い映像に関しては、自拠点側で合成する。即ち、音声レベルに関する合成用映像フレームが合成された合成映像フレームに、更に残時間表示マークｍｋ１或いは発話表示マークｍｋ２等を含む発話状態映像フレームが、拠点Ｂの時間軸と同期して合成され、最終的にモニタ３３から一の映像フレームとして出力される。補足すると、この一の映像フレームに表された音声レベルと、残時間と発話の有無とは、当然ながら、拠点Ａにおいて過去のある同一時刻におけるユーザＡの発話動作を規定する。 On the other hand, the utterance data processing unit 233 synthesizes a video with a relatively light processing load such as the remaining time display mark mk1 or the utterance display mark mk2 on the own site side. That is, an utterance state video frame further including a remaining time display mark mk1 or an utterance display mark mk2 is synthesized in synchronism with the time axis of the base B on the synthesized video frame obtained by synthesizing the synthesis video frame related to the audio level. Finally, it is output from the monitor 33 as one video frame. Supplementally, the sound level represented in the one video frame, the remaining time, and the presence or absence of utterance naturally define the utterance operation of the user A at the same time in the past at the base A.

第３実施例によれば、音声レベルマークｍｋ３等、リアルタイムに変化し得る映像に関しては、上述した処理負荷や記憶容量の観点から、第１実施例と同様、敢えてネットワーク１０上を伝送させることにより、その時間管理に要する処理負荷を軽減し、且つ合成処理部２３０のデータ容量を低減させることができる。従って、双方の利点を生かしたより高い会話品質が提供され得る。尚、ここでは、音声レベルに関する映像が使用されたが、第１実施例と同様に、ユーザの映像から派生する合成用映像フレームが使用されてもよい。 According to the third embodiment, for the video that can change in real time, such as the audio level mark mk3, from the viewpoint of the processing load and the storage capacity described above, as in the first embodiment, it is intentionally transmitted on the network 10. The processing load required for the time management can be reduced, and the data capacity of the composition processing unit 230 can be reduced. Therefore, higher conversation quality can be provided by taking advantage of both. In this example, the video related to the audio level is used. However, as in the first embodiment, a composite video frame derived from the user's video may be used.

The transmission/reception device 300 can execute transmission control in accordance with a control program stored in the ROM, as in the first embodiment. The transmission control according to the third embodiment is described here with reference to FIG. 16, which is a flowchart of that transmission control. In FIG. 16, parts that also appear in FIG. 7 carry the same reference signs, and their description is omitted as appropriate.

図１６において、ステップＳ１０１が実行されると、映像合成処理が実行される（ステップＳ５０１）。ステップＳ５０１に係る映像合成処理では、上述したように、入力部３１０内において、音声レベルに関する合成用映像フレームの合成処理が実行される。尚、第１実施例と同様にユーザの映像に基づいた合成用映像フレームが生成される場合には、更にこのユーザの映像から派生する合成用映像フレームが、例えば映像合成部１７０等の作用により実行されてもよい。 In FIG. 16, when step S101 is executed, a video composition process is executed (step S501). In the video synthesis process according to step S501, as described above, the synthesis video frame synthesis process related to the audio level is executed in the input unit 310. When a video frame for synthesis based on the user's video is generated as in the first embodiment, a video frame for synthesis derived from the user's video is further generated by, for example, the video synthesizing unit 170 or the like. May be executed.

映像合成処理の実行後は、第１実施例と同様に、映像音声符号化部１１３による映像音声データ及びデータパケットの生成がなされ（ステップＳ１０３）、送信部１２０を介した送信が実行される（ステップＳ１０４）。第３実施例に係る送信制御は以上のように実行される。 After execution of the video composition processing, the video / audio encoding unit 113 generates video / audio data and data packets as in the first embodiment (step S103), and transmission via the transmission unit 120 is executed (step S103). Step S104). The transmission control according to the third embodiment is executed as described above.

本発明は、上述した実施例に限られるものではなく、請求の範囲及び明細書全体から読み取れる発明の要旨或いは思想に反しない範囲で適宜変更可能であり、そのような変更を伴う映像音声送受信装置、映像音声送受信システム、コンピュータプログラム及び記録媒体もまた本発明の技術的範囲に含まれるものである。 The present invention is not limited to the above-described embodiments, and can be appropriately changed without departing from the spirit or concept of the invention that can be read from the claims and the entire specification. The audio / video transmission / reception system, the computer program, and the recording medium are also included in the technical scope of the present invention.

本発明は、遠隔地のユーザ同士で会話を成立させる装置或いはシステムに適用可能である。 The present invention is applicable to an apparatus or system that establishes a conversation between users at remote locations.

DESCRIPTION OF SYMBOLS: 1 ... conversation system, 10 ... network, 20A, 20B ... user, 30A, 30B ... video/audio transmission/reception apparatus, 100 ... transmission/reception device, 110 ... input unit, 120 ... transmission unit, 130 ... reception unit, 140 ... reception buffer, 150 ... output unit, 160 ... video processing unit, 170 ... video synthesis unit, 171 ... video synthesis buffer, 200 ... transmission/reception device, 210 ... output unit, 220 ... delay measurement unit, 230 ... synthesis processing unit, 233 ... utterance data processing unit, 300 ... transmission/reception device, 310 ... input unit, 311 ... voice level display synthesis unit.

Claims

A video/audio transmission/reception device that is used by a user at each of a plurality of sites and accommodated in a network, in a video/audio transmission/reception system including a plurality of such devices, the device comprising:
own site user video and audio acquisition means for acquiring video and audio of an own site user;
transmission means for transmitting the acquired video and audio of the own site user via the network to the video/audio transmission/reception device at another site;
other site user video and audio acquisition means for acquiring video and audio of an other site user via the network; and
control means for controlling predetermined video output means and audio output means so that the acquired video and audio of the other site user are output together with the information, among own site user information generated based on at least one of the acquired video and audio of the own site user, that is synchronized with the acquired video and audio of the other site user on the time axis of the other site.

2. The video/audio transmission/reception device according to claim 1, wherein
the own site user information is generated by the video/audio transmission/reception device at the other site based on at least one of the transmitted video and audio of the own site user,
the own site user information generated by the video/audio transmission/reception device at the other site is transmitted via the network together with the video and audio of the other site user acquired at the other site,
the other site user video and audio acquisition means acquires the transmitted video and audio of the other site user and the transmitted own site user information, and
the control means controls the video output means and the audio output means so that the acquired video and audio of the other site user and the acquired own site user information are output.

3. The video/audio transmission/reception device according to claim 2, further comprising other site user information generating means for generating other site user information based on at least one of the acquired video and audio of the other site user,
wherein the transmission means transmits the generated other site user information together with the acquired video and audio of the own site user.

4. The video/audio transmission/reception device according to any one of the above, further comprising:
specifying means for specifying a delay time required for data transmission via the network to and from the video/audio transmission/reception device at the other site;
own site user information generating means for generating the own site user information based on at least one of the acquired video and audio of the own site user; and
selection means for selecting, based on the specified delay time, the information that, among the own site user information generated by the own site user information generating means, is synchronized with the acquired video and audio of the other site user on the time axis of the other site,
wherein the control means controls the video output means and the audio output means so that the acquired video and audio of the other site user and the selected information are output.
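
The selection step recited above can be illustrated with a minimal sketch. It is not the claimed implementation: the names `OwnSiteInfo` and `DelayedSelector`, the millisecond timestamps, and the 5-second retention window are all hypothetical. The sketch also assumes that the delay value supplied by the specifying means is a round-trip figure, on the reasoning that the other site's video arriving now was captured while that site was viewing own site data sent one round trip earlier; the claim itself only speaks of "a delay time".

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OwnSiteInfo:
    """Hypothetical own site user information (e.g. a reduced self-view frame)."""
    captured_ms: int  # capture time on the own site clock, in milliseconds
    payload: str      # stand-in for the actual frame or audio-level data


class DelayedSelector:
    """Buffers own site user information and selects the delayed entry."""

    def __init__(self, max_age_ms: int = 5000) -> None:
        self._items: List[OwnSiteInfo] = []
        self._max_age_ms = max_age_ms  # assumed retention window

    def push(self, info: OwnSiteInfo) -> None:
        # Entries are assumed to arrive in capture order.
        self._items.append(info)
        cutoff = info.captured_ms - self._max_age_ms
        while self._items and self._items[0].captured_ms < cutoff:
            self._items.pop(0)

    def select(self, now_ms: int, delay_ms: int) -> Optional[OwnSiteInfo]:
        # Pick the newest entry captured at or before (now - delay).
        target = now_ms - delay_ms
        times = [item.captured_ms for item in self._items]
        idx = bisect_right(times, target) - 1
        return self._items[idx] if idx >= 0 else None


selector = DelayedSelector()
for t in range(0, 1000, 100):
    selector.push(OwnSiteInfo(t, f"frame@{t}"))

# With an assumed 300 ms delay, the entry captured at 700 ms is selected at 1000 ms.
chosen = selector.select(now_ms=1000, delay_ms=300)
print(chosen.payload)  # prints "frame@700"
```

When no buffered entry is old enough, `select` returns `None`, in which case a caller might simply output the other site user's video and audio alone until the buffer fills.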

5. The video/audio transmission/reception device according to claim 1, wherein the own site user information is a video corresponding to the acquired video of the own site user.

6. The video/audio transmission/reception device according to claim 5, wherein the video corresponding to the acquired video of the own site user is the acquired video of the own site user with a reduced display area.

7. The video/audio transmission/reception device according to claim 1, wherein the own site user information is a video corresponding to the acquired audio of the own site user.

8. The video/audio transmission/reception device according to claim 7, wherein the video corresponding to the acquired audio of the own site user is a video representing the utterance state of the own site user.

9. The video/audio transmission/reception device according to claim 7 or 8, wherein the video corresponding to the acquired audio of the own site user is a video representing the remaining time until the utterance of the own site user is completed.

10. The video/audio transmission/reception device according to any one of claims 1 to 10, wherein the own site user information is a notification sound notifying that the utterance of the own site user has been completed.

11. A video/audio transmission/reception system comprising a plurality of video/audio transmission/reception devices accommodated in a network and each used by a user at one of a plurality of sites,
wherein each of the plurality of video/audio transmission/reception devices includes:
own site user video and audio acquisition means for acquiring video and audio of an own site user;
transmission means for transmitting the acquired video and audio of the own site user via the network to the video/audio transmission/reception device at another site;
other site user video and audio acquisition means for acquiring video and audio of an other site user via the network; and
control means for controlling predetermined video output means and audio output means so that the acquired video and audio of the other site user are output together with the information that, among own site user information generated based on at least one of the acquired video and audio of the own site user, is synchronized with the acquired video and audio of the other site user on the time axis of the other site.

12. A computer program for causing a computer system to function as the video/audio transmission/reception device according to any one of claims 1 to 10.

13. A recording medium on which the computer program according to claim 12 is recorded.