JP2012156820A

JP2012156820A - Video communication system, and operation method of the same

Info

Publication number: JP2012156820A
Application number: JP2011014719A
Authority: JP
Inventors: Atsushi Miyama; 篤深山; Naoyoshi Kanamaru; 直義金丸; Noriyasu Arakawa; 則泰荒川; Shunsuke Takamiya; 駿介高宮
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-01-27
Filing date: 2011-01-27
Publication date: 2012-08-16
Anticipated expiration: 2031-01-27
Also published as: JP5553782B2

Abstract

PROBLEM TO BE SOLVED: To provide a video communication system consisting of two or more video communication terminals and a video relay server. SOLUTION: In a video communication system, a first video communication terminal 20 has an original video transmission part 201 transmitting an original video to a video relay server 10. A second video communication terminal 30 has: an attention object image generation part 301 generating an attention object image including all or a part of an attention object designated in the original video; and an attention object image transmission part 302 transmitting the attention object image to the video relay server 10. The video relay server 10 has: an attention object detector 101 collating the original video received from the first video communication terminal 20 with the attention object image to detect an attention object coordinate in the original video; a composite video generation part 102 compositing a predetermined diagram at a position of the attention object in the original video based on the attention object coordinate to generate a composite video; and a composite video transmission part 103 transmitting the composite video to the first video communication terminal 20 or the second video communication terminal 30.

Description

The present invention relates to a video communication system that transmits and receives video between video communication terminals, and more particularly to a video communication system that facilitates communication between the users of the respective video communication terminals, and to a method of operating the same.

Video communication systems are already in widespread use and have been implemented not only in expensive business videoconferencing systems but also in videoconferencing systems that use consumer mobile phones as video communication terminals.

A general problem of video communication systems is that, unlike in an ordinary face-to-face conversation, it is difficult to talk about an object shown in the video by pointing to it with demonstratives such as "that" and "it" or with gestures. To solve this problem, systems have been put into practical use in which the same video is shared between video communication terminals, a user of a video communication terminal draws a figure on the shared video, and the video and the figure are displayed superimposed (see, for example, Non-Patent Document 1). Most of these systems share a screen such as a whiteboard, an electronic file, or a Web page between the video communication terminals separately from the videoconference video and draw on that screen, but superimposing figures on the videoconference video itself is likewise technically possible.

“Real Collabo” (リアルコラボ), NTT Software Corporation, [online], [retrieved January 11, 2011], Internet <http://www.ntts.co.jp/products/realcollabo/index.html>

Applying the function of superimposing figures on video to the videophone function of mobile terminals such as mobile phones is expected to facilitate conversation in a variety of situations beyond conventional videoconferences. To do so, however, the following problems must be addressed.

Many mobile terminals are held in the hand during use, so camera shake is large, and the position of a superimposed figure in the video drifts from the intended location. In enterprise videoconferencing systems such as the technology of Non-Patent Document 1, the camera is basically fixed, and the subject (target object) captured in the video also moves little. When a camera-equipped mobile terminal is used as a video communication terminal, by contrast, not only does the camera move frequently, but objects in the video also move more in settings such as outdoors. Consequently, even if an arrow is drawn on an object shown at the center of the video at one moment, that object may well be displayed elsewhere in the video the next moment. If the arrow then stays fixed at the center of the video, it ends up pointing at a different object and loses its meaning.

Furthermore, the videoconferencing system of Non-Patent Document 1 uses fixed business terminals as video communication terminals and can therefore be given sufficient processing capability, whereas the processing capability of mobile terminals is often comparatively limited. A mobile terminal may therefore be unable to execute high-load processing such as superimposing figures.

In addition, high-load processing such as superimposing figures requires special dedicated software, so commercially available mobile terminals presumably cannot be used as they are. Providing each video communication terminal as a mobile terminal dedicated to the videoconferencing system, or equipping it with dedicated software, raises problems such as increased system cost and reduced user convenience.

The present invention has been made in view of the above problems, and its object is to provide a video communication system that facilitates communication between the users of the respective video communication terminals, and a method of operating the same.

Unlike conventional videoconferencing systems, the video communication system according to the present invention is provided with a video relay server that generates and provides a composite video in which a predetermined figure is superimposed on the original video supplied from a video communication terminal, in order to facilitate communication between the users of the respective video communication terminals; each video communication terminal carries out video communication via the video relay server. Note that the subject of the present invention is therefore not the manner in which each video communication terminal designates the figure to be superimposed or its coordinate position, but the fact that the video relay server generates and provides the composite video once a video communication terminal has designated the figure to be superimposed or its coordinate position. In this respect the invention is to be technically distinguished from conventional videoconferencing systems.

That is, the video communication system according to a first aspect of the present invention is a video communication system comprising two or more video communication terminals and a video relay server, wherein a first video communication terminal includes an original video transmission unit that transmits an original video to the video relay server; a second video communication terminal includes an attention object image generation unit that generates an attention object image containing all or part of an attention object designated in the original video, and an attention object image transmission unit that transmits the attention object image to the video relay server; and the video relay server includes an attention object detection unit that collates the original video received from the first video communication terminal with the attention object image and detects attention object coordinates indicating the position at which the attention object appears in the original video, a composite video generation unit that generates a composite video by compositing a predetermined figure at the position of the attention object in the original video based on the attention object coordinates, and a composite video transmission unit that transmits the composite video to the first video communication terminal or the second video communication terminal. This makes it possible to realize video communication between the video communication terminals while reducing the processing load on the first video communication terminal.

In the video communication system according to the first aspect of the present invention, the composite video transmission unit has means for transmitting the composite video to both the first video communication terminal and the second video communication terminal. This allows the second video communication terminal to share the composite video with the first video communication terminal.

In the video communication system according to the first aspect of the present invention, the attention object image generation unit of the second video communication terminal has means for generating an attention object image containing all or part of an attention object designated in the composite video. This allows the second video communication terminal to designate a further attention object in the composite video.

Furthermore, the video communication system according to a second aspect of the present invention is a video communication system comprising two or more video communication terminals and a video relay server, wherein a first video communication terminal includes an original video transmission unit that transmits an original video to the video relay server; a second video communication terminal includes an attention region coordinate designation processing unit that generates attention region coordinate information indicating an attention region designated in the original video, and an attention region coordinate transmission unit that transmits the attention region coordinate information to the video relay server; and the video relay server includes an attention object image extraction unit that, based on the attention region coordinate information received from the second video communication terminal, extracts from the original video a partial image containing all or part of an attention object and generates it as an attention object image, an attention object detection unit that collates the original video received from the first video communication terminal with the attention object image and detects attention object coordinates indicating the position at which the attention object appears in the original video, a composite video generation unit that generates a composite video by compositing a predetermined figure at the position of the attention object in the original video based on the attention object coordinates, and a composite video transmission unit that transmits the composite video to the first video communication terminal or the second video communication terminal. This makes it possible to realize video communication between the video communication terminals while reducing the processing load on the first video communication terminal.

In the video communication system according to the second aspect of the present invention, the composite video transmission unit has means for transmitting the composite video to both the first video communication terminal and the second video communication terminal. This allows the second video communication terminal to share the composite video with the first video communication terminal.

In the video communication system according to the second aspect of the present invention, the attention region coordinate designation processing unit of the second video communication terminal has means for generating attention region coordinate information indicating an attention region designated in the composite video. This allows the second video communication terminal to designate a further attention object in the composite video.

The operating method for the video communication system according to the first aspect of the present invention is a method of operating a video relay server in a video communication system comprising two or more video communication terminals and the video relay server, wherein a first video communication terminal includes an original video transmission unit that transmits an original video to the video relay server, and a second video communication terminal includes an attention object image generation unit that generates an attention object image containing all or part of an attention object designated in the original video, and an attention object image transmission unit that transmits the attention object image to the video relay server, the method comprising the steps of: collating the original video received from the first video communication terminal with the attention object image and detecting attention object coordinates indicating the position at which the attention object appears in the original video; generating a composite video by compositing a predetermined figure at the position of the attention object in the original video based on the attention object coordinates; and transmitting the composite video to the first video communication terminal or the second video communication terminal.

The operating method for the video communication system according to the second aspect of the present invention is a method of operating a video relay server in a video communication system comprising two or more video communication terminals and the video relay server, wherein a first video communication terminal includes an original video transmission unit that transmits an original video to the video relay server, and a second video communication terminal includes an attention region coordinate designation processing unit that generates attention region coordinate information indicating an attention region designated in the original video, and an attention region coordinate transmission unit that transmits the attention region coordinate information to the video relay server, the method comprising the steps of: extracting from the original video, based on the attention region coordinate information received from the second video communication terminal, a partial image containing all or part of an attention object and generating it as an attention object image; collating the original video received from the first video communication terminal with the attention object image and detecting attention object coordinates indicating the position at which the attention object appears in the original video; generating a composite video by compositing a predetermined figure at the position of the attention object in the original video based on the attention object coordinates; and transmitting the composite video to the first video communication terminal or the second video communication terminal.

According to the present invention, when the video relay server superimposes a figure on the video, it superimposes a predetermined figure on the target object in the video designated by a video communication terminal, so that a composite video in which the figure follows the target object can be provided even when the object moves within the video.

Moreover, the processing of generating and providing the composite video in which the figure is superimposed so as to follow the target object is performed not by the video communication terminals but by the video relay server on the network, so that even video communication terminals with low processing capability can be used in the video communication system of the present invention.

In addition, existing standard schemes can be used for video transmission and reception by the video communication terminals, so that standard video communication terminals and their programs can be used without customization.

Thus, according to the present invention, the video relay server both relays the video and composites the figure, which makes it possible to provide a video communication system that enhances the convenience of existing videophone systems using video communication terminals that have low performance and only standard video communication functions.

A diagram showing a configuration example of the video communication system according to the present invention.
A diagram showing the configuration of the video communication system according to the first embodiment of the present invention.
A block diagram of the video communication system according to the first embodiment of the present invention.
An operation flowchart of the video communication system according to the first embodiment of the present invention.
A diagram showing the configuration of the video communication system according to the second embodiment of the present invention.
A block diagram of the video communication system according to the second embodiment of the present invention.
An operation flowchart of the video communication system according to the second embodiment of the present invention.
A diagram illustrating a composite video in the video communication system according to the present invention.

The video communication system of each embodiment of the present invention is described below with reference to the drawings. The overall configuration of the video communication system according to the present invention is described first; the more specific embodiments are described in detail later.

FIG. 1 is a diagram showing a configuration example of the video communication system according to the present invention. The video communication system according to the present invention comprises two or more video communication terminals and a video relay server that relays the video and audio of each video communication terminal over a network. The following description takes as a representative example, as shown in FIG. 1, the case where video communication is realized via a video relay server 10 between a first video communication terminal 20 used by user A and a second video communication terminal 30 used by user B. The video relay server 10 can be implemented on a single computer, but for convenience of explanation, and to illustrate the case where the first video communication terminal 20 and the second video communication terminal 30 communicate remotely over a network, it is described as consisting of a video composition unit 10a and a multipoint connection unit 10b.

The first video communication terminal 20 is, for example, an existing camera-equipped mobile terminal used by user A. It has a function by which user A transmits the original video captured with the built-in camera of the mobile terminal to the second video communication terminal 30 via the video relay server 10. The first video communication terminal 20 also has a display and playback function: it receives from the video relay server 10 a composite video in which a predetermined figure is superimposed on the original video, and displays it on the monitor screen of its own display so that user A can view it.

The second video communication terminal 30 is, for example, an existing personal computer used by user B. This personal computer may be operated with a mouse or may be of the touch-panel type, and it may also be a portable terminal. The second video communication terminal 30 has a display and playback function: it likewise receives from the video relay server 10 the composite video viewed on the first video communication terminal 20, and displays it on the monitor screen of its own display so that user B can view it. The second video communication terminal 30 further has a function of letting user B designate an object of interest in the original video or the composite video and transmitting the designated attention object image (first embodiment) or attention region coordinate information (second embodiment) to the video relay server 10, thereby enabling the video relay server 10 to generate a composite video in which a predetermined figure is superimposed on the original video (or a further composite video in which a predetermined figure is superimposed on the composite video already received). A mouse or a touch panel, for example, is suitable as the user interface with which user B designates the object of interest in the original video or the composite video.

Here, both the first video communication terminal 20 and the second video communication terminal 30 are assumed to have a function of realizing video communication via the multipoint connection unit 10b, as in existing videoconferencing systems. The multipoint connection unit 10b can be configured, for example, as a server for RTP (Real-time Transport Protocol) communication.

The video relay server 10 includes the video composition unit 10a, which is interconnected with the multipoint connection unit 10b and generates and provides a composite video in which a predetermined figure is superimposed on the original video (or a further composite video in which a predetermined figure is superimposed on the composite video generated in response to the previous designation from the second video communication terminal 30).

In other words, the video relay server 10 receives the original video from the first video communication terminal 20. When the second video communication terminal 30 designates a target object, the video relay server 10 receives from the second video communication terminal 30 the attention object image for the original video or the attention region coordinate information indicating the region of the attention object, extracts the image region of the designated target object, generates a composite video in which a predetermined figure is assigned to and superimposed on each target object, and transmits the composite video to the first video communication terminal 20 and/or the second video communication terminal 30. From the viewpoint of sharing the composite video, it is preferable that the composite video be transmitted to both terminals. Until the second video communication terminal 30 designates a target object, the video relay server 10 transmits the original video received from the first video communication terminal 20, as the composite video, to both the first video communication terminal 20 and the second video communication terminal 30.

The predetermined figure to be superimposed is prepared in advance by the video composition unit 10a and is, for example, a figure such as a circle or an arrow. Conceivable modes include one in which the superimposed figure is changed in turn according to the number of designations from the second video communication terminal 30, and one in which auxiliary information specifying the figure (for example, flag 1 for a circle, flag 2 for an arrow, and flag 3 for a speech balloon) is obtained from the second video communication terminal 30 together with the attention object image for the original video or the attention region coordinate information indicating the region of the attention object. In this case, it is preferable to provide the second video communication terminal 30 with tablet-style application software for selecting the figure.
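The two figure-selection modes described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the video composition unit 10a: the names `FigureFlag`, `ROTATION`, and `select_figure` are hypothetical, and the rotation order is assumed. Only the flag values 1 to 3 are taken from the text.

```python
from enum import IntEnum
from typing import Optional


class FigureFlag(IntEnum):
    """Flag values as given in the text; the class name is hypothetical."""
    CIRCLE = 1
    ARROW = 2
    BALLOON = 3


# Assumed rotation order for the mode in which the figure changes
# with the number of designations (the text does not fix an order).
ROTATION = [FigureFlag.CIRCLE, FigureFlag.ARROW, FigureFlag.BALLOON]


def select_figure(designation_count: int, flag: Optional[int] = None) -> FigureFlag:
    """Pick the figure to superimpose for the current designation.

    If the second terminal sent an explicit flag as auxiliary information,
    that flag wins. Otherwise the figure rotates with the 1-based number of
    designations received so far.
    """
    if flag is not None:
        return FigureFlag(flag)  # raises ValueError for an unknown flag
    if designation_count < 1:
        raise ValueError("designation_count is 1-based")
    return ROTATION[(designation_count - 1) % len(ROTATION)]
```

With this sketch, the first designation without a flag yields a circle, the second an arrow, and a designation carrying flag 3 always yields a speech balloon.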

In response to the attention object image for the original video, or the attention region coordinate information indicating the region of the attention object, received from the second video communication terminal 30, the video composition unit 10a executes existing object extraction processing (for example, the object extraction techniques of MPEG-4 are known) on the original video or composite video containing the designated target object. Even when it repeatedly receives the original video from the first video communication terminal 20 after sending out a composite video, the video composition unit 10a generates and provides, until the second video communication terminal 30 makes a further designation of a target object, a composite video in which the predetermined figure is superimposed at a position that follows the designated target object in the original video. When the designated target object disappears from or is hidden in the repeatedly received original video, the unit can be configured to provide a composite video without the figure superimposed.
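The per-frame behavior described so far (pass the original video through until a target is designated, follow the target once designated, and drop the figure while the target is not visible) can be sketched as a small state holder. The class `CompositingSession` and its method names are hypothetical, and the detection and overlay routines are injected as plain callables because the text leaves the concrete algorithms open.

```python
from typing import Any, Callable, Optional


class CompositingSession:
    """Hypothetical per-call state of the video composition unit."""

    def __init__(
        self,
        detect: Callable[[Any, Any], Optional[Any]],
        overlay: Callable[[Any, Any], Any],
    ) -> None:
        self._detect = detect    # (frame, target) -> position or None
        self._overlay = overlay  # (frame, position) -> composited frame
        self._target: Optional[Any] = None

    def designate(self, target: Any) -> None:
        """Record the latest target designated by the second terminal."""
        self._target = target

    def process(self, frame: Any) -> Any:
        """Return the frame to send out as the composite video."""
        if self._target is None:
            return frame  # no designation yet: relay the original video
        position = self._detect(frame, self._target)
        if position is None:
            return frame  # target gone or hidden: no figure superimposed
        return self._overlay(frame, position)
```

Calling `designate` again simply replaces the tracked target, which mirrors the statement that tracking continues until a further designation is made.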

Accordingly, the video communication system according to the present invention is assumed to be used in the following procedure.
(1) User A and user B perform video communication via the multipoint connection unit 10b using their respective video communication terminals 20 and 30.
(2) User A captures the surroundings with the built-in camera of the first video communication terminal 20 and transmits the result to the multipoint connection unit 10b as the original video.
(3) While viewing the composite video on the second video communication terminal 30, user B designates, on the screen of the second video communication terminal 30, an object that appears in the video and is noteworthy in the conversation.
(4) The second video communication terminal 30 transmits to the video composition unit 10a either the target object image cut out from the composite video as designated by user B, or the attention area coordinates indicating the location designated on the screen.
(5) The video composition unit 10a collates the target object image (either the one received from the second video communication terminal 30, or a partial image cut out from the original video or the like based on the received attention area coordinates) with the original video, and detects where the target object is located in the original video.
(6) The video composition unit 10a composites a predetermined figure at the detected location and transmits the result as a composite video to each of the video communication terminals 20 and 30 via the multipoint connection unit 10b.
Accordingly, each of the video communication terminals 20 and 30 displays the received composite video, and users A and B can both view a composite video on which the figure associated with the object designated by user B is superimposed.

Hereinafter, the video communication system according to the first embodiment of the present invention will be described more specifically with reference to FIGS. 2 to 4. FIG. 2 is a diagram showing the configuration of the video communication system according to the first embodiment of the present invention. FIG. 3 is a block diagram of the video communication system according to the first embodiment of the present invention. FIG. 4 is an operation flowchart of the video communication system according to the first embodiment of the present invention.

[First Embodiment]
Referring to FIG. 2, in the video communication system according to the first embodiment of the present invention, the first video communication terminal 20 generates the original video and transmits it to the video relay server 10. The video relay server 10 collates the target object image sent from the second video communication terminal 30 with the original video transmitted successively from the first video communication terminal 20, detects the position of the target object in the original video, and superimposes a figure at the detected position. More specifically, the video communication system of this embodiment includes a video relay server 10, a first video communication terminal 20, and a second video communication terminal 30. The video relay server 10 includes a target object detection unit 101, a composite video generation unit 102, and a composite video transmission unit 103. The first video communication terminal 20 has an original video transmission unit 201. The second video communication terminal 30 includes a target object image generation unit 301 and a target object image transmission unit 302. Note that FIG. 2 shows only the main parts relating to the present invention; this does not exclude functions that existing video communication terminals provide, such as image display and playback, communication, and user interface functions.

Note that the video relay server 10 according to the present embodiment can be configured as one or more computers. Its functions can be realized by storing a program describing the processing for each function in a predetermined storage unit (not shown) of the computer and having the central processing unit (CPU) of the computer read and execute this program.

The original video transmission unit 201 in the first video communication terminal 20 is a functional unit that transmits to the video relay server 10 the original video captured when user A operates the built-in camera to record the scene. The original video may be a moving image or a still image; the following description uses a moving image as an example.

The target object image generation unit 301 in the second video communication terminal 30 is a functional unit that, while user B views the original video, cuts out the partial image corresponding to the area of interest that user B designates by a method such as a mouse click, and generates it as the target object image. That is, the target object image generation unit 301 acquires from the original video the frame image at the moment user B made the designation and cuts out part of that frame image as the target object image based on the attention area coordinates. The partial image need not delineate the target object precisely; it is cut out at a prescribed size sufficient to identify the target object (for example, 1/50 of the frame image size). Of course, the entire frame image may also be designated as the target object image.
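The fixed-size cut-out around a clicked point can be sketched as follows. The description does not say whether "1/50 size" refers to area or to linear dimensions; this sketch assumes 1/50 of the frame area with the frame's aspect ratio preserved. The function names `crop_rect` and `crop`, and the clamping of the window to the frame edges, are also assumptions made for illustration.

```python
import math
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (left, top, width, height)


def crop_rect(frame_w: int, frame_h: int, click_x: int, click_y: int,
              area_fraction: float = 1 / 50) -> Rect:
    """Window covering `area_fraction` of the frame, centred on the click where possible."""
    scale = math.sqrt(area_fraction)  # same scale on both axes keeps the aspect ratio
    w = max(1, round(frame_w * scale))
    h = max(1, round(frame_h * scale))
    # Shift the window back inside the frame when the click is near an edge.
    left = min(max(click_x - w // 2, 0), frame_w - w)
    top = min(max(click_y - h // 2, 0), frame_h - h)
    return left, top, w, h


def crop(frame: List[List[int]], rect: Rect) -> List[List[int]]:
    """Cut the rectangle out of a frame stored as rows of pixel values."""
    left, top, w, h = rect
    return [row[left:left + w] for row in frame[top:top + h]]
```

For a 640x480 frame, this yields a 91x68 window; a click in a corner produces a window flush with that corner rather than one that extends past the frame.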

The target object image transmission unit 302 in the second video communication terminal 30 is a functional unit that transmits the target object image generated by the target object image generation unit 301 to the video relay server 10.

The target object detection unit 101 in the video relay server 10 is a functional unit that collates the target object image received from the second video communication terminal 30 with the original video that it has received and retained, and detects the location of the target object in the original video as the "target object coordinates". An existing object extraction technique can be used for this collation and detection, but the location may also be identified by simple pixel value matching.
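One way to read "simple pixel value matching" is an exhaustive search that minimises the sum of absolute differences (SAD) between the target object image and each same-sized window of the frame. The sketch below assumes grayscale frames stored as lists of rows; the function name `find_template` and the optional `max_sad` rejection threshold are hypothetical and are not specified in the description.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]


def find_template(frame: Image, template: Image,
                  max_sad: Optional[int] = None) -> Optional[Tuple[int, int, int]]:
    """Return (x, y, sad) of the best-matching window, or None if no acceptable match."""
    fh, fw = len(frame), len(frame[0])
    th, tw = len(template), len(template[0])
    best: Optional[Tuple[int, int, int]] = None
    for y in range(fh - th + 1):
        for x in range(fw - tw + 1):
            # Sum of absolute pixel differences for the window at (x, y).
            sad = sum(
                abs(frame[y + dy][x + dx] - template[dy][dx])
                for dy in range(th)
                for dx in range(tw)
            )
            if best is None or sad < best[2]:
                best = (x, y, sad)
    if best is None:
        return None  # template is larger than the frame
    if max_sad is not None and best[2] > max_sad:
        return None  # best window is still too different: treat the object as absent
    return best
```

The threshold gives a simple way to report that the object is no longer visible, which is the condition under which the figure would not be superimposed. An exhaustive search like this is slow on full-size frames, so it is meant only to show the idea.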

The composite video generation unit 102 in the video relay server 10 is a functional unit that composites a predetermined figure at the location detected by the target object detection unit 101 and generates the result as a composite video.
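As a rough illustration of compositing a figure at the detected location, the sketch below draws a circle outline onto a copy of a grayscale frame. The function name `draw_circle`, the one-pixel outline rule, and the pixel value 255 are assumptions; the description leaves the drawing method open.

```python
import math
from typing import List

Frame = List[List[int]]


def draw_circle(frame: Frame, cx: int, cy: int, radius: int, value: int = 255) -> Frame:
    """Return a copy of `frame` with a circle outline centred on (cx, cy)."""
    out = [row[:] for row in frame]  # leave the original frame untouched
    for y, row in enumerate(out):
        for x in range(len(row)):
            # Mark pixels whose distance from the centre is within half a pixel of the radius.
            if abs(math.hypot(x - cx, y - cy) - radius) < 0.5:
                row[x] = value
    return out
```

Returning a copy keeps the received original video available for the next collation step, so the drawn figure never contaminates the image that is matched against.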

The composite video transmission unit 103 in the video relay server 10 is a functional unit that transmits the composite video generated by the composite video generation unit 102 to each of the video communication terminals 20 and 30.

FIGS. 3 and 4 show a more specific example in which the video relay server 10 is configured from a video composition unit 10a and a multipoint connection unit 10b.

FIG. 4 will be described with reference to FIG. 3. First, the multipoint connection unit 10b of the video relay server 10 establishes RTP communication for video and audio communication between the first video communication terminal 20 and the second video communication terminal 30 (S1).

In the first video communication terminal 20, the camera photographing unit 2011 acquires the original video in response to user A's operation, the encoding unit 2012 encodes it with a predetermined coding scheme (for example, MPEG-4), and the RTP transmission unit 2013 transmits it to the multipoint connection unit 10b by RTP communication (S2). At this point, the second video communication terminal 30 is waiting for the original video (S3).

The multipoint connection unit 10b receives the encoded original video from the first video communication terminal 20 with the RTP reception unit 1013, converts it with the transcoding unit 1016 to a predetermined bit rate suited to video and audio communication, and forwards it to the second video communication terminal 30 with the RTP transmission unit 1017 (S4). The multipoint connection unit 10b also converts the encoded original video received from the first video communication terminal 20, using the transcoding unit 1014, to a predetermined bit rate suited to RTP communication with the video composition unit 10a, and forwards it to the video composition unit 10a with the RTP transmission unit 1015. The encoded original video forwarded to the video composition unit 10a is received and decoded by the RTP reception/decoding unit 1018 and passed to the target object detection unit 101.

The second video communication terminal 30 receives and decodes the encoded original video from the multipoint connection unit 10b with the RTP reception/decoding unit 3011, displays it on the monitor screen with the display unit 3014 (S5), and passes it to the target object image generation unit 301. The target object image generation unit 301 has a partial image extraction unit 3012 that extracts the partial image corresponding to one video frame in the original video designated by user B via the user input unit 3013. The partial image extraction unit 3012 generates this partial image as the target object image and passes it to the target object image transmission unit 302 (S6). The target object image transmission unit 302 has a frame image transmission unit 3021 that transmits the target object image to the video composition unit 10a over the network. The frame image transmission unit 3021 transmits the target object image, corresponding to the one video frame designated by user B, to the frame image reception unit 1012 of the video composition unit 10a (S7).

The target object detection unit 101 of the video composition unit 10a has a video collation/recognition processing unit 1011 that collates the target object image received via the frame image reception unit 1012 with the original video received from the first video communication terminal 20, and the video collation/recognition processing unit 1011 identifies the image region of the target object in the original video (S8). The video composition unit 10a further has a composite video generation unit 102 that generates a composite video by superimposing a predetermined figure on the original video at the position of the target image coordinates of the identified target object. The composite video generation unit 102 has a video composition processing unit 1021 that, for the original video obtained from the RTP reception/decoding unit 1018, acquires from a storage unit (not shown) holding the predetermined figures the information of the figure assigned to the target object and superimposes it (S9). The second video communication terminal 30 can also be configured to designate one or more target objects and to transmit, together with the partial image, an identifier that identifies each target object and a flag that selects the figure for that identifier. In this case, the target object detection unit 101 and the composite video generation unit 102 of the video composition unit 10a can be configured to acquire from the storage unit (not shown), for each target object, the figure designated by the flag and to superimpose it on the target object identified by the identifier.
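The identifier-and-flag configuration can be sketched as a small record per designated object plus a lookup table on the composition side. The record layout, the names `Designation` and `figures_by_object`, and the rule that a later designation with the same identifier replaces the earlier one are all assumptions; the description only states that an identifier and a figure flag accompany each partial image.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

# Flag-to-figure mapping following the example flags in the description.
FIGURES: Dict[int, str] = {1: "circle", 2: "arrow", 3: "balloon"}


@dataclass(frozen=True)
class Designation:
    """One designated object as it might be sent by the second terminal (hypothetical layout)."""
    object_id: str                      # identifier of the target object
    figure_flag: int                    # flag selecting the figure for this object
    patch: Tuple[Tuple[int, ...], ...]  # stand-in for the partial image data


def figures_by_object(designations: Iterable[Designation]) -> Dict[str, str]:
    """Build the per-object figure table used when compositing."""
    table: Dict[str, str] = {}
    for d in designations:
        if d.figure_flag not in FIGURES:
            raise ValueError(f"unknown figure flag: {d.figure_flag}")
        table[d.object_id] = FIGURES[d.figure_flag]  # a later entry replaces an earlier one
    return table
```

Keeping the table keyed by identifier lets several objects carry different figures in the same composite frame.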

The encoding/RTP transmission unit 1031 of the video composition unit 10a encodes the composite video obtained from the composite video generation unit 102 and sends it to the multipoint connection unit 10b by RTP communication. The multipoint connection unit 10b receives the composite video with the RTP reception unit 1032, converts it with the transcoding unit 1033 to a predetermined bit rate suited to video and audio communication, and forwards it to the first video communication terminal 20 and the second video communication terminal 30 with the RTP transmission unit 1034 (S10). Accordingly, the encoding/RTP transmission unit 1031, the RTP reception unit 1032, the transcoding unit 1033, and the RTP transmission unit 1034 function as the composite video transmission unit 103 that transmits the composite video to the video communication terminal 20. The composite video transmission unit 103 can also transmit the composite video to the video communication terminal 30 through the transcoding unit 1016 and the RTP transmission unit 1017.

The first video communication terminal 20 receives and decodes the composite video with the RTP reception/decoding unit 2014 and displays it on the monitor screen with the display unit 2015 (S11). Similarly, the second video communication terminal 30 receives and decodes the composite video with the RTP reception/decoding unit 3011 and displays it on the monitor screen with the display unit 3014 (S12).

As described above, the video communication system of this embodiment determines by collation the image region of the target object that user B of the second video communication terminal 30 designated as the object on which to superimpose a figure, within the video exchanged between the first video communication terminal 20 and the second video communication terminal 30, and superimposes the figure there. This solves the conventional problem in which, because a fixed drawing position was specified when the figure was drawn, the superimposed figure drifted away from the object due to camera shake or movement of the object.

Further, in the video communication system of this embodiment, the original video is transmitted from the first video communication terminal 20 to the video relay server 10 and the figure is superimposed by the video relay server 10. This solves the problem of insufficient processing capability in a mobile terminal used as the first video communication terminal 20.

Further, the video communication system of this embodiment removes the need for special dedicated hardware or software in the first video communication terminal 20. That is, video transmission and reception between the first video communication terminal 20 (or the second video communication terminal 30) and the video relay server 10 can be implemented in the same way as ordinary video communication, so the system can be realized without problems such as an excessive increase in cost.

Next, a video communication system according to a second embodiment of the present invention will be described with reference to FIGS. 5 to 7. FIG. 5 is a diagram showing the configuration of the video communication system according to the second embodiment of the present invention. FIG. 6 is a block diagram of the video communication system according to the second embodiment of the present invention. FIG. 7 is an operation flowchart of the video communication system according to the second embodiment of the present invention. Components similar to those of the first embodiment are given the same reference numerals.

[Second Embodiment]
Referring to FIG. 5, in the video communication system according to the second embodiment of the present invention, the first video communication terminal 20 generates the original video and transmits it to the video relay server 10. User B designates, on the second video communication terminal 30, where in the video exchanged through the video communication system the target object for figure superimposition appears, and the attention area coordinates indicating the designated location are transmitted to the video relay server 10. Based on these attention area coordinates, the video relay server 10 extracts the target object image from the original video received from the first video communication terminal 20, collates the extracted target object image with the original video to detect the position of the target object in the original video as the target object coordinates, and superimposes a figure at the detected position. More specifically, the video communication system of this embodiment includes a video relay server 10, a first video communication terminal 20, and a second video communication terminal 30. The video relay server 10 includes a target object detection unit 101, a composite video generation unit 102, a composite video transmission unit 103, and a target object image extraction unit 104. The first video communication terminal 20 has an original video transmission unit 201. The second video communication terminal 30 includes an attention area coordinate designation processing unit 303 and an attention area coordinate transmission unit 304. Note that FIG. 5 shows only the main parts relating to the present invention; this does not exclude functions that existing video communication terminals provide, such as image display and playback, communication, and user interface functions.

The second embodiment differs from the first in that the second video communication terminal 30 includes the attention area coordinate designation processing unit 303 and the attention area coordinate transmission unit 304, and in that the video relay server 10 includes the target object image extraction unit 104.

The attention area coordinate designation processing unit 303 in the second video communication terminal 30 is a functional unit that, while user B views the original video, identifies the coordinates of the area of interest that user B designates by a method such as a mouse click, and generates them as the attention area coordinates. That is, the attention area coordinate designation processing unit 303 acquires from the original video the frame image at the moment user B made the designation and designates attention area coordinates from which part of the frame image can be cut out as the target object image. The attention area coordinates may be a group of coordinates surrounding the target object or a single designated point; in the latter case, the video relay server 10 cuts out the image at a prescribed size sufficient to identify the target object (for example, 1/50 of the frame image size).

The attention area coordinate transmission unit 304 in the second video communication terminal 30 is a functional unit that transmits the attention area coordinates generated by the attention area coordinate designation processing unit 303 to the video relay server 10.

The target object image extraction unit 104 in the video relay server 10 extracts the target object image from the original video based on the "attention area coordinates" acquired from the second video communication terminal 30 and passes it to the target object detection unit 101. The target object image may be extracted, for example, by cutting out a region of the prescribed size sufficient to identify the target object (for example, 1/50 of the frame image size) with the attention area coordinates as its center of gravity.
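The server-side extraction has to handle both forms of attention area coordinates mentioned for the second embodiment: a single point, or a group of points surrounding the object. A minimal sketch, assuming a single point maps to a fixed-size window centred on it and a group of points maps to their bounding box; the function name `region_from_coordinates` and the clamping to the frame are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int]
Rect = Tuple[int, int, int, int]  # (left, top, width, height)


def region_from_coordinates(points: List[Point], frame_w: int, frame_h: int,
                            default_w: int, default_h: int) -> Rect:
    """Turn attention area coordinates into the rectangle to cut from the original video."""
    if not points:
        raise ValueError("at least one coordinate is required")
    if len(points) == 1:
        # Single point: use it as the centre of a window of the prescribed size.
        cx, cy = points[0]
        left = min(max(cx - default_w // 2, 0), frame_w - default_w)
        top = min(max(cy - default_h // 2, 0), frame_h - default_h)
        return left, top, default_w, default_h
    # Several points: take their bounding box, clamped to the frame.
    xs = [min(max(x, 0), frame_w - 1) for x, _ in points]
    ys = [min(max(y, 0), frame_h - 1) for _, y in points]
    left, top = min(xs), min(ys)
    return left, top, max(xs) - left + 1, max(ys) - top + 1
```

Sending only coordinates keeps the terminal-side work minimal, since the cut-out is taken from the copy of the original video that the server already holds.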

The target object detection unit 101 in the video relay server 10 is a functional unit that collates the target object image extracted by the target object image extraction unit 104 with the original video that it has received and retained, and detects the location of the target object in the original video as the "target object coordinates". An existing object extraction technique can be used for this collation and detection, but the location may also be identified by simple pixel value matching.

FIGS. 6 and 7 show a more specific example in which the video relay server 10 is configured from a video composition unit 10a and a multipoint connection unit 10b.

FIG. 7 will be described with reference to FIG. 6. First, the multipoint connection unit 10b of the video relay server 10 establishes RTP communication for video and audio communication between the first video communication terminal 20 and the second video communication terminal 30 (S21).

In the first video communication terminal 20, the camera photographing unit 2011 acquires the original video in response to user A's operation, the encoding unit 2012 encodes it with a predetermined coding scheme (for example, MPEG-4), and the RTP transmission unit 2013 transmits it to the multipoint connection unit 10b by RTP communication (S22). At this point, the second video communication terminal 30 is waiting for the original video (S23).

The multipoint connection unit 10b receives the encoded original video from the first video communication terminal 20 with the RTP reception unit 1013 and sends it to the video composition unit 10a via the transcoding unit 1014 and the RTP transmission unit 1015. Because no attention area coordinates have yet been designated for the first captured original video, the multipoint connection unit 10b obtains this original video, via the RTP reception unit 1032, as the composite video from the video composition unit 10a, and forwards it to the second video communication terminal 30 via the transcoding unit 1033 and the RTP transmission unit 1034 (S24).

The second video communication terminal 30 receives and decodes the encoded original video from the multipoint connection unit 10b with the RTP reception/decoding unit 3011 and displays it on the monitor screen with the display unit 3014 (S25). The attention area coordinate designation processing unit 303 generates attention area coordinates that identify the partial image corresponding to one video frame in the original video designated by user B via the user input unit 3031 (S26), and passes them to the attention area coordinate transmission unit 304. The attention area coordinate transmission unit 304 has a coordinate transmission unit 3021b that transmits the attention area coordinates to the video composition unit 10a over the network. The coordinate transmission unit 3021b transmits the attention area coordinates, which identify the target object in the one video frame designated by user B, to the coordinate reception unit 1012b of the video composition unit 10a (S27).

The target object image extraction unit 104 of the video composition unit 10a has a partial image extraction unit 1041 that extracts a partial image from the original video received from the first video communication terminal 20 based on the attention area coordinates. The partial image extraction unit 1041 passes the extracted partial image to the target object detection unit 101 as the target object image. The target object detection unit 101 has a video collation/recognition processing unit 1011 that collates the target object image extracted by the target object image extraction unit 104 with the original video received from the first video communication terminal 20, and the video collation/recognition processing unit 1011 identifies the target image coordinates of the target object in the original video (S28).

The video composition unit 10a has a composite video generation unit 102 that generates a composite video by superimposing a predetermined figure on the original video at the position of the target image coordinates of the identified target object. The composite video generation unit 102 has a video composition processing unit 1021 that, for the original video obtained from the RTP reception/decoding unit 1018, acquires from a storage unit (not shown) holding the predetermined figures the information of the figure assigned to the target object and superimposes it (S29). When specifying one or more sets of attention area coordinates that designate one or more target objects, the second video communication terminal 30 can also be configured to transmit, together with the attention area coordinates, an identifier that identifies each target object and a flag that selects the figure for that identifier. In this case, the target object detection unit 101 and the composite video generation unit 102 of the video composition unit 10a can be configured to acquire from the storage unit (not shown), for each target object, the figure designated by the flag and to superimpose it on the target object identified by the identifier.

The encoding/RTP transmission unit 1031 of the video composition unit 10a encodes the composite video obtained from the composite video generation unit 102 and sends it to the multipoint connection unit 10b by RTP communication. The multipoint connection unit 10b receives the composite video at the RTP reception unit 1032, converts it at the transcoding unit 1033 to a predetermined bit rate suited to video and audio communication, and transfers it to the second video communication terminal 30 through the RTP transmission unit 1034. It likewise converts the composite video at the transcoding unit 1016 to a predetermined bit rate suited to video and audio communication and transfers it to the first video communication terminal 20 through the RTP transmission unit 1017 (S30). Accordingly, the encoding/RTP transmission unit 1031, the RTP reception unit 1032, the transcoding unit 1033, the RTP transmission unit 1034, the transcoding unit 1016, and the RTP transmission unit 1017 function as the composite video transmission unit 103 that transmits the composite video to the video communication terminals 20 and 30.

The first video communication terminal 20 receives and decodes the composite video with the RTP reception/decoding unit 2014 and displays it on the monitor screen with the display unit 2015 (S31). Similarly, the second video communication terminal 30 receives and decodes the composite video with the RTP reception/decoding unit 3011 and displays it on the monitor screen with the display unit 3014 (S32).

In this way, the video communication system of the present embodiment can also be configured to include all the advantages of the first embodiment.

FIG. 8 shows an example in which predetermined figures are superimposed on the original video. A "circle", an "arrow", or a "speech balloon" can be added to the attention area in the captured image (the original video). For a composite video that has been given a "speech balloon", text can also be entered by transmitting text information over RTP communication in response to an instruction from the first video communication terminal 20 or the second video communication terminal 30.

According to the present invention, when the video relay server superimposes a figure on a video, it superimposes a predetermined figure on the target object in the video designated by a video communication terminal. Even if the position of the object in the video moves, a composite video in which the figure follows the target object can be provided, so the invention is useful for video communication applications that use mobile terminals.

DESCRIPTION OF SYMBOLS
10 Video relay server
10a Video composition unit
10b Multipoint connection unit
20 First video communication terminal
30 Second video communication terminal
101 Attention object detection unit
102 Composite video generation unit
103 Composite video transmission unit
104 Attention object image extraction unit
201 Original video transmission unit
301 Attention object image generation unit
302 Attention object image transmission unit
303 Attention area coordinate designation processing unit
304 Attention area coordinate transmission unit

Claims

A video communication system comprising two or more video communication terminals and a video relay server, wherein
a first video communication terminal includes an original video transmission unit that transmits an original video to the video relay server,
a second video communication terminal includes an attention object image generation unit that generates an attention object image including all or part of an attention object designated in the original video, and an attention object image transmission unit that transmits the attention object image to the video relay server, and
the video relay server includes:
an attention object detection unit that collates the original video received from the first video communication terminal with the attention object image and detects attention object coordinates indicating the position at which the attention object appears in the original video;
a composite video generation unit that generates a composite video by compositing a predetermined figure at the position of the attention object in the original video based on the attention object coordinates; and
a composite video transmission unit that transmits the composite video to the first video communication terminal or the second video communication terminal.

The video communication system according to claim 1, wherein the composite video transmission unit includes means for transmitting the composite video to both the first video communication terminal and the second video communication terminal.

The video communication system as described, wherein the attention object image generation unit of the second video communication terminal includes means for generating an attention object image including all or part of an attention object designated in the composite video.

A video communication system comprising two or more video communication terminals and a video relay server, wherein
a first video communication terminal includes an original video transmission unit that transmits an original video to the video relay server,
a second video communication terminal includes an attention area coordinate designation processing unit that generates attention area coordinate information indicating an attention area designated in the original video, and an attention area coordinate transmission unit that transmits the attention area coordinate information to the video relay server, and
the video relay server includes:
an attention object image extraction unit that extracts, as an attention object image, a partial image including all or part of an attention object in the original video based on the attention area coordinate information received from the second video communication terminal;
an attention object detection unit that collates the original video received from the first video communication terminal with the attention object image and detects attention object coordinates indicating the position at which the attention object appears in the original video;
a composite video generation unit that generates a composite video by compositing a predetermined figure at the position of the attention object in the original video based on the attention object coordinates; and
a composite video transmission unit that transmits the composite video to the first video communication terminal or the second video communication terminal.

The video communication system according to claim 4, wherein the composite video transmission unit includes means for transmitting the composite video to both the first video communication terminal and the second video communication terminal.

The video communication system as described, wherein the attention area coordinate designation processing unit of the second video communication terminal includes means for generating attention area coordinate information indicating an attention area designated in the composite video.

A method for operating a video relay server in a video communication system comprising two or more video communication terminals and the video relay server, wherein
a first video communication terminal includes an original video transmission unit that transmits an original video to the video relay server, and
a second video communication terminal includes an attention object image generation unit that generates an attention object image including all or part of an attention object designated in the original video, and an attention object image transmission unit that transmits the attention object image to the video relay server, the method comprising:
collating the original video received from the first video communication terminal with the attention object image and detecting attention object coordinates indicating the position at which the attention object appears in the original video;
generating a composite video by compositing a predetermined figure at the position of the attention object in the original video based on the attention object coordinates; and
transmitting the composite video to the first video communication terminal or the second video communication terminal.

A method for operating a video relay server in a video communication system comprising two or more video communication terminals and the video relay server, wherein
a first video communication terminal includes an original video transmission unit that transmits an original video to the video relay server, and
a second video communication terminal includes an attention area coordinate designation processing unit that generates attention area coordinate information indicating an attention area designated in the original video, and an attention area coordinate transmission unit that transmits the attention area coordinate information to the video relay server, the method comprising:
extracting a partial image including all or part of an attention object in the original video based on the attention area coordinate information received from the second video communication terminal, and generating an attention object image from it;
collating the original video received from the first video communication terminal with the attention object image and detecting attention object coordinates indicating the position at which the attention object appears in the original video;
generating a composite video by compositing a predetermined figure at the position of the attention object in the original video based on the attention object coordinates; and
transmitting the composite video to the first video communication terminal or the second video communication terminal.