JP2013115527A - Video conference system and video conference method

Info

Publication number: JP2013115527A
Application number: JP2011258427A
Authority: JP
Inventors: Kyoichi Nakakuma; 恭一中熊; Hiroki Mizozoe; 博樹溝添
Original assignee: Hitachi Consumer Electronics Co Ltd
Current assignee: Hitachi Consumer Electronics Co Ltd
Priority date: 2011-11-28
Filing date: 2011-11-28
Publication date: 2013-06-10

Abstract

PROBLEM TO BE SOLVED: To provide a mechanism for cutting the face of a person who is speaking out of a camera image and displaying the face superimposed on a material image. SOLUTION: In a transmission terminal 1, first image input means 100 receives input of image information, and face recognition means 101 recognizes a face. Second image input means 103 receives input of background information, and voice input means 106 determines the direction from which the voice arrived. In a reception terminal 2, directivity information acquisition means 207 obtains the direction from which the voice arrived, and cut-out information acquisition means obtains the position of the face within the image information. Image cut-out means 204 cuts out the face image located in the arrival direction of the voice, image superimposition means 205 superimposes the face image on the background information, and display means 206 outputs the resulting image to a display device 5.

Description

The present invention relates to a video conference system comprising a plurality of information terminals, and more particularly to an information terminal that transmits video, audio, and data to other information terminals and that receives and reproduces such information.

In recent years, with the spread of high-speed Internet access, more and more users are using video conference functions that provide two-way communication through a connected TV or camera. In particular, there is a rapidly growing need in video conference systems for both parties to share materials and to view the material video and the other party's camera video at the same time.

Systems for viewing the material video and the other party's camera video simultaneously have been realized, for example, by dividing the screen into several regions and displaying the material video in one region and the other party's camera video in another. In this case, however, the material is displayed at a smaller size, which makes the text hard to read.

As a way of solving this problem, a technique such as that of Patent Document 1 can be used to display the region of the other party's face, captured by the other party's camera, at a small size using a picture-in-picture function, so that the material can instead be displayed at a larger size. With this method, however, the material becomes hard to see in situations such as when several people participate in the conference at the same time.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2001-177812

The object is to realize, in a video conference system, a mechanism that cuts the face of the person who has spoken out of the camera video and displays it superimposed on the material video.

To achieve the above object, for example, the configurations described in the claims are adopted.

It becomes possible to realize a mechanism that cuts the face of the person who has spoken out of the camera video and displays it superimposed on the material video. Problems, configurations, and effects other than those described above will become clear from the following description of the embodiments.

An example (1) of a block diagram of the system in Embodiment 1.
An example (2) of a block diagram of the system in Embodiment 1.
An example of an explanatory diagram of the stream 400 and the transmission information 410.
An example of an explanatory diagram of the cut-out information 424 and the directivity information 425.
An example (1) of an operation flow diagram of the system in Embodiment 2.
An example (2) of an operation flow diagram of the system in Embodiment 2.

The above problem is solved by the following means.

The video conference system according to the present application comprises, for example, a transmission terminal that inputs and transmits video information, audio information, and background information, and a reception terminal that receives and reproduces this information.

The transmission terminal comprises: first video input means for inputting video information from a camera; second video input means for inputting data and acquiring it as background information; audio input means for inputting audio information; face recognition means for recognizing a face in the video input by the first video input means; cut-out information generation means for generating cut-out information, which is the information used to cut out the face video; directivity information generation means for detecting the direction from which the voice came and generating directivity information; MUX means for multiplexing the directivity information, cut-out information, audio information, background information, and video information; and communication means for performing communication.

The reception terminal comprises: communication means for receiving information; DEMUX means for separating the information; cut-out information acquisition means for acquiring the cut-out information; directivity information acquisition means for acquiring the directivity information; second video acquisition means for acquiring the background information; video selection means for selecting a face on the basis of the cut-out information; video cut-out means for cutting the face video out of the video information using the cut-out information; video superimposing means for superimposing the face video on the background information; display means for displaying this information; and audio output means for outputting audio.

The reception terminal may also include each of the means of the transmission terminal and thereby have the functions of the transmission terminal. Likewise, the transmission terminal may include each of the means of the reception terminal and thereby have the functions of the reception terminal.

The above object, namely a mechanism that cuts the face of the person who has spoken out of the camera video and displays it superimposed on the material video, is realized by the operation of the following processing means.

In the transmission terminal, the first video input means inputs video information from the camera, and the second video input means inputs background information. The audio input means inputs audio information from the microphone. The face recognition means recognizes where in the video information the face is located, and the cut-out information generation means generates cut-out information indicating that position. The directivity information generation means analyzes the audio information input by the audio input means to determine the direction from which the voice arrived, and generates directivity information representing that direction. The MUX means multiplexes the video information, audio information, background information, cut-out information, and directivity information into a stream, and the communication means transmits the stream to the reception terminal over a communication line.

In the reception terminal, the communication means receives the stream and the DEMUX means demultiplexes it. The second video acquisition means acquires the background information, the cut-out information acquisition means acquires the cut-out information, and the directivity information acquisition means acquires the directivity information. The video selection means analyzes the directivity information and selects the video of the face located in the direction from which the voice arrived, and the video cut-out means cuts the face video out of the video information. The video superimposing means superimposes the face video on the background information to form a video, the display means displays that video, and the audio output means outputs the audio.

Through the operation of these processing means, a mechanism is realized that cuts the face of the person who has spoken out of the camera video and displays it superimposed on the material video.

The above object can also be realized by the operation of the following means.

In the transmission terminal, the first video input means inputs video information from the camera, and the second video input means inputs background information. The audio input means inputs audio information from the microphone. The MUX means multiplexes the video information, audio information, and background information into a stream, and the communication means transmits the stream to the reception terminal over a communication line.

In the reception terminal, the communication means receives the stream and the DEMUX means demultiplexes it. The second video acquisition means acquires the background information. The face recognition means recognizes the portion of the video information that corresponds to a face, and the cut-out information generation means generates cut-out information indicating that portion. The audio acquisition means acquires the audio information, and the directivity information generation means recognizes the direction from which the voice arrived and generates directivity information.

The video selection means analyzes the directivity information and selects the video of the face located in the direction from which the voice arrived, and the video cut-out means cuts the face video out of the video information. The video superimposing means superimposes the face video on the background information to form a video, the display means displays that video, and the audio output means outputs the audio.

Through the operation of these means, a mechanism is realized that cuts the face of the person who has spoken out of the camera video and displays it superimposed on the material video.

Embodiments of the present invention are described below with reference to the drawings.

Embodiment 1 is described below with reference to FIG. 1.

FIG. 1 is an example of a block diagram showing the configuration of a system that is an embodiment of the present invention. The system comprises a transmission terminal 1, a reception terminal 2, the Internet 3, an information device 4 that generates background information 423, a display device 5 that displays video information 424, a microphone device 8 that acquires audio information 422, and a speaker device 10 that outputs audio information 422. Persons 6 and 9 operate the transmission terminal 1 and are the subjects imaged by its camera, and person 7 operates the reception terminal 2.

This video conference system consists of a pair of terminals: a transmission terminal 1 and a reception terminal 2. The transmission terminal 1 and the reception terminal 2 are terminal devices that communicate with each other via the Internet 3. Each is, for example, a fixed information terminal such as a set-top box or personal computer, a mobile communication terminal such as a PDA or smartphone, or a communication device dedicated to video conferencing. The transmission terminal 1 and the reception terminal 2 are each equipped with a communication device and can communicate with each other by means of a program running on a CPU. The information device 4 may be any information device that generates video, for example a fixed computer such as a personal computer, or a camera. The display device 5 is, for example, a display device such as a television or a video projection device such as a projector, and may be any information device that displays video information.

An example of the configuration of the transmission terminal 1 and the reception terminal 2 is described with reference to the block diagram shown in FIG. 1. The transmission terminal 1 and the reception terminal 2 may implement the processing means corresponding to each block in FIG. 1 as hardware or as software.

When the processing means are implemented as hardware, a single piece of hardware may perform the processing of several processing means, or conversely several pieces of hardware may perform the processing of a single processing means.

When the processing means are implemented as software, the transmission terminal 1 and the reception terminal 2 are each equipped with a CPU, a storage device, a memory, and the like, and the CPU performs the processing of each processing means by loading the program stored in the storage device into the memory and executing it. Here too, a single piece of software may perform the processing of several processing means, or conversely several pieces of software may perform the processing of a single processing means.

To simplify the following description, it is assumed that the CPU in each of the transmission terminal 1 and the reception terminal 2 executes a video conference program, and that executing this program realizes functions equivalent to the processing means in the block diagram of FIG. 1. Each function realized when the CPU executes the video conference program is described below as being performed by the corresponding processing means shown in the block diagram of FIG. 1.

The transmission terminal 1 comprises: first video input means 100, which is connected to a device that outputs video, such as a camera device, and inputs video information 421; second video input means 103, which acquires background information 423 from the information device 4; audio input means 106, which acquires audio information 422 from the microphone device; face recognition means 101, which recognizes the face of person 6 or person 9 in the video information 421 input by the first video input means 100; cut-out information generation means 102, which converts the position of the face recognized by the face recognition means 101 into cut-out information 424; directivity information generation means 107, which recognizes the arrival direction of the voice from the audio information 422 input by the audio input means 106 and converts it into directivity information 425; MUX means 104, which multiplexes the video information 421, background information 423, audio information 422, directivity information 425, and cut-out information 424 into a stream 400; and communication means 105, which transmits the stream 400 to the reception terminal 2 via the Internet 3.

The reception terminal 2 comprises: communication means 200, which receives the stream 400 via the Internet 3; DEMUX means 201, which demultiplexes the stream 400; cut-out information acquisition means 202, which acquires the cut-out information 424 from the DEMUX means 201; second video acquisition means 203, which acquires the background information 423 from the DEMUX means 201; directivity information acquisition means 207, which acquires the directivity information 425 from the DEMUX means 201; video selection means 208, which selects the face of person 6 or person 9 on the basis of the directivity information 425; video cut-out means 204, which uses the cut-out information 424 and the video information 421 to cut out the video of the face selected by the video selection means 208; video superimposing means 205, which superimposes the face video on the background information 423 to produce the output video; display means 206, which displays the output video on the display device 5; and audio output means 213, which outputs audio.

Next, the stream 400 and the transmission information 410 used by the transmission terminal 1 and the reception terminal 2 are described with reference to FIG. 2.

The stream 400 consists of one or more pieces of transmission information 410. By sending the reception terminal 2 a stream 400 in which pieces of transmission information 410 are arranged in time order, the transmission terminal 1 can convey information such as video and audio to the reception terminal 2.

The transmission information 410 stores one or more of the video information 421, the audio information 422, the background information 423, the cut-out information 424, and the directivity information 425. It may store only one of these, or several kinds may be multiplexed into the same transmission information 410.

The video information 421 is data output by an information device that outputs video, such as a camera device, and contains the video of the face of person 6 or person 9 captured by the camera. The video information 421 includes time information 510, and the reception terminal 2 can display it on the display device 5 as a moving image by outputting it in the order given by the time information 510.

The audio information 422 is the data output by the microphone device 8 and carries the voices of person 6 and person 9, who are captured by the camera. It also carries directional information indicating the direction from which the sound arrived; stereo audio is one example. The audio information 422 includes time information 510 and can be output by the speaker device 10 in the order given by the time information 510.

The background information 423 is video output by an information device such as a personal computer, and may be either a moving image or a still image. The background information 423 includes time information 510 and can be output in the order given by the time information 510.
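The relationship between the stream 400 and the transmission information 410 can be sketched as follows. This is only an illustrative model under stated assumptions, not the format used by the described system: the names `TransmissionInfo`, `time_info`, `payloads`, and `build_stream` are hypothetical, payloads are plain bytes, and a single time value per unit stands in for the time information 510 that each kind of information carries.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical labels for the five kinds of information named in the text.
KINDS = ("video", "audio", "background", "cutout", "directivity")


@dataclass
class TransmissionInfo:
    """One unit of transmission information 410 (illustrative model only)."""
    time_info: int                                   # stands in for time information 510
    payloads: Dict[str, bytes] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # The text says each unit stores one or more kinds of information.
        if not self.payloads:
            raise ValueError("transmission information must carry at least one payload")
        unknown = set(self.payloads) - set(KINDS)
        if unknown:
            raise ValueError(f"unknown payload kinds: {sorted(unknown)}")


def build_stream(units: List[TransmissionInfo]) -> List[TransmissionInfo]:
    """Model stream 400 as the units arranged in time order."""
    return sorted(units, key=lambda unit: unit.time_info)
```

A unit may hold a single kind, such as audio only, or several kinds at once, such as video together with its cut-out information.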

Next, the cut-out information 424 and the directivity information 425 included in the transmission information 410 are described with reference to FIG. 3. The cut-out information 424 indicates the coordinates within the video information 421 at which the user's face contained in that video information is located. The directivity information 425 indicates the direction from which the sound detected by the microphone device 8 was emitted.

The cut-out information 424 consists of time information 510, video identification information 511, origin information 513, size 514, and azimuth information 515.

The time information 510 specifies when the information is to be output. It is, for example, an absolute time such as GMT or JST, or elapsed-time information in which one cycle at a given frequency counts as one. Using this time information 510, the video information 421, the audio information 422, the background information 423, the cut-out information 424, and the directivity information 425 included in the transmission information 410 can be output from the reception terminal 2 simultaneously at a determined time.
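As a small illustration of the elapsed-time form of the time information, the sketch below converts a tick count into seconds. The function name `ticks_to_seconds` is hypothetical, and the 90 kHz clock used in the usage note is an assumed example; the text does not specify a frequency.

```python
from fractions import Fraction


def ticks_to_seconds(ticks: int, clock_hz: int) -> Fraction:
    """Convert a count of clock cycles into elapsed seconds, exactly."""
    if clock_hz <= 0:
        raise ValueError("clock frequency must be positive")
    return Fraction(ticks, clock_hz)
```

With an assumed 90 kHz clock, `ticks_to_seconds(45000, 90000)` gives half a second. Two pieces of information whose tick counts are equal would be output at the same moment.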

The video identification information 511 is an identifier uniquely assigned to each piece of video information 421 and specifies that video information 421.

The origin information 513 represents the origin that determines the coordinate position from which the video information 421 identified by the video identification information 511 is to be cut out. It includes coordinate positions in the X-axis and Y-axis directions.

The size 514 is range information indicating how large a region, measured from the origin information 513 in the X-axis and Y-axis directions, is to be cut out. It is expressed, for example, as a height and width in dots.

The azimuth information 515 indicates the direction in three-dimensional space from which the face video indicated by the cut-out information 424 arrived. It may be, for example, an angle measured clockwise through 360 degrees with north as 0, or a compass direction such as north, south, east, or west.

The directivity information 425 consists of time information 521 and azimuth information 523.

The time information 521 specifies when the information is to be output. It is, for example, an absolute time such as GMT or JST, or elapsed-time information in which one cycle at a given frequency counts as one.

The azimuth information 523 corresponds to the azimuth information 515 of the cut-out information 424. It is, for example, an angle measured clockwise through 360 degrees with north as 0, or a compass direction such as north, south, east, or west. In short, it may be any information that indicates the direction from which the voice came.
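The two structures described above can be modeled as simple records. The following is a sketch under assumptions: the class and field names are hypothetical, azimuths are stored as degrees, and `azimuth_degrees` is an illustrative helper that accepts either an angle or one of the four compass points mentioned in the text.

```python
from dataclasses import dataclass
from typing import Union

# Clockwise angles with north as 0, as in the text's example.
COMPASS_DEGREES = {"N": 0.0, "E": 90.0, "S": 180.0, "W": 270.0}


def azimuth_degrees(value: Union[float, str]) -> float:
    """Normalize an azimuth given as an angle or a compass point to [0, 360)."""
    if isinstance(value, str):
        return COMPASS_DEGREES[value.upper()]
    return float(value) % 360.0


@dataclass(frozen=True)
class CutoutInfo:        # cut-out information 424
    time_info: int       # time information 510
    video_id: str        # video identification information 511
    origin_x: int        # origin information 513 (X coordinate)
    origin_y: int        # origin information 513 (Y coordinate)
    width: int           # size 514 (horizontal dots)
    height: int          # size 514 (vertical dots)
    azimuth: float       # azimuth information 515, in degrees


@dataclass(frozen=True)
class DirectivityInfo:   # directivity information 425
    time_info: int       # time information 521
    azimuth: float       # azimuth information 523, in degrees
```

Storing both azimuths in the same unit keeps the later comparison between azimuth information 515 and azimuth information 523 straightforward.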

Next, the operation that realizes the mechanism for cutting the face of the person who has spoken out of the camera video and displaying it superimposed on the material video is described with reference to the flowchart of FIG. 4.

When a user instructs the start of a video conference between the transmission terminal 1 and the reception terminal 2, process 1000 is executed in the transmission terminal 1 and process 1100 is executed in the reception terminal 2. The description below follows the flowchart.

In each of the transmission terminal 1 and the reception terminal 2, the CPU reads the video conference program from the storage device, loads it into memory, and executes it, thereby realizing the video conference function (1000, 1100).

The second video input means 103 of the transmission terminal 1 receives the background information 423 from an external or internal program and outputs it to the MUX means 104 (1001).
The first video input means 100 acquires the video information 421 from an imaging device, such as an external or internal camera device, and outputs it to the face recognition means 101 and the MUX means 104 (1002).

Next, the face recognition means 101 performs image analysis on the video information 421, detects where the face is, obtains its coordinate values, and outputs them to the cut-out information generation means 102 (1003). The face may be identified by analyzing the video pixel by pixel and detecting features such as the eyes, nose, mouth, and contour, or by matching against a database of face images. Any method may be used, provided that it can determine the coordinates within the video information 421 at which the face is located.

Next, the cut-out information generation means 102 receives the coordinate information obtained by the face recognition means 101 in step 1003, generates the cut-out information 424 on the basis of that coordinate information, and outputs it to the MUX means 104 (1004).

The cut-out information generation means 102 stores the current time in the time information 510, the identifier of the video information 421 in the video identification information 511, and the values obtained from the coordinate information acquired in step 1003 in the origin information 513 and the size 514. It also calculates, taking the direction in which the camera device faces as the origin, the azimuth at which the face in the video information 421 is located, and stores it in the cut-out information 424 as the azimuth information 515.
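One possible way to compute the azimuth of a face relative to the camera direction is sketched below. The text does not say how the calculation is done; this example assumes an ideal pinhole camera with a known horizontal field of view, and the name `face_azimuth_deg` and its parameters are hypothetical.

```python
import math


def face_azimuth_deg(origin_x: int, width: int, frame_width: int, hfov_deg: float) -> float:
    """Horizontal angle of a face box center relative to the camera axis.

    0 means straight ahead; positive values are to the right of the axis.
    Assumes an ideal pinhole camera with horizontal field of view hfov_deg.
    """
    center_x = origin_x + width / 2.0
    # Offset of the box center from the image center, normalized to [-1, 1].
    normalized = (center_x - frame_width / 2.0) / (frame_width / 2.0)
    half_fov = math.radians(hfov_deg) / 2.0
    return math.degrees(math.atan(normalized * math.tan(half_fov)))
```

For a 640-dot-wide frame and an assumed 60-degree field of view, a face box centered in the frame yields 0 degrees, and a box centered on the right edge yields about 30 degrees.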

Next, the audio input means 106 acquires the audio information 422 from the microphone device 8 (1005).
At this time, the audio input means 106 uses a plurality of microphones to detect the direction from which the voice arrives, and generates the directivity information 425 (1006). As explained with FIG. 3, the audio input means 106 stores the current time in the time information 521. It also calculates the azimuth of the arriving voice, taking the direction in which the camera device faces as the origin, and stores it in the azimuth information 523.
The cut-out information generation means 102 and the audio input means 106 then output the cut-out information 424 and the directivity information 425 that they respectively generated.
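The text says only that a plurality of microphones is used to detect the arrival direction. One common approach, shown here as an assumed illustration rather than the method of the described system, estimates the time difference of arrival between two microphones by cross-correlation and converts it to an angle. The function names, the two-microphone layout, and the far-field approximation are all assumptions.

```python
import math
from typing import Sequence

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def best_lag(left: Sequence[float], right: Sequence[float], max_lag: int) -> int:
    """Lag in samples by which `right` trails `left` (brute-force cross-correlation)."""
    def score(lag: int) -> float:
        return sum(
            left[i] * right[i + lag]
            for i in range(len(left))
            if 0 <= i + lag < len(right)
        )
    return max(range(-max_lag, max_lag + 1), key=score)


def arrival_angle_deg(lag_samples: int, sample_rate: float, mic_spacing_m: float) -> float:
    """Angle from the array's broadside direction, far-field approximation.

    Positive means the sound reached the left microphone first.
    """
    delay_s = lag_samples / sample_rate
    # Clamp to the physically possible range before taking asin.
    ratio = max(-1.0, min(1.0, SPEED_OF_SOUND * delay_s / mic_spacing_m))
    return math.degrees(math.asin(ratio))
```

If the microphone array faces the same way as the camera, the resulting angle can be expressed with the camera direction as the origin, which is the convention the text uses for the azimuth information 523.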

Next, the MUX means 104 multiplexes the video information 421, the audio information 422, the background information 423, the cut-out information 424, and the directivity information 425 to create the stream 400, and transmits the stream 400 to the reception terminal 2 via the communication means and the Internet 3 (1007).

The communication means 200 of the receiving terminal 2 receives the stream 400 transmitted by the transmission terminal 1 in step 1007 (1101), and outputs the received stream 400 to the DEMUX means 201.
The DEMUX means 201 decomposes the multiplexed stream 400 (1102).
Next, the video cut-out means 204 acquires the video information 421 from the DEMUX means 201 (1103), the second video acquisition means 203 acquires the background information 423 (1104), and the cut-out information acquisition means 202 acquires the cut-out information 424 (1105).
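The multiplexing of step 1007 and the decomposition of step 1102 can be pictured with a simple tag-length-value layout. This is only an assumed illustration: the description does not specify the format of the stream 400, and the tag values and the function names `mux` and `demux` are hypothetical.

```python
import struct

# Hypothetical one-byte tags for the five kinds of information in stream 400.
TAGS = {"video": 1, "audio": 2, "background": 3, "cutout": 4, "directivity": 5}
NAMES = {tag: name for name, tag in TAGS.items()}


def mux(parts):
    """Concatenate (name, payload) pairs as tag, 4-byte length, payload."""
    out = bytearray()
    for name, payload in parts:
        out += struct.pack(">BI", TAGS[name], len(payload))
        out += payload
    return bytes(out)


def demux(stream):
    """Split a multiplexed byte string back into (name, payload) pairs."""
    parts, pos = [], 0
    while pos < len(stream):
        tag, length = struct.unpack_from(">BI", stream, pos)
        pos += 5  # one tag byte plus a four-byte length
        parts.append((NAMES[tag], stream[pos:pos + length]))
        pos += length
    return parts


sent = [("video", b"frame"), ("audio", b"pcm"), ("cutout", b"box")]
received = demux(mux(sent))
print(received == sent)
```

A practical system would more likely use an established container format; the point of the sketch is only that each kind of information remains separately recoverable on the receiving side.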

Next, the audio information 422 is acquired from the DEMUX means 201 (1106), and the directivity information acquisition means 207 acquires the directivity information 425 (1107). The audio output means 213 outputs the audio information 422 as sound through the speaker device 10.

Here, the video selection means 208 finds the cut-out information 424 whose direction information 515 matches the direction information 523 of the directivity information 425, and selects a face by acquiring the video identification information 511, the origin information 513, and the size 514 of that cut-out information (1108).
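The matching in step 1108 can be sketched as follows. Because a measured voice direction and a computed face direction will rarely be exactly equal, the sketch treats "match" as "closest within a tolerance"; that interpretation, the 10-degree default, and the dictionary layout are assumptions for illustration and are not stated in this description.

```python
def select_cutout(cutouts, voice_direction_deg, tolerance_deg=10.0):
    """Return the cut-out whose direction is closest to the voice, or None."""
    best = None
    best_error = tolerance_deg
    for cutout in cutouts:
        error = abs(cutout["direction_deg"] - voice_direction_deg)
        if error <= best_error:
            best, best_error = cutout, error
    return best


# Hypothetical cut-out records for two participants.
cutouts = [
    {"video_id": "421", "origin": (100, 80), "size": (160, 200), "direction_deg": -20.0},
    {"video_id": "421", "origin": (800, 90), "size": (160, 200), "direction_deg": 12.0},
]
speaker = select_cutout(cutouts, voice_direction_deg=15.0)
print(speaker["origin"] if speaker else None)
```

Returning `None` when no face lies within the tolerance leaves it to the caller to decide what to display, for example the background information alone.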

Next, the video cut-out means 204 cuts the video information 421 associated with the video identification information 511 according to the origin information 513 and the size 514, and thereby obtains a face video (1109).
Next, the video superimposing means 205 overwrites the face video onto the background information 423, superimposing the two to create new video information (1110).
The video superimposing means 205 outputs the created video information to the display device 5 (1111).
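Steps 1109 and 1110 amount to a rectangular crop followed by an overwrite. The following sketch represents a frame as a list of pixel rows so that it runs without any image library; the function names and the choice of paste position are hypothetical.

```python
def cut_out(frame, origin, size):
    """Crop the rectangle given by origin (x, y) and size (width, height)."""
    x, y = origin
    w, h = size
    return [row[x:x + w] for row in frame[y:y + h]]


def superimpose(background, face, position):
    """Return a copy of background with face overwritten at position (x, y)."""
    px, py = position
    result = [row[:] for row in background]
    for dy, face_row in enumerate(face):
        for dx, pixel in enumerate(face_row):
            # Pixels that would fall outside the background are skipped.
            if 0 <= py + dy < len(result) and 0 <= px + dx < len(result[0]):
                result[py + dy][px + dx] = pixel
    return result


# Toy frame: each pixel value encodes its row and column.
frame = [[r * 10 + c for c in range(6)] for r in range(4)]
background = [[0] * 5 for _ in range(3)]
face = cut_out(frame, origin=(2, 1), size=(2, 2))
output = superimpose(background, face, position=(3, 1))
for row in output:
    print(row)
```

Because `superimpose` works on a copy, the same background information can be reused for the next frame or for another face.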

Through the above operation, it is possible to realize a mechanism that cuts the face of the person who is speaking out of the camera video and displays it superimposed on the material video.

Alternatively, in step 1109 of the receiving terminal 2, instead of using the direction information 523 of the directivity information 425, the video information 421 may be cut out one face at a time, at time intervals, using the origin information 513 and the size 514 of each of the plurality of pieces of cut-out information 424. This also makes it possible to realize a mechanism that displays the faces of all participants on the background information 423 one at a time in turn.
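The time-based rotation described above could be implemented, for example, by deriving an index from the current time. The dwell time of five seconds and the function name are assumptions; this description only says that the faces are cut out in turn at time intervals.

```python
def cutout_for_time(cutouts, now_s, dwell_s=5.0):
    """Pick which cut-out to show at time now_s, rotating every dwell_s seconds."""
    if not cutouts:
        return None
    index = int(now_s // dwell_s) % len(cutouts)
    return cutouts[index]


participants = ["face A", "face B", "face C"]
for t in (0.0, 5.0, 14.9, 15.0):
    print(t, cutout_for_time(participants, t))
```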

Alternatively, in step 1110 of the receiving terminal 2, all of the face videos generated in step 1109 may be arranged and overwritten onto the background information 423. This also makes it possible to realize a mechanism that displays the faces of all participants on the background information 423 at the same time.
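How the face videos are arranged is not specified here. One simple possibility, given only as an assumed example, is to place them left to right and wrap to a new row when the background width is exceeded. The function name and the margin value are hypothetical.

```python
def row_layout(face_sizes, background_width, margin=8):
    """Return top-left positions placing faces left to right, wrapping rows."""
    positions = []
    x = y = margin
    row_height = 0
    for w, h in face_sizes:
        # Start a new row when this face would overrun the right edge.
        if x + w + margin > background_width and x > margin:
            x = margin
            y += row_height + margin
            row_height = 0
        positions.append((x, y))
        x += w + margin
        row_height = max(row_height, h)
    return positions


print(row_layout([(100, 120), (100, 100), (100, 120)], background_width=250))
```

Each returned position can then be used as the paste location when overwriting the corresponding face video onto the background information 423.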

Embodiment 2 is described below with reference to FIG. 5.

FIG. 5 is an example of a block diagram showing the configuration of the system of Embodiment 2. As in the system described with reference to FIG. 1, this system comprises the transmission terminal 1, the receiving terminal 2, the Internet 3, the information device 4 that generates the background information 423, the display device 5 that displays the video information 424, the microphone device 8 that acquires the audio information 422, and the speaker device 10 that outputs the audio information 422.

An example of the configuration of the transmission terminal 1 and the receiving terminal 2 is described with reference to the block diagram shown in FIG. 5.

The transmission terminal 1 comprises: first video input means 100 that is connected to a device that outputs video, such as a camera device, and receives the video information 421; second video input means 103 that acquires the background information 423 from the information device 4; voice input means 106 that acquires the audio information 422 from the microphone device; MUX means 104 that multiplexes the video information 421, the background information 423, and the audio information 422 and converts them into the stream 400; and communication means 105 that transmits the stream 400 to the receiving terminal 2 via the Internet 3.

The receiving terminal 2 comprises: communication means 200 that receives the stream 400 via the Internet 3; DEMUX means 201 that decomposes the stream 400; second video acquisition means 203 that acquires the background information 423 from the stream 400; face recognition means 209 that acquires the video information 421 from the stream 400 and recognizes where a face is located; cut-out information generation means 210 that generates the cut-out information 424 indicating where the face is located; voice acquisition means 211 that acquires the audio information 422 from the stream 400; directivity information generation means 212 that analyzes the audio information 422 and generates the directivity information 425 indicating the direction from which the voice arrived; video selection means 208 that selects the face of the person 6 or the person 9 based on the directivity information 425; video cut-out means 204 that cuts out, from the cut-out information 424 and the video information, the video of the face selected by the video selection means 208; video superimposing means 205 that superimposes the face video on the background information to produce an output video; and display means 206 that displays the output video on the display device 5.

Next, an implementation example of the mechanism that cuts the face of the person who is speaking out of the camera video and displays it superimposed on the material video is described with reference to the flowchart of FIG. 6.

The second video input means 103 of the transmission terminal 1 receives the background information 423 from an external or internal program and outputs it to the MUX means 104 (1001).
The first video input means 100 acquires the video information 421 from an imaging device such as an external or internal camera device, outputs it to the face recognition means 101 and the MUX means 104, and stores it (1002).

Next, the voice input means 106 acquires the audio information 422 from the microphone device 8 (1005).
The MUX means 104 multiplexes the video information 421, the audio information 422, and the background information 423 to create the stream 400, and transmits the stream 400 to the receiving device 2 via the communication means and the Internet 3 (1007).

The communication means 200 of the receiving terminal 2 receives the stream 400 transmitted by the transmission terminal 1 in step 1007 (1101), and outputs the received stream 400 to the DEMUX means 201.
The DEMUX means 201 decomposes the multiplexed stream 400 (1102).
Next, the video cut-out means 204 acquires the video information 421 from the DEMUX means 201 (1103), and the second video acquisition means 203 acquires the background information 423 (1104).

Next, the face recognition means 209 performs image analysis on the video information 421, detects where a face is located, acquires its coordinate values, and outputs them to the cut-out information generation means 210 (1113). The face may be detected by any method, for example by analyzing the video pixel by pixel and detecting features such as the eyes, nose, mouth, and contour, or by collating the video against a database of face images and finding a match. What matters is that the coordinates at which the face is located within the video information 421 can be determined.

Next, the cut-out information generation means 210 receives the coordinate information acquired by the face recognition means 209 in step 1113, generates the cut-out information 424 based on the coordinate information, and outputs it to the MUX means 104 (1114). The cut-out information generation means 210 stores the current time in the time information 510, the identification information of the video information 421 in the video identification information 511, and the values derived from the coordinate information acquired in step 1113 in the origin information 513 and the size 514. The cut-out information generation means 102 also calculates the direction in which the face in the video information 421 is located, and stores the result in the cut-out information 424 as the direction information 515.

Next, the voice acquisition means 211 acquires the audio information 422 from the DEMUX means 201 (1106).
The voice acquisition means 211 analyzes the audio to detect the direction from which the voice arrives, and generates the directivity information 425 (1115).
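This description does not say how the received audio is analyzed in step 1115. One possibility, shown strictly as an assumed example, is that the audio information 422 carries two channels and that the delay between them is estimated by cross-correlation. The function name `best_lag` and the two-channel assumption are hypothetical.

```python
def best_lag(left, right, max_lag):
    """Return the lag (in samples) at which right best matches left.

    A positive result means the right channel is delayed relative to the left.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, sample in enumerate(left):
            j = i + lag
            if 0 <= j < len(right):
                score += sample * right[j]
        if score > best_score:
            best, best_score = lag, score
    return best


# Toy signals: the right channel is the left channel delayed by two samples.
left = [0, 0, 1, 2, 1, 0, 0, 0]
right = [0, 0, 0, 0, 1, 2, 1, 0]
print(best_lag(left, right, max_lag=3))
```

Converting the lag into an arrival direction would additionally require the sampling rate and the microphone spacing used at the transmission terminal, neither of which is given here.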

Alternatively, in step 1109 of the receiving terminal 2, instead of using the direction information 523 of the directivity information 425, the video information 421 may be cut out one face at a time, at time intervals, using the origin information 513 and the size 514 of each of the plurality of pieces of cut-out information 424. This makes it possible to realize a mechanism that displays the faces of all participants on the background information 423 one at a time in turn.

Alternatively, in step 1110 of the receiving terminal 2, the plurality of face videos generated in step 1109 may be arranged and overwritten onto the background information 423. This makes it possible to realize a mechanism that displays the faces of a plurality of participants on the background information 423 at the same time.

In the embodiments above, the video cut-out means 204 of the receiving terminal 2 acquires the face video from the video information based on the cut-out information 424 and the directivity information 425. However, it is also possible to add video cut-out means to the transmission terminal 1 in the configuration of Embodiment 1. In that case, the cut-out information acquisition means 202, the directivity information acquisition means 207, and the video selection means 208 can be omitted from the configuration of the receiving device 2. Because the video information that the transmitting device 1 sends to the receiving device 2 via the communication means 105 is then a face video that has already been cut out by the video cut-out means, the amount of information transmitted from the transmitting device 1 to the receiving device 2 can be reduced.
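The reduction in transmitted information can be illustrated with rough arithmetic. The frame size, face size, and three bytes per pixel below are assumed values, and the comparison is for uncompressed pixels only; an actual stream would normally be compressed, so the real saving would differ.

```python
def raw_bytes(width, height, bytes_per_pixel=3):
    """Size of one uncompressed frame in bytes."""
    return width * height * bytes_per_pixel


full_frame = raw_bytes(1280, 720)  # whole camera frame (assumed size)
face_only = raw_bytes(160, 200)    # cut-out face (assumed size)
print(full_frame, face_only, round(100 * face_only / full_frame, 1))
```

Under these assumptions the cut-out face is roughly 3.5 percent of the full frame, which is the kind of saving the configuration above is aiming at.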

The present invention is not limited to the embodiments described above and includes various modifications. For example, the embodiments above have been described in detail in order to explain the present invention clearly, and the invention is not necessarily limited to configurations that include every element described. Part of the configuration of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of one embodiment. For part of the configuration of each embodiment, other configurations can be added, deleted, or substituted.

Each of the configurations, functions, processing units, processing means, and the like described above may be realized partly or wholly in hardware, for example by designing them as an integrated circuit. Each of the configurations, functions, and the like described above may also be realized in software, with a processor interpreting and executing a program that implements each function. Information such as the programs, tables, and files that implement each function can be stored in a memory, in a recording device such as a hard disk or an SSD (Solid State Drive), or on a recording medium such as an IC card, an SD card, or a DVD (Digital Versatile Disk).

The control lines and information lines shown are those considered necessary for the explanation, and not all control lines and information lines of a product are necessarily shown. In practice, almost all of the components may be considered to be connected to one another.

The embodiments above are effective for a video conference system that, in a communication system connected wirelessly or by wire, realizes picture-on-picture display in which a material video and a camera video are superimposed.

DESCRIPTION OF SYMBOLS
1 ... Transmission terminal
2 ... Receiving terminal
400 ... Stream
410 ... Transmission information
424 ... Cut-out information
425 ... Directivity information
1000 ... Processing flow that realizes the operation of the transmission terminal 1
1100 ... Processing flow that realizes the operation of the receiving terminal 2

Claims

A video conference system comprising a transmission terminal that transmits video and a reception terminal that receives video transmitted from the transmission terminal, wherein the transmission terminal includes:
First video input means for receiving input of the first video;
Second video input means for receiving input of the second video;
Face recognition means for recognizing a human face included in the first video;
Cutting information generating means for generating cutting information which is information indicating the coordinates of the face;
A first communication means for transmitting the first video, the second video and the cutout information;
The receiving terminal is
Second communication means for receiving the first video, the second video, and the cutout information;
Video cutting means for generating the face as a third video based on the first video and the clipping information;
Video superimposing means for superposing the third video and the second video to generate a fourth video;
Output means for outputting the fourth video,
A video conference system characterized by that.

The video conference system according to claim 1,
The transmitting terminal is
Voice input means for inputting voice;
Directional information generating means for generating directional information indicating the direction of voice arrival,
The first communication means transmits the first video, the second video, the cutout information, and the directional information,
A video conference system characterized by that.

The video conference system according to claim 2,
The second communication means receives the first video, the second video, the cutout information, and the directional information,
The receiving terminal is
Based on the directivity information, from a plurality of third video created by the video clipping means, comprising video selection means for selecting a video to be superimposed on the second video,
The video superimposing unit generates the fourth video by superimposing the third video selected by the video selection unit and the second video,
A video conference system characterized by that.

A video conference system comprising a transmission terminal that transmits video and a reception terminal that receives video transmitted from the transmission terminal, wherein the transmission terminal includes:
First video input means for receiving input of the first video;
Second video input means for receiving input of the second video;
A first communication means for transmitting the first video and the second video,
The receiving terminal is
Second communication means for receiving the first video and the second video;
Face recognition means for recognizing a human face included in the first video;
Cutting information generating means for generating cutting information which is information indicating the coordinates of the face;
Video cutting means for generating the face as a third video based on the first video and the clipping information;
Video superimposing means for superposing the third video and the second video to generate a fourth video;
Output means for outputting the fourth video,
A video conference system characterized by that.

The video conference system according to claim 4,
The transmitting terminal includes voice input means for inputting voice,
The first communication means transmits the first video, the second video, and the audio;
A video conference system characterized by that.

The video conference system according to claim 5,
The second communication means receives the first video, the second video, and the audio,
The receiving terminal is
Directional information generating means for generating directional information indicating the arrival direction of the voice;
Video selection means for selecting a video to be superimposed on the second video from a plurality of third videos created by the video clipping means based on the directional information;
The video superimposing unit generates the fourth video by superimposing the third video selected by the video selection unit and the second video,
A video conference system characterized by that.

A video conference method used in a video conference system comprising: a transmission terminal that transmits video; and a reception terminal that receives video transmitted from the transmission terminal, wherein the transmission terminal includes:
Receiving the input of the first video;
Receiving a second video input;
Recognizing a human face included in the first video;
Generating cutout information that is information indicating the coordinates of the face;
Transmitting the first video, the second video, and the cut-out information;
In the receiving device,
Receiving the first video, the second video and the clipping information;
Generating the face as a third video based on the first video and the clipping information;
Superposing the third video and the second video to generate a fourth video;
And a step of outputting the fourth video.

The video conference method according to claim 7, wherein in the transmission device,
Receiving voice input;
Generating directivity information indicating a direction of arrival of the voice, and
In the step of transmitting the first video, the second video, and the cutout information, the directional information is transmitted.
A video conferencing method characterized by the above.

The video conference method according to claim 8, wherein in the receiving device,
Obtaining the directional information;
Selecting a video to be superimposed on the second video from a plurality of third videos based on the directional information;
Generating a third video from the cut-out information and the directional information;
A video conferencing method characterized by:

A video conference method used in a video conference system comprising: a transmission terminal that transmits video; and a reception terminal that receives video transmitted from the transmission terminal, wherein the transmission terminal includes:
Receiving the input of the first video;
Receiving a second video input;
Transmitting the first video and the second video, and
In the receiving terminal,
Receiving the first video and the second video;
Recognizing a human face included in the first video;
Generating cutout information that is information indicating the coordinates of the face;
Generating the face as a third video based on the first video and the clipping information;
Superposing the third video and the second video to generate a fourth video;
And a step of outputting the fourth video.

The video conference method according to claim 10, wherein in the transmission terminal,
Receiving voice input;
And transmitting the voice,
A video conferencing method characterized by the above.

The video conference method according to claim 11, wherein in the receiving terminal,
Receiving the voice,
Generating directional information indicating a direction of arrival of the voice;
Selecting a video to be superimposed on the second video from the plurality of third videos based on the directivity information, and
In the step of generating the fourth video by superimposing the third video and the second video, the fourth video is generated by superimposing the selected third video and the second video,
A video conferencing method characterized by the above.