JP6461679B2 - Video display system and video display method

Info

Publication number: JP6461679B2
Application number: JP2015071764A
Authority: JP
Inventors: 康夫高橋; 吏中野; 貴司折目; 雄一郎竹内; 暦本　純一; 純一暦本; 宮島　靖; 靖宮島
Original assignee: Sony Corp; Daiwa House Industry Co Ltd
Current assignee: Sony Corp; Daiwa House Industry Co Ltd
Priority date: 2015-03-31
Filing date: 2015-03-31
Publication date: 2019-01-30
Anticipated expiration: 2035-03-31
Also published as: WO2016159166A1; JP2016192688A

Description

The present invention relates to a video display system and a video display method, and more particularly to a video display system and a video display method for displaying video of a remote conversation partner on a display on the viewer's side.

Communication systems that let users in separate spaces converse while viewing each other's video (hereinafter, video display systems) are already known. In such a system, video data is transmitted from one user's side and is received and expanded on the other user's side, so that the video of one user is displayed on the display on the other user's side. As a result, users viewing each other's video on their displays feel as if they were meeting face to face.

Some of these video display systems convert the captured video into three dimensions using texture mapping or the like and display it (see, for example, Patent Document 1). Displaying such three-dimensional video further improves the sense of presence of a conversation held while the other party's video is shown on the display.

Furthermore, to heighten the sense of presence even more, some video display systems can align the line of sight of the person viewing the display with that of the person shown on the display (see, for example, Patent Documents 1 to 3). Specifically, in the systems described in Patent Documents 1 and 2, the camera installation position is determined in advance so that the lines of sight coincide. In the system described in Patent Document 3, the camera that captures the person shown on the display is moved up and down according to the eye height of the person viewing the display, thereby aligning the two lines of sight.

Patent Document 1: JP 2014-86774 A
Patent Document 2: JP 2000-32420 A
Patent Document 3: JP 2014-522622 A (published Japanese translation of a PCT application)

However, in the systems described in Patent Documents 1 and 2, the camera installation position is chosen so that the lines of sight coincide, which restricts the eye height the system can serve. Because the installation position of the camera is fixed, the system is difficult to use for a person whose eyes are at a height different from that position (specifically, the lines of sight no longer coincide). In the system described in Patent Document 3, on the other hand, the camera position can be adjusted according to the eye height of the person viewing the display, so various eye heights can be accommodated. However, a mechanism for adjusting the camera position must be provided, which raises the cost of building the system.

In addition, to further improve the sense of presence of a conversation using a video display system, the video on the display needs to change so as to follow the movement of the person viewing the display (especially movement of the face) and changes in the position of the person shown on the display. Specifically, when the face of the person viewing the display moves sideways, the displayed video should change to reflect how the scene would look if that person moved his or her face sideways while actually facing the conversation partner.

Also, the farther a subject is from the camera, the smaller that subject appears on the display. In an actual face-to-face conversation, however, when one party moves slightly away from the other, the apparent figure (size) of that party hardly seems to change from the other party's point of view. Taking this perception into account, it is desirable to adjust the display size of the subject's video when the distance between the subject and the camera, that is, the depth distance, changes.

The present invention has been made in view of the above problems. Its object is to provide a video display system and a video display method that can improve the sense of presence of a conversation held while a user's video is shown on a display, even when the eye height of the user shown on the display differs from the installation height of the imaging device.
Another object of the present invention is to change the video shown on the display so that it reflects the actual appearance when the face of a second user, who is viewing the video of the user shown on the display, moves sideways. A third object of the present invention is to appropriately adjust the display size of the video of the user shown on the display when the distance between that user and the imaging device changes.

According to the video display system of the present invention, the above problem is solved by a system comprising: (A) a video acquisition unit that acquires video of a user captured by an imaging device; (B) a distance data acquisition unit that acquires, for each video piece obtained by dividing the video into a predetermined number of video pieces, distance data indicating the distance between the imaging device and the object in that video piece; (C) a three-dimensional video generation unit that generates three-dimensional video of the user by executing a rendering process using the user's video and the distance data; and (D) a height detection unit that detects the height of the user's eyes, wherein (E) when the height at which the imaging device is installed differs from the eye height detected by the height detection unit, the three-dimensional video generation unit executes the rendering process so as to obtain the three-dimensional video of the user as seen from a virtual viewpoint located at the detected eye height, based on the difference between the two heights and the distance between the imaging device and the user.

With the above configuration, three-dimensional video of the user is generated by executing a rendering process that uses the video of the user captured by the imaging device and the distance data acquired for that video. When the user's eye height differs from the height at which the imaging device is installed, the rendering process is executed so as to obtain the three-dimensional video of the user as seen from a virtual viewpoint at the same height as the user's eyes. By using rendering, a 3DCG technique, to obtain three-dimensional video of the user as virtually seen from the user's own eye height, the line of sight of the person viewing the display can be aligned with that of the person shown on the display even when the two heights differ. This improves the sense of presence of a conversation held while the user's video is shown on the display.
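As a rough illustration of the viewpoint correction described above, the sketch below re-projects depth samples into a virtual camera placed at the user's eye height. It assumes a simple pinhole camera model, a camera looking horizontally, and image coordinates measured from the image centre with v pointing up. The function names and the `focal_px` parameter are hypothetical and are not taken from the patent, which does not specify the rendering mathematics in this section.

```python
import math


def backproject(u, v, depth, focal_px):
    """Turn a pixel offset (u, v) from the image centre plus a depth in metres
    into a camera-space point (x right, y up, z forward)."""
    return (u * depth / focal_px, v * depth / focal_px, depth)


def project(point, focal_px):
    """Pinhole projection of a camera-space point back to pixel offsets."""
    x, y, z = point
    return (focal_px * x / z, focal_px * y / z)


def view_from_eye_height(samples, camera_height, eye_height, focal_px):
    """Re-project (u, v, depth) samples into a virtual camera that sits at
    eye_height instead of camera_height (assumption: pure vertical shift)."""
    dy = eye_height - camera_height  # positive when the eyes are above the camera
    result = []
    for u, v, depth in samples:
        x, y, z = backproject(u, v, depth, focal_px)
        # Moving the camera up by dy moves every point down by dy in its frame.
        result.append(project((x, y - dy, z), focal_px))
    return result


def elevation_angle_deg(camera_height, eye_height, distance):
    """Angle at which the real camera sees the eyes, derived from the height
    difference and the camera-to-user distance."""
    return math.degrees(math.atan2(eye_height - camera_height, distance))
```

For example, with a camera at 1.0 m, eyes at 1.5 m and a camera-to-user distance of 2.0 m, the eye sample lands at the image centre of the virtual view, which corresponds to a level line of sight.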

In the above video display system, it is preferable that the video acquisition unit acquires both the video of the user captured by the imaging device and video of a background captured by the imaging device; that the distance data acquisition unit acquires the distance data for each of the user's video and the background video; that the three-dimensional video generation unit generates the three-dimensional video of the user by executing the rendering process using the user's video and the distance data acquired for it, and generates three-dimensional video of the background by executing the rendering process using the background video and the distance data acquired for it; and that the system has a composite video display unit that combines the three-dimensional video of the user with the three-dimensional video of the background and causes a display to show composite video in which the user is positioned in front of the background.
In this configuration, the three-dimensional video of the user and the three-dimensional video of the background are combined, and composite video in which the user is positioned in front of the background is displayed. Displaying composite video with this sense of depth further improves the sense of presence of a conversation held while the user's video is shown on the display.

In the above video display system, it is more preferable that the video acquisition unit further acquires video of a foreground captured by the imaging device; that the distance data acquisition unit further acquires the distance data for the foreground video; that the three-dimensional video generation unit further generates three-dimensional video of the foreground by executing the rendering process using the foreground video and the distance data acquired for it; and that the composite video display unit combines the three-dimensional video of the user, the three-dimensional video of the background and the three-dimensional video of the foreground, and causes the display to show composite video in which the user is positioned in front of the background and the foreground is positioned in front of the user.
In this configuration, the three-dimensional video of the foreground is combined in addition to the three-dimensional video of the user and of the background, and composite video in which the foreground is positioned in front of the user is displayed. The composite video thus has an even greater sense of depth, which further improves the sense of presence of a conversation held while the user's video is shown on the display.
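The layering of background, user and foreground can be sketched as a back-to-front composite. This is a minimal illustration, not the patent's implementation: it assumes each layer is a small grid of pixels with `None` marking transparent pixels and a single representative depth per layer, and the name `composite` is hypothetical.

```python
def composite(layers):
    """Combine layers back to front.

    layers: list of (depth_m, image) pairs, where image is a list of rows and
    None marks a transparent pixel. All images must have the same size.
    """
    # Draw the farthest layer first so nearer layers overwrite it.
    ordered = sorted(layers, key=lambda layer: layer[0], reverse=True)
    height = len(ordered[0][1])
    width = len(ordered[0][1][0])
    canvas = [[None] * width for _ in range(height)]
    for _, image in ordered:
        for r, row in enumerate(image):
            for c, pixel in enumerate(row):
                if pixel is not None:
                    canvas[r][c] = pixel
    return canvas


# One-row example: background everywhere, user in the middle, foreground on the left.
background = [["B", "B", "B"]]
user = [[None, "U", None]]
foreground = [["F", None, None]]
frame = composite([(1.0, foreground), (4.0, background), (2.0, user)])
```

Because the layers stay separate until this step, the position and size of each one can be changed independently before compositing.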

In the above video display system, it is even more preferable that the system includes a determination unit that determines, based on the distance data, whether the distance between the imaging device and the user has changed, and that, when the determination unit determines that the distance between the imaging device and the user has changed while the imaging device is capturing the user's video, the composite video display unit adjusts the display size of the user's video in the composite video so that it equals the display size before the distance changed.
With this configuration, even if the distance between the imaging device and the user, that is, the depth distance, changes, the display shows the three-dimensional video of the user at the display size it had before the change. In other words, when the user's depth distance changes, the composite video after the change shows the three-dimensional video of the user at a display size that reflects how the user would look to someone actually facing him or her (that is, the size of the user as perceived through one's own vision). This further improves the sense of presence of a conversation held while the user's video is shown on the display.
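The size adjustment can be illustrated with a short sketch. It assumes a pinhole model in which the on-screen size of the user is inversely proportional to the depth distance, so multiplying by the ratio of the new distance to the old one restores the earlier size. The function names and the `tolerance` threshold are hypothetical; the patent does not give a formula in this section.

```python
def distance_changed(distance_before, distance_after, tolerance=0.05):
    """Judge whether the camera-to-user distance changed (metres).
    The tolerance is an assumed dead band to ignore measurement noise."""
    return abs(distance_after - distance_before) > tolerance


def size_compensation_scale(distance_before, distance_after):
    """Scale factor that restores the display size the user had before moving,
    assuming apparent size is proportional to 1 / distance."""
    if distance_before <= 0 or distance_after <= 0:
        raise ValueError("distances must be positive")
    return distance_after / distance_before


def apparent_height_px(real_height_m, distance_m, focal_px):
    """Projected height of the user under the pinhole assumption."""
    return focal_px * real_height_m / distance_m
```

For instance, when a user steps back from 2.0 m to 3.0 m, the scale factor is 1.5, and applying it to the shrunken image brings the user back to the original on-screen height.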

In the above video display system, it is still more preferable that the system has a face movement detection unit that detects that the face of a second user, who views the composite video shown on the display, has moved in the width direction of the display; that, when the face movement detection unit detects the movement of the face, the composite video display unit executes a transition process that changes the composite video shown on the display from its state before the movement was detected; and that, in the transition process, the composite video is changed to a state in which one of (i) the display position of the three-dimensional video of the user in the composite video and (ii) the range of the three-dimensional video of the background that is included in the composite video is shifted along the width direction by an amount larger than the shift amount of the other.
With this configuration, the display position, display size and so on of the user's video and of the background video can be adjusted individually in the composite video obtained by combining them. When the second user's face moves sideways, the composite video is changed to a state in which one of the display position of the three-dimensional video of the user and the range of the three-dimensional video of the background included in the composite video is shifted sideways by an amount larger than the shift amount of the other. As a result, after the second user's face has moved sideways, the display shows video that reproduces how the user would look if the second user were actually facing the user and viewing him or her from the new face position. This further improves the sense of presence of a conversation held while the user's video is shown on the display.
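One way to picture why the two layers shift by different amounts is to treat the display as a window, with the user and the background at different depths behind it. The sketch below uses that window model as an assumption; it is not stated in the patent text here, and the function names are hypothetical. Under this model, a point at depth d behind the display, viewed from distance e in front of it, moves on the screen by face_shift × d / (e + d), so the more distant background shifts more than the user.

```python
def on_screen_shift(face_shift, viewer_distance, depth_behind_display):
    """Horizontal on-screen displacement of a point behind the display plane
    when the viewer's face moves sideways by face_shift (same length unit)."""
    if viewer_distance <= 0 or depth_behind_display < 0:
        raise ValueError("viewer_distance must be > 0 and depth must be >= 0")
    return face_shift * depth_behind_display / (viewer_distance + depth_behind_display)


def transition_offsets(face_shift, viewer_distance, user_depth, background_depth):
    """Return (user_offset, background_offset) for the transition process."""
    return (
        on_screen_shift(face_shift, viewer_distance, user_depth),
        on_screen_shift(face_shift, viewer_distance, background_depth),
    )


# Viewer 1 m from the display moves 0.3 m sideways; user 1 m and wall 3 m behind it.
user_offset, background_offset = transition_offsets(0.3, 1.0, 1.0, 3.0)
```

In this example the background offset is larger than the user offset, which matches the idea of shifting one element by a larger amount than the other.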

In the above video display system, it is also preferable that the video acquisition unit acquires, separately for each imaging device, the video of the user captured by a plurality of imaging devices that capture the user's video in mutually different imaging directions; that the distance data acquisition unit acquires the distance data for the user's video separately for each imaging device; that the three-dimensional video generation unit performs a video piece generation step of generating a three-dimensional video piece of the user for each imaging device based on the user's video and the distance data acquired for that imaging device, and a combining step of combining the three-dimensional video pieces of the respective imaging devices so that the common video regions they contain overlap, in order to generate the three-dimensional video of the user; and that, when generating the three-dimensional video piece of the portion including the user's eyes in the video piece generation step, if the installation height and the eye height differ, the unit executes the rendering process so as to obtain the three-dimensional video piece as seen from the virtual viewpoint, based on the difference between the two heights and the distance between the imaging device and the user.
With this configuration, when the user's video is captured by a plurality of imaging devices with different imaging directions, a three-dimensional video piece is generated for each imaging device, and the pieces are finally combined to obtain the three-dimensional video of the user. When generating the three-dimensional video piece that includes the user's eyes, the rendering process is executed so as to obtain three-dimensional video as seen from a virtual viewpoint at the user's eye height. Consequently, when the three-dimensional video of the user formed by combining the pieces is shown on the display, the user's line of sight can be aligned with that of the person viewing the display.

In the above video display system, it is further preferable that, when the imaging direction differs from the normal direction of a reference plane, the three-dimensional video generation unit converts, in the video piece generation step, the three-dimensional video piece of the user generated from the video captured in that imaging direction into the three-dimensional video piece as virtually seen from the normal direction.
In this configuration, when the user's video is captured in an imaging direction different from the normal direction of the reference plane and a three-dimensional video piece is generated from that video, the piece is converted into the three-dimensional video piece as virtually seen from the normal direction. The three-dimensional video of the user is then obtained using the converted pieces. The three-dimensional video obtained in this way is video as seen from the normal direction and is displayed properly on the display. Specifically, it suppresses the impression that the three-dimensional video of the user is bent near the portions where the three-dimensional video pieces are joined.
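The conversion to the normal direction can be sketched as a rotation of the three-dimensional points of a video piece. The example assumes that the camera's imaging direction differs from the reference-plane normal only by a yaw angle about the vertical axis, and that points are given as (x right, y up, z forward) in the camera frame. The function name and the angle convention are hypothetical and are not taken from the patent.

```python
import math


def to_normal_view(points, camera_yaw_deg):
    """Express camera-frame points in a frame whose z axis is the reference
    plane normal, for a camera yawed by camera_yaw_deg about the vertical axis.

    A point on the camera's optical axis at distance r maps to
    (r * sin(yaw), y, r * cos(yaw)) in the normal-aligned frame.
    """
    yaw = math.radians(camera_yaw_deg)
    c, s = math.cos(yaw), math.sin(yaw)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in points]


# A point 2 m straight ahead of a camera that is turned 30 degrees off the normal.
converted = to_normal_view([(0.0, 0.5, 2.0)], 30.0)
```

After every piece has been expressed in the same normal-aligned frame, the pieces share one viewing direction, which is what lets them be joined without an apparent bend at the seams.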

According to the video display method of the present invention, the above problem is also solved by a method in which: (A) a computer acquires video of a user captured by an imaging device; (B) the computer acquires, for each video piece obtained by dividing the video into a predetermined number of video pieces, distance data indicating the distance between the imaging device and the object in that video piece; (C) the computer generates three-dimensional video of the user by executing a rendering process using the user's video and the distance data; and (D) the computer detects the height of the user's eyes, wherein (E) when the height at which the imaging device is installed differs from the detected eye height, the computer executes the rendering process so as to obtain the three-dimensional video of the user as seen from a virtual viewpoint located at the detected eye height, based on the difference between the two heights and the distance between the imaging device and the user.
With this method, even if the user's eye height differs from the installation height of the imaging device, the line of sight of the person viewing the display can be aligned with that of the person shown on the display (that is, the user). This improves the sense of presence of a conversation held while the user's video is shown on the display.

According to the video display system and the video display method of the present invention, even if the user's eye height differs from the installation height of the imaging device, the line of sight of the person viewing the display (that is, the second user) can be aligned with that of the person shown on the display (that is, the user). In addition, when the second user's face moves sideways, the video shown on the display can be changed to video that reproduces how the user would look if the second user were actually facing the user and viewing him or her from the new face position. Furthermore, when the distance between the imaging device and the user (the depth distance) changes, the display size of the three-dimensional video of the user in the composite video shown on the display is adjusted so that it equals the display size before the depth distance changed. The three-dimensional video of the user can therefore be displayed at the size perceived when actually facing the user (that is, the size of the conversation partner as perceived through one's own vision).
Through these effects, the video display system and the video display method of the present invention can improve the sense of presence (reality) of a conversation held while the user's video is shown on the display.

A diagram showing the configuration of a video display system according to one embodiment of the present invention.
A diagram showing the arrangement of the system components installed in each user's room.
(A) and (B) of FIG. 3 are diagrams showing an example of the display of the present invention.
An explanatory diagram of the video composition procedure.
An explanatory diagram of the procedure for extracting person video from real video.
An explanatory diagram of the procedure for generating three-dimensional video.
A diagram showing the functional configuration of the home server owned by each user.
An explanatory diagram of the procedure for matching the eye-line height in the three-dimensional video of the user, in which (A) shows the video captured from the actual camera position, (B) shows the positional relationship between the camera and the user's line of sight, and (C) shows the video as captured from a virtual camera position.
A diagram showing a configuration example of a conventional video display system, illustrating how the displayed video changes in conjunction with the movement of the person viewing the display.
A diagram schematically showing a situation in which the second user's face has moved sideways.
An explanatory diagram of the depth distances of the user, the background and the foreground.
An explanatory diagram showing the change in the composite video when the transition process is executed, in which (A) shows the composite video before the transition process and (B) shows the composite video after the transition process.
A diagram showing a configuration example of a conventional video display system, illustrating how the display size of a user's video changes according to the user's depth distance.
An explanatory diagram of the adjustment of the video display size, in which (A) shows the composite video before the user's depth distance changes and (B) shows the composite video after the size has been adjusted following the change in depth distance.
A diagram showing the video display flow (part 1).
A diagram showing the video display flow (part 2).
A diagram showing the procedure for obtaining three-dimensional video of a person.
A diagram schematically showing how the user's video is captured by a plurality of cameras.
A diagram showing the three-dimensional video pieces generated for each camera and the three-dimensional video formed by combining them.
A diagram showing the procedure for obtaining three-dimensional video of a person in a modification.
An explanatory diagram of the second transition process, in which (A) shows the composite video before the second transition process and (B) shows the composite video after the second transition process.

An embodiment of the present invention (hereinafter, the present embodiment) is described below with reference to the drawings. The video display system according to the present embodiment (hereinafter, the system S) is used so that users in rooms remote from each other can converse while viewing each other's figure (video). More specifically, a display serving as a video display device is installed in the room where each user is present, and the other party's video is shown on this display. Each user can thus regard the display as a pane of glass (for example, window glass or door glass) and feel as if he or she were conversing face to face with the other party through the glass.

The system S is intended to be used while each user is at home. That is, the system S is used so that each user can converse with a conversation partner while staying at home (a simulated face-to-face conversation, hereinafter simply called a "face-to-face conversation"). The system S is not limited to this use, however, and may be used when a user is somewhere other than home, for example at a meeting place, a commercial facility, a school classroom, a cram school, a public facility such as a hospital, a company, or an office. The system S may also be used for a face-to-face conversation between people in separate rooms of the same building.

In what follows, to make the explanation of the system S easier to follow, a case in which two users hold a face-to-face conversation using the system S is taken as an example, with one user called A and the other called B. The configuration of the system S is described from B's viewpoint, that is, from the standpoint of the person viewing A's video. In other words, A corresponds to the "user" and B corresponds to the "second user". Note that "user" and "second user" are relative concepts that switch depending on who is viewing and who is being viewed. From A's viewpoint, therefore, B corresponds to the "user" and A corresponds to the "second user".

<< Basic configuration of this system >>
First, the basic configuration of the system S is described. The system S is used by two users (namely, Mr. A and Mr. B) to hold a face-to-face conversation while viewing each other's video. More specifically, it displays a life-size video of the conversation partner to each user and plays back the partner's voice. To obtain this audiovisual effect, each user has a communication unit 100. In other words, the system S is made up of the communication units 100 held by the users.

Next, the configuration of the communication unit 100 is described with reference to FIG. 1. FIG. 1 shows the configuration of the system S, more specifically the configuration of each communication unit 100. The main components of each communication unit 100 are a home server 1, a camera 2 serving as an imaging device, a microphone 3 serving as a sound collection device, an infrared sensor 4, a display 5 serving as a video display, and speakers 6. Of these, the camera 2, the microphone 3, the infrared sensor 4, the display 5, and the speakers 6 are placed in a predetermined room of each user's home (for example, the room used for face-to-face conversations).

The home server 1 is the core device of the system S and is a computer, specifically a server computer that constitutes a home gateway. Its configuration is well known: it comprises a CPU, memory such as ROM and RAM, a communication interface, a hard disk drive, and so on.

A program for executing the data processing needed to realize a face-to-face conversation (hereinafter, the dialogue program) is installed on the home server 1. The dialogue program incorporates a program for displaying three-dimensional video. This program builds and displays three-dimensional video using three-dimensional computer graphics (hereinafter, 3DCG) and is what is called a renderer. The 3DCG renderer can also combine multiple three-dimensional videos. When a video obtained by combining multiple three-dimensional videos, that is, a composite video, is shown on the display 5, the individual three-dimensional videos appear to be located at different positions in the depth direction of the display 5.

The home server 1 is also connected to communication devices through an external communication network GN such as the Internet so that it can communicate with them. That is, the home server 1 of Mr. A's communication unit 100 communicates with the home server 1 of Mr. B's communication unit 100 over the external communication network GN, and the two servers exchange various data. The data exchanged by the home servers 1 is the data needed for a face-to-face conversation, for example video data representing each user's video and audio data representing each user's voice.

The camera 2 is a known network camera and captures video of subjects within its imaging range (angle of view). Here, "video" is made up of a set of consecutive frame images (RGB images); in the following description, however, the term covers individual frame images as well as the set of frame images. In this embodiment the imaging range of the camera 2 is fixed. Therefore, while it is running, the camera 2 always captures video of the same predetermined area of the space in which it is installed.

The camera 2 outputs a signal representing the captured video (a video signal) to the home server 1 belonging to the same communication unit 100 as the camera 2. The number of cameras 2 is not particularly limited, but in this embodiment each communication unit 100 has only one camera 2 in consideration of cost.

The lens of the camera 2 faces the surface of the display 5 on which the display screen is formed. The panel of the display 5 that constitutes this surface (strictly, the touch panel 5a, which corresponds to the mirror surface portion) is made of transparent glass. Therefore, as shown in FIG. 2, the camera 2 captures, through the panel, video of a person standing in front of the panel. FIG. 2 shows where the various devices that make up the system S are placed in Mr. A's room and Mr. B's room. The camera 2 may instead be placed at a position away from the display 5.

When the person who is the subject stands in front of the display 5 at a predetermined distance from it, the camera 2 can capture a whole-body image of that person from face to feet. The "whole-body image" may be of a standing posture or of a sitting posture. The "video of the whole-body image" also includes video in which part of the body is hidden by an object placed in front of the person.

In the system S, the camera 2 is installed about 1 m above the floor. Therefore, when the height of a person standing in front of the display 5 (strictly, the person's eye level) is above the installation position of the camera 2, the camera 2 captures the person's face from below. The height at which the camera 2 is installed (in other words, its position in the vertical direction) is not particularly limited and can be set to any height.

The microphone 3 collects sound in the room in which it is installed and outputs the resulting audio signal to the home server 1 (strictly, the home server 1 belonging to the same communication unit 100 as the microphone 3). In this embodiment, the microphone 3 is installed directly above the display 5, as shown in FIG. 2.

The infrared sensor 4 is what is called a depth sensor: it measures the depth of a measurement object (corresponding to the object) by an infrared method. Specifically, the infrared sensor 4 emits infrared light from a light emitting unit 4a toward the measurement object and measures the depth by receiving the reflected light at a light receiving unit 4b. In more detail, the light emitting unit 4a and the light receiving unit 4b face the surface of the display 5 on which the display screen is formed. A film that transmits infrared light is attached to the part of the touch panel 5a of the display 5 located immediately in front of the infrared sensor 4. Infrared light emitted from the light emitting unit 4a and reflected by the measurement object passes through this film and is then received by the light receiving unit 4b.

In the system S, the "depth" that is measured is the distance from the camera 2 (strictly, from the lens surface of the camera 2) to the measurement object, that is, the depth distance. For this reason, the light receiving position of the light receiving unit 4b of the infrared sensor 4 is set to coincide with the position of the lens surface of the camera 2 in the depth direction of the display 5 (strictly, the direction normal to the display screen).

In the system S, a depth measurement is obtained for each of the image pieces (pixels) into which the video captured by the camera 2 is divided. Collecting the per-pixel measurements for one video yields the depth data (corresponding to distance data) for that video. The depth data specifies, for each pixel of the video captured by the camera 2 (strictly, of each frame image), the measurement result of the infrared sensor 4, that is, the depth. In other words, the depth data for a video is the depth map of that video: the group of pixels that corresponds to an object in the captured video holds that object's depth distance (depth value). For example, as shown in FIG. 5, described later, the background and the things in front of it are at different depth distances, so the pixels corresponding to each are clearly distinguishable. In FIG. 5, the black pixels correspond to the background, the hatched pixels correspond to an object in front of the background, and the white pixels correspond to a person further in front.
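
As an illustration only, a depth map of this kind can be held as one depth value per pixel. The following minimal Python sketch labels each pixel by depth band in the manner of FIG. 5. The function name `label_pixels`, the sample values, and both thresholds are assumptions made for this example; none of them appears in the specification, which identifies the person by a skeleton model rather than by fixed thresholds.

```python
# Hypothetical depth map: one depth value (in metres) per pixel, row by row.
depth_map = [
    [3.0, 3.0, 3.0, 3.0],
    [3.0, 2.0, 1.2, 3.0],
    [3.0, 2.0, 1.2, 3.0],
]


def label_pixels(depth_map, person_max=1.5, object_max=2.5):
    """Label every pixel by depth band (both thresholds are assumed values)."""
    labels = []
    for row in depth_map:
        labelled_row = []
        for depth in row:
            if depth <= person_max:
                labelled_row.append("person")      # white pixels in FIG. 5
            elif depth <= object_max:
                labelled_row.append("object")      # hatched pixels in FIG. 5
            else:
                labelled_row.append("background")  # black pixels in FIG. 5
        labels.append(labelled_row)
    return labels


labels = label_pixels(depth_map)
```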

Using such depth data, the video of a person can be extracted from a video. The method of extracting a person video using depth data is described later. In the system S, the position of a person can also be identified from the depth data. This is not a limitation, however; for example, a position detection sensor may be installed separately from the infrared sensor 4, and the person's position may be identified from its detection result.

The speakers 6 emit the sound (playback sound) that is reproduced when the home server 1 decodes audio data, and are known speakers. In this embodiment, as shown in FIG. 2, a plurality of speakers 6 (four in FIG. 2) are installed at positions on either side of the display 5 in its width direction.

The display 5 forms the screen on which video is displayed. Specifically, the display 5 has a panel made of transparent glass and forms the display screen on the front surface of that panel. In the system S, this panel is the touch panel 5a, which accepts operations (touch operations) performed by the user.

The panel is large enough to display a whole-body video of a person. In a face-to-face conversation using the system S, the whole-body video of the conversation partner is displayed at life size on the display screen formed on the front surface of the panel. That is, Mr. A's whole-body video can be displayed at life size on Mr. B's display 5. As a result, Mr. B, looking at the display screen, feels as if he were meeting Mr. A in person, in particular as if they were facing each other through a pane of glass.

Furthermore, the display 5 of the system S normally functions as a piece of furniture placed in the room, specifically a full-length mirror, and forms the display screen only during a face-to-face conversation. The configuration of the display 5 is described in detail below with reference to FIGS. 3(A) and 3(B). FIGS. 3(A) and 3(B) show a configuration example of the display 5 used in the system S; (A) shows the state when no conversation is taking place, and (B) shows the state during a face-to-face conversation.

The touch panel 5a of the display 5 constitutes part of a full-length mirror placed in the room where the face-to-face conversation takes place, specifically its mirror surface portion. As shown in FIG. 3(A), the touch panel 5a does not form a display screen when no conversation is taking place, that is, while no video is being displayed. In other words, the display 5 of the system S looks like a full-length mirror when no conversation is taking place. When a face-to-face conversation starts, the touch panel 5a forms a display screen on its front surface. The display 5 then shows the video of the conversation partner and the background on the front surface of the touch panel 5a, as shown in FIG. 3(B).

Incidentally, the home server 1 switches the display screen on and off according to the measurement result of the infrared sensor 4. In more detail, when a user stands in front of the display 5 to start a face-to-face conversation, the camera 2 captures video that includes the user (hereinafter, the real video) and the infrared sensor 4 measures the depth. Depth data for the real video is thereby acquired, and the home server 1 uses it to determine the distance between the user and the camera 2, that is, the depth distance. When this depth distance is equal to or less than a predetermined distance, the home server 1 controls the display 5 to form a display screen on the front surface of the touch panel 5a. As a result, the touch panel 5a, which until then functioned as a full-length mirror, functions as a screen for displaying video. Conversely, when the depth distance becomes equal to or greater than the predetermined distance, the home server 1 controls the display 5 to turn off the display screen. The display 5 then functions as a full-length mirror again.
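
The switching rule in this paragraph might be sketched as follows. This is a minimal Python illustration, not the home server's actual logic: the names `user_distance` and `display_should_be_on`, the use of the median depth of the user's pixels as the depth distance, and the 1.0 m threshold are all assumptions. The specification gives no value for the predetermined distance and does not say how the boundary case (distance exactly equal to the threshold) is resolved; the sketch treats it as "on".

```python
from statistics import median


def user_distance(depth_map, person_mask):
    """Depth distance of the user: median depth over the user's pixels (assumed rule)."""
    depths = [
        depth
        for depth_row, mask_row in zip(depth_map, person_mask)
        for depth, is_person in zip(depth_row, mask_row)
        if is_person
    ]
    return median(depths) if depths else None


def display_should_be_on(distance_m, threshold_m=1.0):
    """True when a user is detected at or within the (assumed) threshold distance."""
    return distance_m is not None and distance_m <= threshold_m
```

With no user in view `user_distance` returns `None`, so the panel stays in its mirror state.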

As described above, in the system S the display 5 is used as a full-length mirror when no conversation is taking place, which makes the presence of the display screen hard to notice. During a face-to-face conversation, on the other hand, the display screen is formed and the video of the conversation partner is displayed, giving the user the visual impression of talking with the partner through a pane of glass. A known configuration can be used for the combined video screen and full-length mirror, for example the one described in International Publication No. 2009/122716. The display 5 is not limited to one that doubles as a full-length mirror; any device large enough to display the whole-body video of the conversation partner may be used. From the standpoint of making the display screen hard to notice when no conversation is taking place, furniture or building materials that are installed in the room used for face-to-face conversations and that have a mirror surface portion are suitable; for example, a door (glass door) or a window (glass window) may be used as the display 5. Nor is the display 5 limited to one that doubles as a building material such as a door or window or as furniture such as a full-length mirror; it may be an ordinary display that forms a display screen at all times while it is running.

<< About video composition >>
In a face-to-face conversation using the system S, Mr. A's video and its background video are displayed on Mr. B's display 5, and Mr. B's video and its background video are displayed on Mr. A's display 5. The person video and the background video shown on each display 5 are not captured by the camera 2 at the same time; they are captured at different times. That is, each display 5 shows a composite video obtained by combining a person video and a background video captured at different times. In addition to the person video and the background video, the system S also combines a foreground video into the composite video that is displayed.

The procedure for video composition is outlined below with reference to FIG. 4, which is an explanatory diagram of the procedure. The following description uses the case of combining Mr. A's video, a background video, and a foreground video as a concrete example.

Of the videos to be combined, the background video (denoted by the symbol Pb in FIG. 4) is video of the area, within the imaging range of the camera 2, of the room that Mr. A uses for face-to-face conversations. In this embodiment, the camera 2 captures the background video when Mr. A is not in that room; that is, the background video is captured on its own. The time at which the background video is captured can be set freely as long as it falls within a period when Mr. A is not in the room.

The person video (specifically, Mr. A's video, denoted by the symbol Pu in FIG. 4), on the other hand, is captured while Mr. A is in the room, strictly speaking within the imaging range of the camera 2. The video captured by the camera 2 at that time (the real video) contains a background video and a foreground video in addition to the person video. The system S therefore extracts the person video from the real video and uses it. The method of extracting the person video from the real video is not particularly limited; one example is to use the depth data described above. The method of extracting a person video using depth data is described below with reference to FIG. 5, which is an explanatory diagram of the procedure for extracting a person video from a captured video. For convenience of illustration, the pixels that make up the depth data are drawn coarser in FIG. 5 than they actually are.

While the camera 2 is capturing video, the infrared sensor 4 measures the depth of the measurement objects within the angle of view of the camera 2. Depth data for the real video is obtained as a result. The depth data for the real video specifies the measurement result of the infrared sensor 4, that is, the depth, for each of the pixels into which a frame image of the real video is divided. In this depth data, as shown in FIG. 5, the depth of the pixels belonging to the person video (white pixels in the figure) clearly differs from that of the pixels belonging to the rest of the video (black and hatched pixels in the figure).

Next, Mr. A's skeleton model is identified based on the depth data and the video captured by the camera 2 (strictly, information for locating the image of Mr. A's face within the captured video). The skeleton model is a simplified model of position information about Mr. A's skeleton (specifically the head, shoulders, elbows, wrists, center of the upper body, waist, knees, and ankles), as shown in FIG. 5. A known method can be used to obtain the skeleton model; for example, a method similar to those adopted in the inventions described in Japanese Unexamined Patent Application Publication No. 2014-155693 and No. 2013-116311 may be used.

After the skeleton model has been identified, the person video is extracted from the real video based on it. This specification omits the details of the technique for extracting a person video from a real video based on a skeleton model, but the rough procedure is as follows. First, the group of pixels in the depth data that belongs to Mr. A's person video is identified based on the skeleton model. Then the region corresponding to that pixel group is extracted from the real video. The video extracted in this way is Mr. A's person video within the real video.
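
Because the specification leaves the extraction technique unspecified, the following Python sketch substitutes one simple possibility purely for illustration: region growing over the depth map, seeded at the pixel positions of the skeleton joints, so that neighbouring pixels of similar depth are collected into the person's pixel group. The function name `person_mask_from_joints`, the representation of joints as (row, column) pairs, and the 0.3 m tolerance are assumptions and are not taken from the specification or from the cited publications.

```python
from collections import deque


def person_mask_from_joints(depth_map, joints, tolerance=0.3):
    """Grow a pixel mask outward from skeleton joints over pixels of similar depth."""
    rows, cols = len(depth_map), len(depth_map[0])
    mask = [[False] * cols for _ in range(rows)]
    queue = deque()
    for r, c in joints:  # every joint position is a seed pixel
        if not mask[r][c]:
            mask[r][c] = True
            queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and not mask[nr][nc]:
                # Accept a neighbour only if its depth is close to the current pixel's.
                if abs(depth_map[nr][nc] - depth_map[r][c]) <= tolerance:
                    mask[nr][nc] = True
                    queue.append((nr, nc))
    return mask
```

The resulting mask marks the pixel group; the matching region of the RGB frame is then the person video.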

In the system S, the foreground video (denoted by the symbol Pf in FIG. 4) is also extracted from the real video and used, in the same way as the person video. The method of extracting the foreground video from the real video is not particularly limited; one possible method uses the depth data, as for the person video. Specifically, the group of pixels whose depth distance is smaller than that of the pixels belonging to the person video is identified in the depth data for the real video. The part of the real video corresponding to that pixel group is then extracted as the foreground video.
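
The depth comparison described here can be sketched in a few lines of Python. In this illustration a pixel is treated as foreground when it is not part of the person and its depth is smaller than that of the nearest person pixel; that reading of "smaller than the pixels belonging to the person video", the function names `foreground_mask` and `extract`, and the use of `None` for pixels outside the extracted region are assumptions.

```python
def foreground_mask(depth_map, person_mask):
    """Mark non-person pixels that are nearer to the camera than every person pixel."""
    person_depths = [
        depth
        for depth_row, mask_row in zip(depth_map, person_mask)
        for depth, is_person in zip(depth_row, mask_row)
        if is_person
    ]
    if not person_depths:
        return [[False] * len(row) for row in depth_map]
    nearest_person = min(person_depths)
    return [
        [(not is_person) and depth < nearest_person
         for depth, is_person in zip(depth_row, mask_row)]
        for depth_row, mask_row in zip(depth_map, person_mask)
    ]


def extract(frame, mask):
    """Keep the masked pixels of a frame; everything else becomes None (transparent)."""
    return [
        [pixel if keep else None for pixel, keep in zip(frame_row, mask_row)]
        for frame_row, mask_row in zip(frame, mask)
    ]
```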

After the person video and the foreground video have been extracted from the real video by the procedure described so far, the background video, the person video, and the foreground video are combined. Specifically, the part of the background video captured by the camera 2 that will actually be shown on the display 5 (the range enclosed by the broken line in FIG. 4; hereinafter, the display range) is set. The display range is the part of the captured background video that is included in the composite video. Its size is determined according to the size of the display 5. In this embodiment, the initial (default) display range is set to the central part of the background video. The initial display range is not particularly limited, however, and may be a part other than the center.
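
As a minimal Python sketch of the default display range, the following computes a rectangle centered in the background frame and crops the frame to it. The names `default_display_range` and `crop`, the (left, top, width, height) rectangle layout, and the assumption that the display range is expressed in pixels of the background frame are choices made for this example only.

```python
def default_display_range(bg_width, bg_height, view_width, view_height):
    """Return (left, top, width, height) of a display range centered in the background."""
    left = (bg_width - view_width) // 2
    top = (bg_height - view_height) // 2
    return left, top, view_width, view_height


def crop(image, rect):
    """Cut the display range out of an image stored as a list of pixel rows."""
    left, top, width, height = rect
    return [row[left:left + width] for row in image[top:top + height]]
```

A non-default display range would simply use different `left` and `top` values with the same size.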

The display range of the background video, the extracted person video, and the extracted foreground video are then combined to obtain the composite video (denoted by the symbol Pm in FIG. 4). As a result, as shown in FIG. 4, Mr. B's display 5 shows a video in which Mr. A is positioned in front of the background and the foreground is positioned in front of Mr. A.
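
The front-to-back ordering of FIG. 4 can be illustrated with a simple two-dimensional layering sketch in Python: the background display range is drawn first, then the person, then the foreground. This is only a flat approximation written for this example; the system itself composes three-dimensional videos in the renderer. The function name `composite` and the convention that `None` marks a transparent pixel in the upper layers are assumptions.

```python
def composite(background, person, foreground):
    """Layer three equally sized images back to front; None means transparent."""
    result = [row[:] for row in background]  # start from a copy of the background
    for layer in (person, foreground):       # later layers cover earlier ones
        for r, row in enumerate(layer):
            for c, pixel in enumerate(row):
                if pixel is not None:
                    result[r][c] = pixel
    return result
```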

As described above, the system S shows a composite video on the display 5. With this configuration, the display position, display size, and so on can be adjusted separately for the person video, the background video, and the foreground video. For example, the display size of Mr. A's video, which is the person video, can be adjusted without changing the display size of the background video or the foreground video.

In the system S, the display size of Mr. A's video is adjusted to match Mr. A's actual size (life size). Mr. A's video is therefore displayed at life size on Mr. B's display 5, which further heightens the sense of presence in a face-to-face conversation using the system S. The display size of the person video is not limited to life size, however. Here, life size means the size obtained when a person video captured while the person is in front of the camera 2, at a predetermined distance from it (specifically, the distance d1 in FIG. 10B, described later; hereinafter, the reference distance), is displayed at its captured size without scaling. The reference distance d1 is set in advance and stored in the memory of the home server 1.
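
One way to read this definition is sketched below in Python. It assumes a simple pinhole-camera model, which this passage does not state: the apparent size of a person in the frame is inversely proportional to the distance from the camera, so a person captured at depth distance d would be scaled by d / d1 to appear as if captured at the reference distance d1. The function name `life_size_scale` and the sample distances are hypothetical.

```python
def life_size_scale(person_depth_m, reference_distance_m):
    """Scale factor that maps a person captured at person_depth_m to life size.

    Assumes apparent size is inversely proportional to distance (pinhole model).
    """
    if person_depth_m <= 0 or reference_distance_m <= 0:
        raise ValueError("distances must be positive")
    return person_depth_m / reference_distance_m


# A person standing twice as far away as the reference distance appears half
# as large in the frame, so the person video is enlarged by a factor of 2.
scale = life_size_scale(person_depth_m=2.0, reference_distance_m=1.0)
```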

<< About three-dimensional video generation >>
In the system S, three-dimensional video is shown on the display 5. More specifically, as explained in the preceding section, the display 5 shows a composite video obtained by combining a background video, a person video, and a foreground video, and each of the combined videos is a three-dimensional video. A three-dimensional video is obtained by executing a 3DCG rendering process using a two-dimensional video captured by the camera 2 (specifically, a video consisting of frame images in RGB format) and the depth data for that video. Strictly speaking, the rendering process here is a video display process of the surface rendering type, which generates the three-dimensional video as seen from a virtually set viewpoint.

In the system S, the rendering process uses texture mapping. The procedure for generating a three-dimensional video is described below with reference to FIG. 6, which is an explanatory diagram of that procedure. For convenience of illustration, the mesh model in the figure is drawn coarser than the actual mesh size. The following description uses the case of generating Mr. A's three-dimensional video as an example.

カメラ２が撮像したＡさんの映像（厳密には、実映像から抽出されたＡさんの映像）は、二次元映像であり、テクスチャマッピングにおいてテクスチャとして用いられる。一方、Ａさんの映像を含む実映像について取得された深度データ（すなわち、デプスマップ）は、三次元映像の骨格をなすメッシュモデルを構築するために用いられる。ここで、メッシュモデルは、ポリゴンメッシュにて人物（Ａさん）を表現したものである。なお、深度データ（デプスマップ）からメッシュモデルを構築する方法については、公知の方法を利用することが可能である。 The video of Mr. A captured by the camera 2 (strictly speaking, the video of Mr. A extracted from the actual video) is a two-dimensional video and is used as a texture in texture mapping. On the other hand, the depth data (that is, the depth map) acquired for the actual video including the video of Mr. A is used to construct a mesh model that forms the skeleton of the 3D video. Here, the mesh model represents a person (Mr. A) with a polygon mesh. As a method of constructing a mesh model from depth data (depth map), a known method can be used.
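The text leaves the construction of the mesh model to "a known method". As a minimal sketch of one such method, assuming the depth map is a regular grid and ignoring camera intrinsics (vertices are placed orthographically, which is a simplification not stated in the source), each 2x2 block of valid depth pixels can be split into two triangles. The function name `depth_map_to_mesh`, the `pixel_pitch` parameter, and the use of `None` for missing measurements are hypothetical.

```python
def depth_map_to_mesh(depth, pixel_pitch=1.0):
    """Build a triangle mesh from a depth map given as a list of rows.

    Assumption: None marks a pixel without a depth measurement. Such pixels
    produce no vertex, and any grid cell touching one produces no triangle.
    Returns (vertices, uvs, triangles); uvs are the texture coordinates used
    to paste the 2D video onto the mesh.
    """
    rows, cols = len(depth), len(depth[0])
    vertices, uvs, index = [], [], {}
    for r in range(rows):
        for c in range(cols):
            z = depth[r][c]
            if z is None:
                continue
            index[(r, c)] = len(vertices)
            vertices.append((c * pixel_pitch, r * pixel_pitch, z))
            u = c / (cols - 1) if cols > 1 else 0.0
            v = r / (rows - 1) if rows > 1 else 0.0
            uvs.append((u, v))

    triangles = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            quad = [(r, c), (r, c + 1), (r + 1, c), (r + 1, c + 1)]
            if all(p in index for p in quad):
                a, b, d, e = (index[p] for p in quad)
                triangles.append((a, b, d))  # upper-left triangle of the cell
                triangles.append((b, e, d))  # lower-right triangle of the cell
    return vertices, uvs, triangles
```

A real system would more likely back-project each pixel through the camera model before triangulating; the grid layout here only illustrates how depth values become polygon vertices and how texture coordinates follow the pixel positions.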

そして、メッシュモデルが得られた後、図６に示すように、当該メッシュモデルにテクスチャとしての二次元映像（具体的にはＡさんの映像）を貼り付けることで立体的なＡさんの映像、すなわち、奥行感を有する三次元映像を生成することが可能となる。このようなテクスチャマッピングにて三次元映像が生成され、さらに移動や回転等のプロセッシングを行うことで視点を変えたときの三次元映像を取得することが可能となる。これにより、Ａさんの顔を下方から見たときの三次元映像や、Ａさんの顔を側方から見たときの三次元映像を取得することも可能となる。 Then, after the mesh model is obtained, as shown in FIG. 6, a stereoscopic image of Mr. A is obtained by pasting a two-dimensional image (specifically, an image of Mr. A) as a texture to the mesh model. That is, it becomes possible to generate a 3D image having a sense of depth. A 3D image is generated by such texture mapping, and further, by performing processing such as movement and rotation, it is possible to acquire a 3D image when the viewpoint is changed. As a result, it is also possible to acquire a 3D image when the face of Mr. A is viewed from below and a 3D image when the face of Mr. A is viewed from the side.

また、背景や前景についても、人物の場合と同様の手順により、三次元映像を生成することが可能である。つまり、カメラ２が撮像した背景映像と、背景映像について取得された深度データと、を用いてテクスチャマッピングによるレンダリング処理を実行することで、背景の三次元映像が取得される。また、カメラ２が撮像した前景映像（厳密には、実映像から抽出した前景映像）と、前景映像について取得された深度データ（厳密には、前景映像を含む実映像についての深度データ）と、を用いてテクスチャマッピングによるレンダリング処理を実行することで、前景の三次元映像が取得される。 A 3D video can also be generated for the background and the foreground by the same procedure as for a person. That is, a 3D video of the background is acquired by executing a rendering process based on texture mapping using the background video captured by the camera 2 and the depth data acquired for the background video. Likewise, a 3D video of the foreground is acquired by executing a rendering process based on texture mapping using the foreground video captured by the camera 2 (strictly speaking, the foreground video extracted from the real video) and the depth data acquired for the foreground video (strictly speaking, the depth data for the real video including the foreground video).

なお、本システムＳでは、テクスチャマッピングを利用しているが、三次元映像を取得するためのレンダリング処理については、テクスチャマッピングを利用したものに限られず、例えばバンプマッピングを利用したレンダリング処理であってもよい。 Although the present system S uses texture mapping, the rendering process for acquiring a 3D video is not limited to one using texture mapping and may be, for example, a rendering process using bump mapping.

また、深度データにおいては、欠損部分、すなわち、何らかの理由によって深度の計測結果が得られない画素が生じる虞がある。特に、人物映像と背景映像との境界付近（エッジ付近）では欠損部分が発生し易い。このように欠損部分が生じた場合には、欠損部分の位置が特定できるのであれば、テクスチャマッピングにおいて当該欠損部分に対してテクスチャである二次元映像をそのまま貼ればよい。あるいは、その周辺の映像を貼ってもよい。また、深度データを構成する画素のうち、人物映像と対応している画素群において、そのエッジ付近に欠損部分が生じた場合には、テクスチャマッピングにおいて上記の画素群よりも一回り大きい画素群を抽出し、当該画素群に対応する二次元映像を貼ればよい。 Further, the depth data may contain missing portions, that is, pixels for which no depth measurement result is obtained for some reason. Missing portions are particularly likely to occur near the boundary (edge) between the person video and the background video. When a missing portion occurs and its position can be specified, the two-dimensional video serving as the texture may be pasted as it is onto the missing portion in texture mapping. Alternatively, the video of the surrounding area may be pasted. In addition, when a missing portion occurs near the edge of the pixel group that corresponds to the person video among the pixels constituting the depth data, a pixel group one size larger than that pixel group may be extracted in texture mapping, and the two-dimensional video corresponding to the extracted pixel group may be pasted.
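The "pixel group one size larger" step can be illustrated as a one-pixel dilation of the person mask. This is a sketch under assumptions: the source does not say how much larger the group is or which neighborhood is used, so the 4-neighbor, one-pixel growth and the name `dilate_mask` are hypothetical.

```python
def dilate_mask(mask):
    """Grow a boolean mask (list of rows) by one pixel up, down, left, right.

    Assumption: True marks pixels belonging to the person pixel group. The
    grown mask also covers edge pixels whose depth measurement is missing,
    so the 2D video can still be pasted there.
    """
    rows, cols = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    out[rr][cc] = True
    return out
```

Applying the function repeatedly widens the margin further, at the cost of pulling more background pixels into the person texture.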

＜＜ホームサーバの機能について＞＞ << About home server functions >>
次に、ホームサーバ１の機能、特に、映像表示処理に関する機能について説明する。なお、Ａさん側のホームサーバ１及びＢさん側のホームサーバ１の双方は、同様の機能を有しており、対面対話の実施にあたり双方向通信して同様のデータ処理を実行する。このため、以下では、一方のホームサーバ１（例えば、Ｂさん側のホームサーバ１）の機能のみを説明することとする。 Next, the functions of the home server 1, in particular those related to video display processing, will be described. The home server 1 on Mr. A's side and the home server 1 on Mr. B's side have the same functions and, in carrying out the face-to-face dialogue, communicate bidirectionally and execute the same data processing. For this reason, only the functions of one home server 1 (for example, the home server 1 on Mr. B's side) will be described below.

ホームサーバ１は、同装置のＣＰＵが対話用プログラムを実行することでホームサーバ１としての機能を発揮し、具体的には、対面対話に関する一連のデータ処理を実行する。ここで、図７を参照しながら、ホームサーバ１の構成をその機能面、特に映像表示機能の観点から説明する。図７は、ホームサーバ１の構成を機能面から示した図である。 The home server 1 functions as the home server 1 when the CPU of the apparatus executes a dialogue program, and specifically executes a series of data processing related to a face-to-face dialogue. Here, with reference to FIG. 7, the configuration of the home server 1 will be described from the viewpoint of its function, particularly the video display function. FIG. 7 is a diagram showing the configuration of the home server 1 in terms of functions.

ホームサーバ１は、図７に示すように、データ送信部１１、データ受信部１２、背景映像記憶部１３、第１深度データ記憶部１４、実映像記憶部１５、人物映像抽出部１６、骨格モデル記憶部１７、第２深度データ記憶部１８、前景映像抽出部１９、高さ検知部２０、三次元映像生成部２１、合成映像表示部２２、判定部２３及び顔移動検知部２４を備える。これらのデータ処理部は、それぞれ、ホームサーバ１のハードウェア機器（具体的には、ＣＰＵ、メモリ、通信用インタフェース及びハードディスクドライブ等）がソフトウェアとしての対話用プログラムと協働することによって実現される。以下、各データ処理部について説明する。 As shown in FIG. 7, the home server 1 includes a data transmission unit 11, a data receiving unit 12, a background video storage unit 13, a first depth data storage unit 14, a real video storage unit 15, a person video extraction unit 16, a skeleton model storage unit 17, a second depth data storage unit 18, a foreground video extraction unit 19, a height detection unit 20, a 3D video generation unit 21, a composite video display unit 22, a determination unit 23, and a face movement detection unit 24. Each of these data processing units is realized by the hardware devices of the home server 1 (specifically, the CPU, memory, communication interface, hard disk drive, and the like) cooperating with the dialogue program as software. Each data processing unit is described below.

データ送信部１１は、Ｂさん側のカメラ２が撮像した映像の信号をデジタル化し、映像データとしてＡさん側のホームサーバ１へ送信する。ここで、データ送信部１１が送信する映像データの種類は、２種類に分類される。一つは、背景映像の映像データであり、具体的には、背景に相当する部屋内にＢさんが居ないときに撮像された同室の映像（厳密には、カメラ２の撮像範囲内にある領域の映像）を示すデータである。もう一つは、実映像の映像データであり、Ｂさんが上記部屋に在室している間に撮像された映像、より具体的にはＢさん及びその背景や前景の映像を示すデータである。 The data transmission unit 11 digitizes the video signal captured by the B-side camera 2 and transmits it as video data to the A-side home server 1. Here, the types of video data transmitted by the data transmission unit 11 are classified into two types. One is the video data of the background video, specifically, the video of the same room captured when Mr. B is not in the room corresponding to the background (strictly, it is within the imaging range of the camera 2). This is data indicating the image of the area. The other is actual video data, which is an image captured while Mr. B is in the room, more specifically, data indicating Mr. B and its background and foreground images. .

また、データ送信部１１は、背景映像の映像データを送信するにあたり、赤外線センサ４の計測結果に基づいて、背景映像についての深度データを生成し、当該深度データを背景映像の映像データとともに送信する。この深度データは、背景の三次元映像を取得するためのレンダリング処理を実行する際に用いられると共に、背景とカメラ２との間の距離（奥行距離）を特定する際にも用いられる。同様に、データ送信部１１は、実映像の映像データを送信するにあたり、赤外線センサ４の計測結果に基づいて、実映像についての深度データを生成し、当該深度データを実映像の映像データとともに送信する。この深度データは、実映像から人物映像（具体的にはＢさんの映像）や前景映像を抽出する際に用いられる。また、上記の深度データは、Ｂさんの三次元映像を取得するためのレンダリング処理、及び、前景の三次元映像を取得するためのレンダリング処理のそれぞれの実行時に用いられる。さらに、上記の深度データは、Ｂさんとカメラ２との間の距離（奥行距離）を特定する際にも用いられる。 Further, when transmitting the video data of the background video, the data transmission unit 11 generates depth data for the background video based on the measurement result of the infrared sensor 4 and transmits the depth data together with the video data of the background video. This depth data is used when executing the rendering process for acquiring a 3D video of the background, and also when specifying the distance (depth distance) between the background and the camera 2. Similarly, when transmitting the video data of the real video, the data transmission unit 11 generates depth data for the real video based on the measurement result of the infrared sensor 4 and transmits the depth data together with the video data of the real video. This depth data is used when extracting the person video (specifically, the video of Mr. B) and the foreground video from the real video. It is also used when executing the rendering process for acquiring Mr. B's 3D video and the rendering process for acquiring the 3D video of the foreground. Furthermore, it is used when specifying the distance (depth distance) between Mr. B and the camera 2.

データ受信部１２は、Ａさん側のホームサーバ１から送信されてくる各種データを受信する。データ受信部１２が受信するデータの中には、背景映像の映像データ及び背景映像についての深度データ、並びに、実映像の映像データ及び実映像についての深度データが含まれている。ここで、データ受信部１２が受信する背景映像の映像データは、背景に相当する部屋内にＡさんが居ないときに撮像された同室の映像を示すデータである。このようにデータ受信部１２は、背景映像の映像データを受信することで、Ａさん側のカメラ２が撮像した背景の映像を取得する。かかる意味で、データ受信部１２は、映像取得部に該当すると言える。 The data receiving unit 12 receives various data transmitted from the home server 1 on the A side. The data received by the data receiving unit 12 includes video data of the background video, depth data about the background video, and video data of the real video and depth data about the real video. Here, the video data of the background video received by the data receiving unit 12 is data indicating the video of the same room captured when Mr. A is not in the room corresponding to the background. In this way, the data receiving unit 12 receives the background video data, and thereby acquires the background video captured by the camera A's camera 2. In this sense, it can be said that the data receiving unit 12 corresponds to a video acquisition unit.

また、データ受信部１２が受信する背景映像についての深度データは、背景の三次元映像を取得するためのレンダリング処理を実行する際に用いられると共に、背景とカメラ２との間の距離（奥行距離）を特定する際にも用いられる。なお、以下では、データ受信部１２が受信する背景映像についての深度データを「第１深度データ」と呼ぶこととする。 Further, the depth data for the background video received by the data receiving unit 12 is used when executing the rendering process for acquiring a 3D video of the background, and also when specifying the distance (depth distance) between the background and the camera 2. Hereinafter, the depth data for the background video received by the data receiving unit 12 is referred to as “first depth data”.

また、データ受信部１２が受信する実映像の映像データは、Ａさんが上記部屋に在室している間に撮像されたＡさん、背景及び前景の映像を示すデータである。また、データ受信部１２が受信する実映像についての深度データは、実映像からＡさんの映像や前景映像を抽出する際に用いられる。また、上記の深度データは、Ａさんの三次元映像を取得するためのレンダリング処理、及び、前景の三次元映像を取得するためのレンダリング処理のそれぞれの実行時に用いられる。さらに、上記の深度データは、Ａさんとカメラ２との間の距離（奥行距離）、及び、前景とカメラ２との間の距離（奥行距離）を特定する際にも用いられる。なお、以下では、データ受信部１２が受信する実映像についての深度データを「第２深度データ」と呼ぶこととする。 The video data of the actual video received by the data receiving unit 12 is data indicating the video of Mr. A, the background, and the foreground captured while Mr. A is in the room. The depth data about the actual video received by the data receiving unit 12 is used when extracting the video of A and the foreground video from the actual video. The depth data is used at the time of each of the rendering process for acquiring Mr. A's 3D video and the rendering process for acquiring the 3D video of the foreground. Further, the depth data is also used when specifying the distance between A and the camera 2 (depth distance) and the distance between the foreground and the camera 2 (depth distance). Hereinafter, the depth data regarding the actual video received by the data receiving unit 12 is referred to as “second depth data”.

以上のようにデータ受信部１２は、第１深度データと第２深度データとをＡさん側のホームサーバ１から受信することで、背景映像についての深度データ、人物映像についての深度データ、及び前景映像についての深度データをそれぞれ取得する。かかる意味で、データ受信部１２は、距離データである深度データを取得する距離データ取得部に該当すると言える。 As described above, the data receiving unit 12 receives the first depth data and the second depth data from the home server 1 on the A side, so that the depth data for the background video, the depth data for the person video, and the foreground Obtain depth data for each video. In this sense, it can be said that the data receiving unit 12 corresponds to a distance data acquiring unit that acquires depth data that is distance data.

背景映像記憶部１３は、データ受信部１２が受信した背景映像の映像データを記憶する。第１深度データ記憶部１４は、データ受信部１２が受信した背景映像についての深度データ、すなわち、第１深度データを記憶する。実映像記憶部１５は、データ受信部１２が受信した実映像の映像データを記憶する。 The background video storage unit 13 stores the video data of the background video received by the data receiving unit 12. The first depth data storage unit 14 stores depth data regarding the background video received by the data receiving unit 12, that is, first depth data. The real video storage unit 15 stores the video data of the real video received by the data receiving unit 12.

人物映像抽出部１６は、データ受信部１２が受信した実映像の映像データを展開し、当該実映像から人物映像（すなわち、Ａさんの映像）を抽出する。骨格モデル記憶部１７は、人物映像抽出部１６が人物映像を抽出する際に用いる骨格モデル（具体的には、Ａさんの骨格モデル）を記憶する。第２深度データ記憶部１８は、データ受信部１２が受信した実映像についての深度データ、すなわち第２深度データを記憶する。 The person video extracting unit 16 expands the video data of the real video received by the data receiving unit 12 and extracts the human video (that is, the video of Mr. A) from the real video. The skeleton model storage unit 17 stores a skeleton model (specifically, Mr. A's skeleton model) used when the person video extraction unit 16 extracts a person video. The second depth data storage unit 18 stores depth data on the actual video received by the data receiving unit 12, that is, second depth data.

人物映像抽出部１６は、実映像からＡさんの映像を抽出するにあたり、実映像記憶部１５から実映像を、第２深度データ記憶部１８から実映像についての第２深度データを、それぞれ読み出す。そして、人物映像抽出部１６は、読み出した第２深度データ及びカメラ２の撮像映像からＡさんの骨格モデルを特定する。特定されたＡさんの骨格モデルは、骨格モデル記憶部１７に記憶される。その後、人物映像抽出部１６は、骨格モデル記憶部１７からＡさんの骨格モデルを読み出し、当該骨格モデルに基づいて実映像から人物映像、すなわちＡさんの映像を抽出する。このように人物映像抽出部１６は、実映像から人物映像を抽出することで、Ａさん側のカメラ２が撮像したＡさんの映像を取得する。かかる意味で、人物映像抽出部１６は、映像取得部に該当すると言える。 In extracting the video of Mr. A from the real video, the human video extraction unit 16 reads the real video from the real video storage unit 15 and the second depth data about the real video from the second depth data storage unit 18. Then, the person video extraction unit 16 specifies Mr. A's skeleton model from the read second depth data and the captured video of the camera 2. The identified skeleton model of Mr. A is stored in the skeleton model storage unit 17. Thereafter, the person video extraction unit 16 reads out Mr. A's skeleton model from the skeleton model storage unit 17 and extracts a person video, that is, Mr. A's video from the actual video based on the skeleton model. In this way, the person video extraction unit 16 extracts the person video from the actual video, thereby acquiring the video of Mr. A captured by the camera 2 on the A side. In this sense, it can be said that the person video extraction unit 16 corresponds to a video acquisition unit.

前景映像抽出部１９は、データ受信部１２が受信した実映像の映像データを展開し、当該実映像から前景映像を抽出する。具体的に説明すると、前景映像抽出部１９は、実映像から前景映像を抽出するにあたり、実映像記憶部１５から実映像を、第２深度データ記憶部１８から当該実映像についての第２深度データを、それぞれ読み出す。そして、前景映像抽出部１９は、読み出した第２深度データ中、前景映像と対応する画素群を抽出する。ここで、前景映像と対応する画素群とは、人物映像抽出部１６によって第２深度データから抽出された画素群（すなわち、人物映像と対応する画素群）よりも奥行距離が小さい画素群のことである。その後、前景映像抽出部１９は、実映像記憶部１５から読み出した実映像中、上記の画素群と対応する部分の映像を前景映像として抽出する。このように前景映像抽出部１９は、実映像から前景映像を抽出することで、Ａさん側のカメラ２が撮像した前景映像を取得する。かかる意味で、前景映像抽出部１９は、映像取得部に該当すると言える。 The foreground video extraction unit 19 expands the video data of the real video received by the data receiving unit 12 and extracts the foreground video from the real video. More specifically, the foreground video extraction unit 19 extracts the real video from the real video storage unit 15 and the second depth data about the real video from the second depth data storage unit 18 when extracting the foreground video from the real video. Are read out respectively. Then, the foreground video extraction unit 19 extracts a pixel group corresponding to the foreground video from the read second depth data. Here, the pixel group corresponding to the foreground image is a pixel group having a depth distance smaller than the pixel group extracted from the second depth data by the person image extraction unit 16 (that is, the pixel group corresponding to the person image). It is. Thereafter, the foreground video extraction unit 19 extracts a video of a portion corresponding to the pixel group from the real video read from the real video storage unit 15 as a foreground video. In this way, the foreground video extraction unit 19 extracts the foreground video from the actual video, thereby acquiring the foreground video captured by the camera A's camera 2. In this sense, it can be said that the foreground video extraction unit 19 corresponds to a video acquisition unit.
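The foreground selection rule described above can be sketched as a depth comparison. The source only states that the foreground pixel group has a smaller depth distance than the person pixel group; treating "smaller" as "nearer than the nearest person pixel" is an assumption made here, and the names `foreground_mask`, `depth`, and `person_mask` are hypothetical.

```python
def foreground_mask(depth, person_mask):
    """Mark pixels that lie in front of the person pixel group.

    depth: list of rows of depth distances, None where no measurement exists.
    person_mask: same shape, True for pixels of the person pixel group.
    Assumption: a pixel is foreground when it is not a person pixel and its
    depth is smaller than the smallest depth found in the person pixel group.
    """
    person_depths = [
        d
        for d_row, m_row in zip(depth, person_mask)
        for d, m in zip(d_row, m_row)
        if m and d is not None
    ]
    if not person_depths:
        return [[False] * len(row) for row in depth]
    nearest_person = min(person_depths)
    return [
        [(d is not None) and (not m) and d < nearest_person
         for d, m in zip(d_row, m_row)]
        for d_row, m_row in zip(depth, person_mask)
    ]
```

The part of the real video under the resulting mask would then be taken as the foreground video.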

高さ検知部２０は、Ａさん側のホームサーバ１から受信したデータに基づいて、Ａさんの目の高さを検知する。具体的に説明すると、高さ検知部２０は、第２深度データ記憶部１８から第２深度データを読み出し、読み出した第２深度データ中、人物映像と対応する画素群を抽出する。その後、高さ検知部２０は、抽出した画素群の中から目に相当する画素を特定し、その特定した画素の位置から目の高さを割り出す。そして、目の高さに関する検知結果については、三次元映像生成部２１に引き渡され、三次元映像生成部２１は、当該検知結果に応じた三次元映像（特に、人物の三次元映像）を生成するようになる。かかる内容については、次節にて詳しく説明する。 The height detection unit 20 detects Mr. A's eye height based on the data received from the home server 1 on Mr. A's side. More specifically, the height detection unit 20 reads the second depth data from the second depth data storage unit 18 and extracts, from the read second depth data, the pixel group corresponding to the person video. The height detection unit 20 then specifies the pixels corresponding to the eyes within the extracted pixel group and calculates the eye height from the position of the specified pixels. The detection result for the eye height is passed to the 3D video generation unit 21, which generates a 3D video (in particular, a 3D video of the person) according to the detection result. This is described in detail in the next section.

なお、目の高さを特定する方法については、特に制限されるものではなく、公知の方法を利用することが可能である。具体的に説明すると、本システムＳでは第２深度データに基づいて目の高さを検知することとしたが、これに限定されず、例えば、Ａさんの映像を含む実映像を解析して目の高さを検知してもよい。 The method for specifying the eye height is not particularly limited, and a known method can be used. Specifically, although the present system S detects the eye height based on the second depth data, the method is not limited to this; for example, the eye height may be detected by analyzing the real video including Mr. A's video.

三次元映像生成部２１は、３ＤＣＧのレンダリング処理を実行して三次元映像を取得する。具体的に説明すると、三次元映像生成部２１は、背景映像記憶部１３に記憶された背景映像と、第１深度データ記憶部１４に記憶された背景映像についての第１深度データと、を用いたレンダリング処理を実行して背景の三次元映像を生成する。なお、三次元映像生成部２１は、背景の三次元映像を生成する際、背景映像記憶部１３に記憶された背景映像のうち、直近で取得された背景映像を用いることになっている。同様に、第１深度データ記憶部１４に記憶された第１深度データについても、直近で取得された第１深度データを用いることになっている。 The 3D video generation unit 21 executes 3DCG rendering processing to acquire a 3D video. Specifically, the 3D video generation unit 21 uses the background video stored in the background video storage unit 13 and the first depth data regarding the background video stored in the first depth data storage unit 14. The 3D image of the background is generated by executing the rendering process. Note that the 3D video generation unit 21 uses the most recently acquired background video among the background videos stored in the background video storage unit 13 when generating the background 3D video. Similarly, the first depth data acquired most recently is used for the first depth data stored in the first depth data storage unit 14.

また、三次元映像生成部２１は、人物映像抽出部１６が抽出した人物映像（具体的にはＡさんの映像）と、第２深度データ記憶部１８に記憶された第２深度データ（厳密には、第２深度データ中、人物映像と対応する画素群のデータ）とを用いたレンダリング処理を実行して人物（Ａさん）の三次元映像を生成する。同様に、三次元映像生成部２１は、前景映像抽出部１９が抽出した前景映像と、第２深度データ記憶部１８に記憶された第２深度データ（厳密には、第２深度データ中、前景映像に相当する画素群のデータ）とを用いたレンダリング処理を実行して前景の三次元映像を生成する。なお、本システムＳでは、上述したように、レンダリング処理としてテクスチャマッピングを採用した処理を実行する。 In addition, the 3D video generation unit 21 generates a 3D video of the person (Mr. A) by executing a rendering process using the person video extracted by the person video extraction unit 16 (specifically, the video of Mr. A) and the second depth data stored in the second depth data storage unit 18 (strictly speaking, the data of the pixel group corresponding to the person video within the second depth data). Similarly, the 3D video generation unit 21 generates a 3D video of the foreground by executing a rendering process using the foreground video extracted by the foreground video extraction unit 19 and the second depth data stored in the second depth data storage unit 18 (strictly speaking, the data of the pixel group corresponding to the foreground video within the second depth data). As described above, the present system S executes a process employing texture mapping as the rendering process.

合成映像表示部２２は、三次元映像生成部２１によって生成された背景、人物及び前景のそれぞれの三次元映像を合成し、その合成映像をＢさん側のディスプレイ５に表示させる。なお、合成映像表示部２２は、三次元映像生成部２１によって生成された背景の三次元映像の中から合成映像の中に含める映像、すなわち、表示範囲を選定する。そして、合成映像表示部２２は、選定した表示範囲の手前にＡさんが位置し、且つＡさんの手前に前景が位置した合成映像を、Ｂさん側のディスプレイ５に表示させる。 The synthesized video display unit 22 synthesizes the 3D videos of the background, the person, and the foreground generated by the 3D video generation unit 21 and displays the synthesized video on the display 5 on the B side. The composite video display unit 22 selects a video to be included in the composite video, that is, a display range, from the background 3D video generated by the 3D video generation unit 21. Then, the composite video display unit 22 displays the composite video in which Mr. A is positioned in front of the selected display range and the foreground is positioned in front of Mr. A on the B-side display 5.
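The ordering of the three layers (background at the back, Mr. A in front of it, foreground in front of Mr. A) amounts to back-to-front compositing. The sketch below works on already-rendered 2D layers and is only an illustration: the source composites 3D videos, and the representation of transparent pixels as `None` and the name `composite` are assumptions.

```python
def composite(background, person, foreground):
    """Overlay person and foreground layers onto the selected background range.

    Each layer is a list of rows of pixel values. In the person and foreground
    layers, None means "transparent", so the layer behind shows through.
    Layers are applied back to front: background, then person, then foreground.
    """
    out = [row[:] for row in background]
    for layer in (person, foreground):
        for r, row in enumerate(layer):
            for c, pixel in enumerate(row):
                if pixel is not None:
                    out[r][c] = pixel
    return out
```

For a single row, `composite([["b", "b", "b"]], [[None, "p", "p"]], [[None, None, "f"]])` keeps the background on the left, shows the person in the middle, and lets the foreground cover the person on the right.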

判定部２３は、合成映像表示部２２が合成映像をディスプレイ５に表示している期間中（換言すると、Ａさん側のカメラ２がＡさんの映像を撮像している期間中）、Ａさん側のカメラ２とＡさんと間の距離（すなわち、Ａさんの奥行距離）が変化したかどうかを判定する。かかる判定は、第２深度データ記憶部１８に記憶された第２深度データに基づいて行われる。そして、奥行距離が変化したと判定部２３が判定すると、その判定結果が合成映像表示部２２に引き渡され、合成映像表示部２２は、当該判定結果に応じた合成映像をディスプレイ５に表示させる。かかる内容については、次節にて詳しく説明する。 While the composite video display unit 22 is displaying the composite video on the display 5 (in other words, while the camera 2 on Mr. A's side is capturing Mr. A's video), the determination unit 23 determines whether the distance between the camera 2 on Mr. A's side and Mr. A (that is, Mr. A's depth distance) has changed. This determination is made based on the second depth data stored in the second depth data storage unit 18. When the determination unit 23 determines that the depth distance has changed, the determination result is passed to the composite video display unit 22, which displays a composite video corresponding to the determination result on the display 5. This is described in detail in the next section.
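One way such a determination could be made from the second depth data is to compare a representative depth of the person pixel group between two frames. This is a sketch, not the patented method: the use of the median, the tolerance value, and the names `person_depth` and `depth_changed` are all assumptions.

```python
from statistics import median


def person_depth(depth, person_mask):
    """Return a representative depth distance for the person pixel group.

    Assumption: the median over measured person pixels is used so that a few
    missing or noisy pixels near the edge do not dominate the result.
    """
    values = [
        d
        for d_row, m_row in zip(depth, person_mask)
        for d, m in zip(d_row, m_row)
        if m and d is not None
    ]
    return median(values) if values else None


def depth_changed(previous, current, tolerance=0.05):
    """Report whether the depth distance changed by more than the tolerance.

    The tolerance (same unit as the depth values) is a hypothetical parameter
    that keeps sensor noise from being reported as movement.
    """
    if previous is None or current is None:
        return False
    return abs(current - previous) > tolerance
```

The determination unit would call `person_depth` on each new frame and pass the previous and current values to `depth_changed`.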

顔移動検知部２４は、赤外線センサ４の計測結果に基づいて、Ｂさん側のカメラ２が撮像した実映像についての深度データを生成するとともに、当該深度データから、Ｂさんの顔の横移動の有無を検知する。具体的に説明すると、合成映像表示部２２によって合成映像がディスプレイ５に表示されている期間中、顔移動検知部２４は、上記の深度データからＢさんの映像に相当する画素群を特定し、当該画素群の位置の変化を監視する。そして、顔移動検知部２４は、当該画素群の位置の変化を認識したとき、Ｂさんの顔が横移動したことを検知する。なお、横移動とは、Ｂさんの顔がＢさん側のディスプレイ５に対して左右方向（ディスプレイ５の幅方向）に移動することである。 Based on the measurement result of the infrared sensor 4, the face movement detection unit 24 generates depth data for the real video captured by the camera 2 on Mr. B's side and detects, from that depth data, whether Mr. B's face has moved laterally. More specifically, while the composite video is displayed on the display 5 by the composite video display unit 22, the face movement detection unit 24 specifies the pixel group corresponding to Mr. B's video from the depth data and monitors changes in the position of that pixel group. When it recognizes a change in the position of the pixel group, the face movement detection unit 24 detects that Mr. B's face has moved laterally. Here, lateral movement means that Mr. B's face moves in the left-right direction (the width direction of the display 5) relative to the display 5 on Mr. B's side.
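Monitoring the position of the pixel group can be sketched as tracking its horizontal centroid between frames. The centroid criterion, the `min_shift` threshold, and the function names are assumptions introduced for illustration; the source only says that a change in the position of the pixel group is recognized.

```python
def centroid_column(person_mask):
    """Return the mean column index of the person pixel group, or None."""
    columns = [
        c
        for row in person_mask
        for c, is_person in enumerate(row)
        if is_person
    ]
    return sum(columns) / len(columns) if columns else None


def lateral_shift(previous_mask, current_mask, min_shift=1.0):
    """Return the signed horizontal shift in pixels between two frames.

    Positive values mean the pixel group moved toward higher column indices.
    Shifts smaller than min_shift (a hypothetical noise threshold) are
    reported as 0.0, that is, no lateral movement detected.
    """
    prev_c = centroid_column(previous_mask)
    curr_c = centroid_column(current_mask)
    if prev_c is None or curr_c is None:
        return 0.0
    shift = curr_c - prev_c
    return shift if abs(shift) >= min_shift else 0.0
```

A nonzero return value would correspond to the detection result that is handed to the composite video display unit 22.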

Ｂさんの顔が横移動したことの検知結果については、合成映像表示部２２に引き渡され、合成映像表示部２２は、当該検知結果に応じた合成映像をディスプレイ５に表示させる。かかる内容については、次節にて詳しく説明する。 The detection result indicating that Mr. B's face has moved laterally is passed to the composite video display unit 22, which displays a composite video corresponding to the detection result on the display 5. This is described in detail in the next section.

＜＜対面対話の臨場感を向上させるためのプロセスについて＞＞ << About the process for improving the realism of face-to-face dialogue >>
本システムＳでは、同システムを用いた対面対話の臨場感を向上させるために、各ユーザの目線や顔の位置に応じて、ディスプレイ５に表示させる映像やその表示サイズを調整・変更することとしている。具体的には、下記（Ｒ１）〜（Ｒ３）の映像表示プロセスを行う。 In the present system S, in order to improve the realism of the face-to-face dialogue using the system, the video displayed on the display 5 and its display size are adjusted and changed according to each user's line of sight and face position. Specifically, the following video display processes (R1) to (R3) are performed.
（Ｒ１）目線高さ合わせ用のプロセス (R1) Eye height adjustment process
（Ｒ２）顔移動時のプロセス (R2) Face movement process
（Ｒ３）奥行距離変化時のプロセス (R3) Depth distance change process

以下、上記３つの映像表示プロセスの各々について個別に説明することとする。なお、以下では、Ａさんの三次元映像を含む合成映像をＢさん側のディスプレイ５にて表示するケースを例に挙げて説明することとする。 Hereinafter, each of the three video display processes will be described individually. In the following description, a case in which a composite image including a 3D image of Mr. A is displayed on the display 5 on the side of Mr. B will be described as an example.

＜目線高さ合わせ用のプロセスについて＞ <About the process for adjusting the eye height>
本システムＳでは、前述したように、カメラ２が床から約１ｍの高さに設置されている。したがって、Ａさんの身長次第では、Ａさんの目の高さとカメラ２が設置されている高さとが異なってしまう。かかる場合、Ｂさん側のディスプレイ５に表示されるＡさんの映像が、実際にＡさんと対面した場合に見えるＡさんの姿（像）とは異なったものとなる。 In the present system S, as described above, the camera 2 is installed at a height of about 1 m from the floor. Therefore, depending on the height of Mr. A, the height of Mr. A's eyes differs from the height at which the camera 2 is installed. In such a case, the image of Mr. A displayed on the display 5 on the Mr. B side is different from the appearance (image) of Mr. A seen when actually facing Mr. A.

具体的に説明すると、Ａさんの目の高さがカメラ２の設置高さよりも高くなっている場合、そのカメラ２は、Ａさんの顔の映像を下方から撮像することになる。この間、Ａさんは、Ａさん側のディスプレイ５を正面視しているため、Ａさんの目線は正面を向いていることになる。以上の状況下では、図８の（Ａ）に示すように、Ｂさん側のディスプレイ５に表示されるＡさんの映像（厳密には三次元映像であるが、図８の（Ａ）では簡略化して図示）が、Ａさんの顔を仰視したような映像となってしまう。図８は、目線高さ合わせ用のプロセスについての説明図であり、図中の（Ａ）は、実際のカメラ位置から撮像したＡさんの映像を示している。 More specifically, when Mr. A's eye height is higher than the installation height of the camera 2, the camera 2 captures Mr. A's face from below. During this time, Mr. A is looking straight at the display 5 on Mr. A's side, so Mr. A's line of sight is directed to the front. Under these circumstances, as shown in FIG. 8(A), Mr. A's video displayed on the display 5 on Mr. B's side (strictly speaking a 3D video, but shown in simplified form in FIG. 8(A)) becomes a video in which Mr. A's face appears to be viewed from below. FIG. 8 is an explanatory diagram of the process for adjusting the eye height, and (A) in the figure shows Mr. A's video captured from the actual camera position.

以上のようにＡさんの顔を仰視したような映像がディスプレイ５に表示された場合、その表示映像においてＡさんの顔は、図８の（Ａ）に示すように、目線が正面を向いておらず幾分上方を向いた状態で映し出されることになる。かかる場合には、ディスプレイ５に表示されたＡさんの目線と、ディスプレイ５を見ているＢさんの目線と、を一致させ難くなり、対面対話の臨場感が損なわれてしまう虞がある。 As described above, when an image as if looking up at Mr. A's face is displayed on the display 5, the face of Mr. A in the displayed image has a line of sight toward the front as shown in FIG. It will be projected in a state of facing upwards. In such a case, it is difficult to match Mr. A's line of sight displayed on the display 5 with Mr. B's line of sight looking at the display 5, and there is a possibility that the sense of reality of the face-to-face dialogue will be impaired.

そこで、本システムＳでは、Ａさんの目の高さとカメラ２の設置高さとが異なるとき、ディスプレイ５に表示されるＡさんの目線とディスプレイ５を見ているＢさんの目線とを一致させるために、目線高さ合わせ用のプロセスを行うこととしている。当該プロセスについて説明すると、３ＤＣＧのレンダリング処理を実行し、Ａさんの目の高さと同じ高さにある仮想的な視点から見たＡさんの三次元映像を取得することとしている。具体的に説明すると、目線高さ合わせ用のプロセスを行うにあたり、Ｂさん側のホームサーバ１（厳密には、前述の高さ検知部２０）がＡさんの目の高さを検知する。一方、Ｂさん側のホームサーバ１は、Ａさん側のカメラ２が設置されている高さに関する情報を記憶している。そして、Ａさんの目の高さ及びＡさん側のカメラ２の設置高さの双方が異なっているとき、Ｂさん側のホームサーバ１（厳密には、前述の三次元映像生成部２１）は、検知した目の高さにある仮想的な視点から見たときのＡさんの三次元映像を取得するためのレンダリング処理を実行する。 Therefore, in the present system S, when the height of Mr. A's eyes and the installation height of the camera 2 are different, the eyes of Mr. A displayed on the display 5 and the eyes of Mr. B looking at the display 5 are matched. In addition, a process for adjusting the eye height is performed. The process will be described. A 3DCG rendering process is executed to acquire a three-dimensional image of Mr. A viewed from a virtual viewpoint at the same height as the height of Mr. A's eyes. More specifically, when performing the eye height adjustment process, Mr. B's home server 1 (strictly speaking, the height detection unit 20 described above) detects Mr. A's eye height. On the other hand, Mr. B's home server 1 stores information about the height at which Mr. A's camera 2 is installed. When both the height of Mr. A's eyes and the installation height of the camera 2 on the A's side are different, the home server 1 on the B's side (strictly speaking, the 3D video generation unit 21 described above) Then, a rendering process for acquiring a 3D video of Mr. A when viewed from a virtual viewpoint at the detected eye level is executed.

上記のレンダリング処理について図８の（Ｂ）を参照しながら説明する。図８の（Ｂ）は、カメラ２とＡさんの目線との位置関係を示した図である。Ｂさん側のホームサーバ１は、Ａさんの目の高さ及びＡさん側のカメラ２の設置高さの双方が異なっているとき、当該双方の差（図８の（Ｂ）では記号Ｈにて表記）を特定する。また、Ｂさん側のホームサーバ１は、記憶されている第２深度データに基づいて、ＡさんとＡさん側のカメラ２との間の距離（すなわち、Ａさんの奥行距離であり、図８の（Ｂ）では記号Ｌにて表記）を特定する。その上で、Ｂさん側のホームサーバ１は、検知したＡさんの目の高さと同じ高さに設置された仮想的なカメラ（図８の（Ｂ）において破線にて示す）の撮像方向と、実際のカメラ２の撮像方向と、の間の相違を特定する。具体的には、下記の式（１）にて求められる角度αを上記の相違として算出する。 The above rendering process will be described with reference to FIG. 8(B). FIG. 8(B) shows the positional relationship between the camera 2 and Mr. A's line of sight. When Mr. A's eye height differs from the installation height of the camera 2 on Mr. A's side, the home server 1 on Mr. B's side specifies the difference between the two (denoted by the symbol H in FIG. 8(B)). The home server 1 on Mr. B's side also specifies, based on the stored second depth data, the distance between Mr. A and the camera 2 on Mr. A's side (that is, Mr. A's depth distance, denoted by the symbol L in FIG. 8(B)). The home server 1 on Mr. B's side then specifies the difference between the imaging direction of a virtual camera installed at the same height as the detected eye height of Mr. A (indicated by a broken line in FIG. 8(B)) and the imaging direction of the actual camera 2. Specifically, the angle α obtained by the following equation (1) is calculated as this difference.
α = arctan(H / L)  (1)
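Equation (1) can be checked numerically with a short sketch. The function name `viewpoint_angle`, its parameter names, and the example heights below are assumptions chosen for illustration; only the symbols H, L, and α come from the text.

```python
import math


def viewpoint_angle(eye_height, camera_height, depth_distance):
    """Return the angle alpha of equation (1) in radians.

    H is the difference between the eye height and the camera installation
    height, and L is the depth distance between the person and the camera.
    For L > 0, atan2(H, L) equals arctan(H / L).
    """
    if depth_distance <= 0:
        raise ValueError("depth_distance (L) must be positive")
    height_difference = eye_height - camera_height  # H in equation (1)
    return math.atan2(height_difference, depth_distance)


# Hypothetical example: eyes at 1.6 m, camera at 1.0 m, person 2.0 m away.
alpha = viewpoint_angle(1.6, 1.0, 2.0)
print(f"alpha = {alpha:.4f} rad = {math.degrees(alpha):.1f} deg")
```

With these example values H is 0.6 m and L is 2.0 m, so α is arctan(0.3), roughly 16.7 degrees. A negative α results when the eyes are below the camera, which corresponds to the camera looking down at the face.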

そして、Ｂさん側のホームサーバ１は、角度αの算出結果を用いて、上記の仮想的なカメラから撮像したＡさんの映像（三次元映像）を取得するためのレンダリング処理を実行する。具体的には、カメラ２が撮像したＡさんの映像（厳密には、実映像から抽出したＡさんの映像）と、実映像についての深度データである第２深度データと、を用いたテクスチャマッピングを行い、さらに、算出した角度αに相当する高さだけ視点を変位させる映像処理を行う。これにより、仮想的なカメラから撮像したときのＡさんの三次元映像、換言すると、図８の（Ｃ）のように目線が正面を向いたＡさんの三次元映像が取得されるようになる。図８の（Ｃ）は、仮想的なカメラ位置から撮像したＡさんの映像を示している。 Then, the home server 1 on Mr. B's side uses the calculated angle α to execute a rendering process for acquiring Mr. A's video (3D video) as captured from the virtual camera. Specifically, texture mapping is performed using Mr. A's video captured by the camera 2 (strictly speaking, Mr. A's video extracted from the real video) and the second depth data, which is the depth data for the real video, and video processing is then performed to displace the viewpoint by a height corresponding to the calculated angle α. As a result, Mr. A's 3D video as captured from the virtual camera, in other words, a 3D video of Mr. A with the line of sight directed to the front as shown in FIG. 8(C), is acquired. FIG. 8(C) shows Mr. A's video captured from the virtual camera position.

その後、Ｂさん側のホームサーバ１（厳密には、前述の合成映像表示部２２）は、上記の手順により取得したＡさんの三次元映像と、背景及び前景のそれぞれの三次元映像とを合成し、その合成映像をＢさん側のディスプレイ５に表示させる。 Thereafter, Mr. B's home server 1 (strictly speaking, the above-described synthesized video display unit 22) synthesizes the 3D video of Mr. A acquired by the above procedure and the 3D video of the background and foreground. Then, the synthesized video is displayed on the display 5 on the B side.

<Process at the time of face movement>
When Mr. B's face moves sideways in a scene where Mr. A and Mr. B are actually facing each other, Mr. B's field of view (the image seen by Mr. B's eyes) changes with the face movement. To reproduce this change in appearance in a video display system, the video displayed on the display 5 must change in conjunction with the movement of the face of the person viewing the display 5. For this reason, in the conventional video display system, as shown in FIG. 9, when the face of the person viewing the display 5 (for example, Mr. B) moves sideways, the video displayed on the display 5 switches so as to rotate about a vertical axis. Specifically, as shown in the figure, a video whose depth distance differs between its left part and its right part is displayed on the display 5. FIG. 9 shows a configuration example of a conventional video display system and illustrates how the displayed video changes in conjunction with the movement of Mr. B, who is viewing the display 5.

However, when Mr. B's face moves sideways in a scene where Mr. A and Mr. B are actually talking face to face, the figure of Mr. A as seen by Mr. B does not rotate as described above; it merely moves horizontally. In addition, in the video display system shown in FIG. 9, when Mr. B's face moves sideways, both the video of Mr. A and the background video are rotated by the same rotation amount (rotation angle). Consequently, in the video display system shown in FIG. 9, the video of Mr. A displayed on the display 5 when Mr. B's face moves sideways differs from how Mr. A would look in an actual face-to-face meeting.

In contrast, in the present system S, the process at the time of face movement is performed when Mr. B's face moves sideways, and the video (composite video) displayed on the display 5 is transitioned so as to accurately reflect how Mr. A would look if Mr. B were actually facing and looking at Mr. A. The process at the time of face movement is described below with reference to FIGS. 10A, 10B, and 11. FIG. 10A schematically shows a situation in which Mr. B's face has moved sideways. FIG. 10B is an explanatory diagram of the depth distances of Mr. A, the background, and the foreground. FIG. 11 is an explanatory diagram showing the change in the composite video when the transition process described later is executed; (A) shows the composite video before the transition process, and (B) shows the composite video after the transition process.

The following description takes as an example a case where Mr. B, who was initially standing at approximately the center of the display 5, moves sideways. In the following description, one of the two mutually opposite directions along the width direction (that is, the left-right direction) of the display 5 is called the "first direction", and the other is called the "second direction". The relationship between the first direction and the second direction is relative: when one direction along the left-right direction is taken as the first direction, the other becomes the second direction. Therefore, when the display 5 is viewed from the front, if leftward is the first direction, rightward is the second direction; conversely, if rightward is the first direction, leftward is the second direction.

While the composite video is displayed on Mr. B's display 5, Mr. B's home server 1 (strictly speaking, the face movement detection unit 24 described above) detects whether Mr. B's face has moved. When it detects a lateral movement of Mr. B's face, Mr. B's home server 1 detects the direction and the amount of the movement at the same time. Furthermore, Mr. B's home server 1 (strictly speaking, the composite video display unit 22 described above) executes a transition process according to the detection result for Mr. B's face movement. The transition process transitions the composite video displayed on Mr. B's display 5 from its state before the lateral movement of Mr. B's face was detected. Specifically, the composite video is transitioned to a state in which both the display positions of Mr. A's three-dimensional video and the foreground three-dimensional video in the composite video, and the range of the background three-dimensional video that is included in the composite video (that is, the display range), are shifted in the left-right direction.

The transition process is described in detail below. In this process, a shift amount is first set for each of the display position of Mr. A's three-dimensional video, the display position of the foreground three-dimensional video, and the display range of the background three-dimensional video in the composite video. Assuming that Mr. B's face has moved in the first direction by a movement amount x, each shift amount is set according to the movement amount x of Mr. B's face and the distance between the camera 2 and its subject (Mr. A, the background, and the foreground), that is, the depth distance. In the present system S, the movement amount x of Mr. B's face is converted into a movement angle when the shift amounts are set. The movement angle expresses the amount of change in Mr. B's line of sight as an angle. The line of sight is a virtual straight line from the midpoint between Mr. B's eyes toward the center of the display 5.

Referring to FIG. 10A, the line drawn as a one-dot chain line corresponds to the line of sight before Mr. B's face moves, and the line drawn as a two-dot chain line corresponds to the line of sight after the movement. The acute angle formed by the two lines of sight, that is, the angle θ in FIG. 10A, corresponds to the movement angle. As shown in FIG. 10A, the line of sight before Mr. B's face moves is assumed to lie along the normal direction of the display screen of the display 5.
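The specification does not give a formula for converting the movement amount x into the movement angle θ. One plausible sketch, assuming the face moves parallel to the display screen at a constant viewing distance, is shown below. The viewing distance parameter and the function name are hypothetical and do not appear in the original text.

```python
import math


def movement_angle(lateral_shift_x: float, viewing_distance: float) -> float:
    """Return the acute angle theta (radians) between the line of sight
    before and after a sideways face movement.

    Assumptions (not stated in the specification):
    - the initial line of sight is normal to the screen and passes
      through the display centre;
    - the face moves parallel to the screen, so its distance to the
      screen plane stays equal to viewing_distance.
    """
    if viewing_distance <= 0:
        raise ValueError("viewing distance must be positive")
    # The new line of sight runs from the moved face to the display centre,
    # so tan(theta) = |x| / viewing_distance.
    return math.atan2(abs(lateral_shift_x), viewing_distance)


theta = movement_angle(0.5, 2.0)  # illustrative values only
print(f"theta = {math.degrees(theta):.2f} degrees")
```

Using the absolute value keeps θ acute regardless of the direction of movement; the direction itself is handled separately as the first or second direction.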

In setting the shift amounts, the depth distance of each of Mr. A, the background (for example, a wall), and the foreground (for example, a box in front of Mr. A) is specified. During the face-to-face conversation, Mr. A is assumed to remain at a position separated from Mr. A's camera 2 by the reference distance d1, as shown in FIG. 10B. The wall of the room, which is the background, is separated from Mr. A's camera 2 by a distance dw, as shown in FIG. 10B. This distance dw is naturally longer than the reference distance d1, which is Mr. A's depth distance. The box placed in front of Mr. A, which is the foreground, is separated from Mr. A's camera 2 by a distance df, as shown in FIG. 10B. This distance df is naturally shorter than the reference distance d1.

After the movement angle θ and the depth distances d1, dw, and df of Mr. A, the background, and the foreground have been specified, a shift amount is set for each of the display position of Mr. A's three-dimensional video, the display position of the foreground three-dimensional video, and the display range of the background three-dimensional video in the composite video. Specifically, when the shift amount for the display position of Mr. A's three-dimensional video is denoted t1, the shift amount t1 is calculated by the following equation (2).
t1 = d1 × sin θ   (2)

When the shift amount for the display range of the background three-dimensional video is denoted t2, the shift amount t2 is calculated by the following equation (3).
t2 = dw × sin θ   (3)

When the shift amount for the display position of the foreground three-dimensional video is denoted t3, the shift amount t3 is calculated by the following equation (4).
t3 = df × sin θ   (4)
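Equations (2) to (4) can be sketched together as follows. The function and field names are hypothetical. The results are in the same length unit as the depth distances; any conversion to display pixels is not described in the text and is left out here.

```python
import math
from typing import NamedTuple


class ShiftAmounts(NamedTuple):
    person_t1: float      # equation (2): t1 = d1 * sin(theta)
    background_t2: float  # equation (3): t2 = dw * sin(theta)
    foreground_t3: float  # equation (4): t3 = df * sin(theta)


def shift_amounts(theta: float, d1: float, dw: float, df: float) -> ShiftAmounts:
    """Compute the three shift amounts for a movement angle theta (radians)."""
    s = math.sin(theta)
    return ShiftAmounts(d1 * s, dw * s, df * s)


# Illustrative depth distances with df < d1 < dw, as in FIG. 10B.
t = shift_amounts(math.radians(10.0), d1=2.0, dw=4.0, df=1.0)
print(t)
```

Because df < d1 < dw and all three amounts share the factor sin θ, the ordering t3 < t1 < t2 follows directly for any positive acute θ.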

After the shift amounts t1, t2, and t3 have been set, the composite video is transitioned to a state in which the display position of Mr. A's three-dimensional video is shifted by the shift amount t1, the display range of the background three-dimensional video is shifted by the shift amount t2, and the display position of the foreground three-dimensional video is shifted by the shift amount t3, each in the second direction. As a result, Mr. B's display 5, which initially showed the composite video illustrated in FIG. 11(A), gradually transitions to the state illustrated in FIG. 11(B) in conjunction with the lateral movement of Mr. B's face.

As described above, in the present system S, when Mr. B's face moves in the first direction while the composite video is displayed on Mr. B's display 5, the display position of Mr. A's three-dimensional video, the display position of the foreground three-dimensional video, and the display range of the background three-dimensional video in the composite video all shift in the second direction. The shift amount t2 for the display range of the background three-dimensional video is larger than the shift amount t1 for the display position of Mr. A's three-dimensional video, and the shift amount t3 for the display position of the foreground three-dimensional video is smaller than t1. By transitioning the composite video to a state in which these three are shifted by mutually different amounts, Mr. B's display 5 shows a video that reflects how Mr. A would look if Mr. B actually saw Mr. A from the position of the face after the movement.
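The rule that every layer shifts in the direction opposite to the face movement can be sketched with signed coordinates. In this hypothetical sketch the first direction is encoded as +1 and the second direction as -1 along the display width; the encoding, the dictionary keys, and the function name are assumptions made only for illustration.

```python
from typing import Dict


def apply_transition(positions: Dict[str, float],
                     shifts: Dict[str, float],
                     face_direction: int) -> Dict[str, float]:
    """Shift each layer opposite to the face movement.

    positions: current horizontal position per layer (display position
               for person/foreground, display-range origin for background).
    shifts: non-negative shift amount per layer (t1, t2, t3).
    face_direction: +1 if the face moved in the first direction,
                    -1 if it moved in the opposite direction.
    """
    if face_direction not in (1, -1):
        raise ValueError("face_direction must be +1 or -1")
    return {name: positions[name] - face_direction * shifts[name]
            for name in positions}


before = {"person": 0.0, "background": 0.0, "foreground": 0.0}
amounts = {"person": 1.0, "background": 2.0, "foreground": 0.5}
after = apply_transition(before, amounts, face_direction=1)
print(after)
```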

To put it simply, if Mr. B were actually talking face to face with Mr. A and Mr. B's face moved sideways, the things visible from Mr. B's new position would appear displaced from their original positions. Things closer to Mr. B appear displaced from their original positions by a smaller amount, and things farther away appear displaced by a larger amount. To reproduce this appearance, when the present system S detects that Mr. B's face has moved sideways, it transitions the composite video so that the display position of Mr. A's three-dimensional video, the display position of the foreground three-dimensional video, and the display range of the background video are shifted by mutually different amounts. Here, the shift amount t2 for the display range of the background three-dimensional video is larger than the shift amount t1 for the display position of Mr. A's three-dimensional video. As a result, in the composite video after the transition process, Mr. B can see a part of the background that was not displayed in the original composite video (the composite video before Mr. B's face moved); that is, so-called peeking becomes possible.

<Process at the time of depth distance change>
During a face-to-face conversation, Mr. A usually stands at a position separated from Mr. A's camera 2 by the reference distance d1. When the video of Mr. A captured by the camera 2 is displayed on the display 5 in this state, the video is displayed at life size, as shown in FIG. 12. When Mr. A moves backward from that position and the video captured by the camera 2 is displayed on the display 5 without resizing, the video is displayed somewhat smaller than life size, as shown in FIG. 12. This change in display size arises unavoidably from the optical characteristics of the lens of the camera 2. FIG. 12 shows a configuration example of a conventional video display system and illustrates how the display size of Mr. A's video on the display 5 decreases as Mr. A's depth distance increases.

However, in a scene where Mr. B and Mr. A are actually facing each other, even if Mr. A moves slightly closer to or farther from Mr. B, Mr. A's figure (size) appears almost unchanged as seen by Mr. B. Therefore, in the present system S, the process at the time of depth distance change is performed in order to reproduce how Mr. A actually looks when Mr. A's depth distance changes. As a result, the display size of Mr. A's video (strictly speaking, the three-dimensional video) displayed on Mr. B's display 5 is kept at life size even after Mr. A's depth distance has changed.

The process at the time of depth distance change is described below, assuming a case where Mr. A's depth distance changes from the reference distance d1 to a distance d2 that is larger than d1. This process is performed when Mr. A's depth distance changes during the period in which the composite video is displayed on Mr. B's display 5 (in other words, the period in which Mr. A's camera 2 is capturing the video of Mr. A). Specifically, Mr. B's home server 1 (strictly speaking, the determination unit 23 described above) determines, during this period, whether the depth distance has changed. When it determines that the depth distance has changed, Mr. B's home server 1 uses this as a trigger to start the process at the time of depth distance change.

In the process at the time of depth distance change, Mr. B's home server 1 (strictly speaking, the composite video display unit 22) executes an adjustment process that adjusts the display size of Mr. A's three-dimensional video in the composite video. In the adjustment process, the depth distance d2 after the change is first specified. Then, based on the specified depth distance d2, the display size of Mr. A's video is adjusted so that it becomes the display size before Mr. A's position changed in the depth direction, that is, life size. Specifically, when Mr. A's depth distance changes from d1 to d2, the adjustment process corrects the display size of Mr. A's video (strictly speaking, each of the vertical size and the horizontal size of the video) by multiplying it by the ratio of the depth distances (d1/d2).

Thereafter, Mr. B's home server 1 combines the size-corrected three-dimensional video of Mr. A with the three-dimensional videos of the background and the foreground, and displays the resulting composite video on the display 5. As a result, as shown in FIGS. 13(A) and 13(B), even if Mr. A's depth distance changes, Mr. A's video is displayed at the display size it had before the depth distance changed. Adjusting the display size of Mr. A's three-dimensional video in this way, so that it reflects how Mr. A would look in an actual face-to-face meeting when the depth distance changes, further improves the sense of presence (realism) of the face-to-face conversation using the present system S. FIG. 13 is an explanatory diagram of the result of the adjustment process; (A) shows the composite video before the depth distance changes, and (B) shows the composite video after the depth distance has changed and the adjustment process has been performed. For comparison of display sizes, FIG. 13(B) also shows, by a broken line, the video of Mr. A after the depth distance has changed but before the adjustment process is performed.

<<Video display flow>>
Next, of the face-to-face conversation using the present system S, the series of data processing related to video display, that is, the video display flow, is described. The video display method of the present invention is applied in the video display flow described below. That is, the following explains the video display method of the present invention by describing a video display flow to which that method is applied. In other words, each step in the video display flow described below corresponds to a constituent element of the video display method of the present invention.

The following description takes as an example a case where a composite video including Mr. A's three-dimensional video is displayed on Mr. B's display 5. The procedure for displaying a composite video including Mr. B's three-dimensional video on Mr. A's display 5 is substantially the same as the procedure below.

The video display flow proceeds as Mr. B's home server 1, which is a computer, carries out the steps shown in FIGS. 14 and 15. FIGS. 14 and 15 show the video display flow. Specifically, Mr. B's home server 1 first communicates with Mr. A's home server 1 to receive the video data of the background video and the depth data for the background video (first depth data) (S001). Mr. B's home server 1 thereby acquires, as the background video, a video of the room that Mr. A uses for the face-to-face conversation. At the same time, Mr. B's home server 1 acquires the first depth data as data (distance data) indicating the distance between the background and the camera 2. Step S001 is performed while Mr. A's camera 2 is capturing only the background video, that is, during a period when Mr. A is not in the room where the face-to-face conversation takes place. The acquired background video and first depth data are stored in the hard disk drive or the like of Mr. B's home server 1.

Mr. B's home server 1 then reads out the most recently acquired background video and first depth data from among those stored, and executes texture mapping as a rendering process using them. Mr. B's home server 1 thereby acquires the three-dimensional video of the background (S002).

Meanwhile, when Mr. A enters the room for face-to-face conversation and starts the conversation, the camera 2 installed in the room captures a video including Mr. A together with the background and the foreground, that is, a real video. Mr. A's home server 1 transmits the video data of the real video captured by the camera 2, and Mr. B's home server 1 receives the video data. Mr. B's home server 1 thereby acquires the real video. Simultaneously with the video data of the real video, Mr. A's home server 1 transmits the depth data for the real video (second depth data), and Mr. B's home server 1 receives the second depth data. Mr. B's home server 1 thereby acquires the second depth data as a set with the real video (S003). The acquired real video and second depth data are stored in the hard disk drive or the like of Mr. B's home server 1.

Thereafter, Mr. B's home server 1 extracts a person video, specifically the video of Mr. A, from the acquired real video (S004). More specifically, Mr. B's home server 1 specifies Mr. A's skeleton model based on the second depth data acquired in the preceding step S003 and the video captured by the camera 2, and then extracts the video of Mr. A from the real video based on that skeleton model.

Mr. B's home server 1 then executes a rendering process, specifically texture mapping, using the video of Mr. A extracted in the preceding step S004 and the second depth data. Mr. B's home server 1 thereby acquires the three-dimensional video of the person (Mr. A) (S005).

Mr. B's home server 1 also extracts the foreground video from the real video acquired in step S003, based on the second depth data (S006). Thereafter, Mr. B's home server 1 executes a rendering process by texture mapping using the extracted foreground video and the second depth data. Mr. B's home server 1 thereby acquires the three-dimensional video of the foreground (S007).

After acquiring the three-dimensional videos of Mr. A and the foreground, Mr. B's home server 1 combines these three-dimensional videos with the portion of the background three-dimensional video acquired in step S002 that lies within a predetermined range (the display range) (S008). Mr. B's home server 1 then displays the composite video on Mr. B's display 5 (S009). As a result, Mr. B's display 5 shows Mr. A's three-dimensional video at life size in front of the background three-dimensional video, and shows the foreground three-dimensional video in front of Mr. A's three-dimensional video.
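The layer ordering in steps S008 and S009 (background at the back, then Mr. A, then the foreground) can be sketched as simple back-to-front compositing. The specification does not describe the compositing algorithm; the pixel representation below, where None marks a transparent pixel, and all names are hypothetical and chosen only to show the ordering.

```python
from typing import List, Optional

Pixel = Optional[str]        # None marks a transparent pixel (assumption)
Layer = List[List[Pixel]]    # rows of pixels; all layers share one size


def composite(layers_back_to_front: List[Layer]) -> Layer:
    """Overlay layers so that later (nearer) layers hide earlier ones."""
    height = len(layers_back_to_front[0])
    width = len(layers_back_to_front[0][0])
    out: Layer = [[None] * width for _ in range(height)]
    for layer in layers_back_to_front:
        for y in range(height):
            for x in range(width):
                if layer[y][x] is not None:
                    out[y][x] = layer[y][x]
    return out


# One-row toy images: 'b' = background, 'p' = person, 'f' = foreground.
background = [["b", "b", "b", "b"]]
person = [[None, "p", "p", None]]
foreground = [[None, None, "f", None]]
print(composite([background, person, foreground]))
```

Where the foreground overlaps the person, the foreground pixel wins, which matches the described display with the foreground in front of Mr. A.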

Of the series of steps in the video display flow described above, step S005, which acquires the three-dimensional video of the person, is now described in more detail with reference to FIG. 16. FIG. 16 shows the procedure for acquiring the three-dimensional video of the person. In step S005, texture mapping is first performed using the video of Mr. A extracted in the preceding step S004 and the second depth data (S011). As a result, a video as seen from the position where Mr. A's camera 2 is installed is acquired as the three-dimensional video of Mr. A.

Next, the height of Mr. A's eyes is detected based on the second depth data stored in Mr. B's home server 1 (S012). Mr. B's home server 1 then compares the detected height of Mr. A's eyes with the installation height of Mr. A's camera 2 (S013). If the two heights differ, Mr. B's home server 1 performs the process for eye-line height alignment (S014). In this process, Mr. B's home server 1 performs a rendering process for acquiring a three-dimensional video of Mr. A as seen from a virtual viewpoint located at the same height as Mr. A's detected eyes. Strictly speaking, the three-dimensional video acquired in step S011 is subjected to video processing that displaces the viewpoint by a height corresponding to the angle α calculated by equation (1) above. This makes it possible to acquire the three-dimensional video of Mr. A as seen from the virtual viewpoint, that is, a three-dimensional video in which Mr. A's line of sight faces the front (S015).

On the other hand, if the detected height of Mr. A's eyes matches the installation height of Mr. A's camera 2, Mr. B's home server 1 uses the three-dimensional video acquired in step S011 as it is in the subsequent steps.
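The branch in steps S013 to S015 can be sketched as a function that returns the correction angle only when the two heights differ. The tolerance parameter, the sign convention (positive when the eyes are above the camera), and the function name are hypothetical; the specification only distinguishes the cases where the heights differ or match.

```python
import math
from typing import Optional


def viewpoint_correction_angle(eye_height: float,
                               camera_height: float,
                               depth_distance: float,
                               tolerance: float = 0.0) -> Optional[float]:
    """Return alpha (radians) for eye-line alignment, or None if not needed.

    None corresponds to the case where the heights match and the video
    from step S011 is used as it is. Otherwise alpha = arctan(H / L),
    with H = eye_height - camera_height (assumed sign convention).
    """
    if depth_distance <= 0:
        raise ValueError("depth distance must be positive")
    h = eye_height - camera_height
    if abs(h) <= tolerance:
        return None
    return math.atan(h / depth_distance)


print(viewpoint_correction_angle(1.60, 1.60, 2.0))  # heights match
print(viewpoint_correction_angle(1.60, 1.20, 2.0))  # correction needed
```

In practice a small non-zero tolerance would likely be needed, since measured heights rarely match exactly; that choice is outside what the text specifies.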

In the video display flow, Mr. B's home server 1 also acquires the real video captured by Mr. B's camera 2 (the video of Mr. B, the background, and the foreground) and, based on the measurement result from the infrared sensor 4, acquires the depth data for that real video (second depth data). Based on this depth data, Mr. B's home server 1 determines whether Mr. B's face has moved sideways during the period in which the composite video is displayed on Mr. B's display 5 (S021). When it determines that Mr. B's face has moved sideways, Mr. B's home server 1 specifies the direction and the amount of the face movement based on the depth data before the movement and the depth data after the movement (S022).

Furthermore, Mr. B's home server 1 specifies the depth distance of each of Mr. A, the background, and the foreground based on the second depth data acquired in step S003 (S023). Mr. B's home server 1 then calculates, based on the values specified in steps S022 and S023, the shift amounts to be used in the transition process executed in the next step S025 (S024). More specifically, in step S024, the shift amount t1 for the display position of Mr. A's three-dimensional video in the composite video, the shift amount t2 for the range of the background three-dimensional video included in the composite video (the display range), and the shift amount t3 for the display position of the foreground three-dimensional video in the composite video are calculated according to equations (2) to (4) above, respectively.

After calculating the shift amounts, Mr. B's home server 1 executes the transition processing (S025). This processing causes the composite video displayed on the display 5 to transition from its state before the lateral movement of Mr. B's face was detected. Specifically, when it detects that Mr. B's face has moved laterally in a first direction, Mr. B's home server 1 transitions the composite video to a state in which the display position of Mr. A's 3D video, the display position of the foreground 3D video, and the display range of the background 3D video are each shifted in a second direction by the shift amounts calculated in the preceding step S024. Here, the shift amount for the display range of the background 3D video is larger than the shift amount for the display position of Mr. A's 3D video, and the shift amount for the display position of the foreground 3D video is smaller than the shift amount for the display position of Mr. A's 3D video.
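The ordering of the three shift amounts in the transition processing can be illustrated with a minimal sketch. Equations (2) to (4) are not reproduced in this passage, so the sketch substitutes a hypothetical stand-in in which the shift grows linearly with depth distance. The names `Layer`, `shift_amounts`, `transition`, and `gain` are illustrative assumptions, as is the choice that the second direction is opposite to the first.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    depth: float         # depth distance from the camera (assumed unit: metres)
    offset: float = 0.0  # horizontal display offset (assumed unit: pixels)


def shift_amounts(face_move, layers, gain=0.5):
    """Hypothetical stand-in for equations (2) to (4): shift proportional to depth."""
    return {layer.name: gain * abs(face_move) * layer.depth for layer in layers}


def transition(face_move, layers, gain=0.5):
    """Shift every layer against the face movement by its own shift amount."""
    shifts = shift_amounts(face_move, layers, gain)
    # Assumption: the second direction is opposite to the first direction.
    direction = -1.0 if face_move > 0 else 1.0
    for layer in layers:
        layer.offset += direction * shifts[layer.name]
    return shifts


layers = [Layer("foreground", 1.0), Layer("person", 2.0), Layer("background", 4.0)]
shifts = transition(face_move=40.0, layers=layers)
```

With depths ordered foreground, person, background, this stand-in yields t3 < t1 < t2, which matches the relation described for step S025; the actual magnitudes in the system depend on equations (2) to (4).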

When the transition processing is complete, Mr. B's home server 1 causes the display 5 to show the composite video after the transition processing, that is, the composite video in which the display position of Mr. A's 3D video, the display position of the foreground 3D video, and the display range of the background 3D video have been shifted from their initial state (S026). As a result, the display 5 shows video that reproduces how the scene would look from the position of Mr. B's face after the lateral movement. As noted above, in the composite video after the transition processing, the shift amount for the display range of the background 3D video is larger than that for the display position of Mr. A's 3D video. Mr. B can therefore peek at parts of the background 3D video that were not initially shown on the display 5 by moving his or her face left or right.

In addition, based on the second depth data acquired in step S003, Mr. B's home server 1 determines whether Mr. A's depth distance has changed during the period in which the composite video is displayed on Mr. B's display 5 (S027). If it determines that Mr. A's depth distance has changed, Mr. B's home server 1 identifies the depth distance after the change based on the second depth data after the change (S028). Mr. B's home server 1 then adjusts the display size of Mr. A's 3D video according to the identified depth distance (S029). In doing so, it adjusts the display size so that Mr. A's 3D video after the change in depth distance is displayed at the display size used before the change, that is, at life size. Once the display size has been adjusted, Mr. B's home server 1 combines the size-adjusted 3D video of Mr. A with the 3D videos of the background and the foreground, and causes the display 5 to show the resulting composite video (S030). Consequently, Mr. A's 3D video continues to be displayed at life size on the display 5 even after Mr. A's depth distance has changed.
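The size adjustment in steps S027 to S029 can be sketched as follows. The text does not state the formula used, so the sketch assumes a simple pinhole camera model in which a subject's apparent size in the captured video is inversely proportional to its depth distance. The function name `life_size_scale` is a hypothetical name introduced for illustration.

```python
def life_size_scale(depth_before, depth_after):
    """Scale factor that restores the display size used before the depth change.

    Assumes a pinhole model: apparent size is proportional to 1 / depth,
    so the captured image shrinks by depth_before / depth_after and must be
    enlarged by the reciprocal.
    """
    if depth_before <= 0 or depth_after <= 0:
        raise ValueError("depth distances must be positive")
    return depth_after / depth_before


# Mr. A steps back from 2 m to 4 m: the captured figure halves in size,
# so it is enlarged by a factor of 2 to stay life size.
captured_height_px = 400 * (2.0 / 4.0)
displayed_height_px = captured_height_px * life_size_scale(2.0, 4.0)
```

Under this assumption the displayed height returns to the 400 pixels it had before the depth distance changed.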

<<Modification of the video display system>>
In the configuration of the present system S described above, one camera 2 is provided to capture the video of each user. That is, in the above embodiment, a single camera 2 captures the user's video, and the display 5 shows a 3D video based on the video captured by that single camera 2. In contrast, if the user's video is captured by a plurality of cameras 2 whose imaging directions differ from one another, the user's video can be acquired from more viewpoints. As a result, for the user's 3D video generated by the rendering process from the video captured by the cameras 2, the blind-spot regions that cannot be seen with a single camera 2 alone are reduced, and there is greater freedom in where the viewpoint (virtual viewpoint) for viewing the 3D video can be set.

A configuration in which a plurality of cameras 2 capture the user's video (hereinafter, the modification) is described below. Descriptions of configurations that are the same as those already described are omitted, and only the differences are described. The following takes as an example the case in which Mr. A's video is captured by two cameras 2, an upper one and a lower one. The number of cameras 2, their installation locations, and their respective imaging directions are not limited to what is described below and can be set arbitrarily.

In the modification, as shown in FIG. 17, Mr. A's video is captured by an upper and a lower camera 2. FIG. 17 schematically shows Mr. A's video being captured by the two cameras 2. The two cameras 2 capture Mr. A's video from different positions. Specifically, the upper camera 2 is installed somewhat higher than Mr. A's height, and the lower camera 2 is installed slightly above the floor.

In the modification, the video display screen of the display 5 (strictly speaking, the front surface of the touch panel 5a) serves as a reference plane, and the imaging direction of each of the two cameras 2 is tilted vertically with respect to the normal direction of the reference plane. The imaging direction is the optical axis direction of the lens of the camera 2. The imaging direction of the upper camera 2 is set so that it descends as it approaches Mr. A; in other words, the upper camera 2 captures Mr. A's body from above. The imaging direction of the lower camera 2 is set so that it rises as it approaches Mr. A; in other words, the lower camera 2 captures Mr. A's body from below.

In the face-to-face conversation according to the modification, Mr. A stands at a position separated from the above reference position by the reference distance d1. When Mr. A stands at this position, the upper camera 2 captures video of Mr. A from the head to the waist (hereinafter, upper-body video), and the lower camera 2 captures video of Mr. A from the feet to the abdomen (hereinafter, lower-body video). In addition, in the modification an infrared sensor 4 is provided for each camera 2. This makes it possible to acquire depth data (strictly speaking, second depth data) separately for the video (real video) captured by each of the two cameras 2.

In the modification, Mr. B's home server 1 acquires Mr. A's video separately for each camera 2. Specifically, Mr. A's home server 1 transmits the video data of the real video containing the upper-body video captured by the upper camera 2 and the video data of the real video containing the lower-body video captured by the lower camera 2. Mr. B's home server 1 acquires these video data and extracts Mr. A's video, specifically the upper-body video and the lower-body video, from the real video represented by each.

Also in the modification, Mr. B's home server 1 receives, from Mr. A's home server 1 and separately for each camera, the depth data for the real video captured by each camera 2. That is, Mr. B's home server 1 acquires, for each camera, the depth data for the real video containing Mr. A's upper-body video or lower-body video. Furthermore, Mr. B's home server 1 (strictly speaking, the 3D video generation unit 21) performs a video piece generation step of generating a 3D video piece for each camera based on the real video and depth data acquired for that camera.

Specifically, in the video piece generation step, Mr. B's home server 1 performs a rendering process using Mr. A's upper-body video obtained from the real video captured by the upper camera 2 and the depth data for that real video. This yields a 3D video piece as viewed from the imaging direction of the upper camera 2, specifically the 3D video piece of Mr. A's upper body shown in FIG. 18. Similarly, Mr. B's home server 1 performs a rendering process using Mr. A's lower-body video obtained from the real video captured by the lower camera 2 and the depth data for that real video. This yields a 3D video piece as viewed from the imaging direction of the lower camera 2, specifically the 3D video piece of Mr. A's lower body shown in FIG. 18. FIG. 18 shows the 3D video pieces generated for each camera and the 3D video of Mr. A generated in the combining step described later.

In the modification, the installation height of the camera 2 that captures the part of Mr. A including the eyes (that is, the upper camera 2) differs from the height of Mr. A's eyes. For this reason, during the video piece generation step, the eye-height alignment process described earlier is performed when generating the 3D video piece of the part including Mr. A's eyes (specifically, the 3D video piece of the upper body). In other words, the modification executes a rendering process for acquiring the 3D video piece of the upper body as viewed from a virtual viewpoint at the height of Mr. A's eyes. The procedure for acquiring Mr. A's 3D video in the modification is described below with reference to FIG. 19. FIG. 19 shows this procedure.

In the video display flow according to the modification, Mr. B's home server 1 first performs the video piece generation step when generating Mr. A's 3D video (S041). In this step, Mr. B's home server 1 executes a rendering process based on texture mapping to generate 3D video pieces of Mr. A's upper body and lower body (S042, S043). Specifically, when generating the 3D video piece of the upper body, Mr. B's home server 1 first generates the 3D video piece of the upper body as viewed from the upper camera 2. It then identifies the difference between the installation height of the upper camera 2 and the height of Mr. A's eyes, and also identifies the distance (depth distance) between Mr. A and the upper camera 2. Based on these results, it obtains the rotation angle α used in the subsequent video processing. Mr. B's home server 1 then applies, to the 3D video piece of the upper body generated in the preceding step, video processing that displaces the viewpoint by a height corresponding to the rotation angle α. As a result, the 3D video piece of Mr. A's upper body is obtained as viewed from a virtual viewpoint at the height of Mr. A's eyes. That is, a 3D video piece of Mr. A's upper body in which Mr. A's gaze faces forward is obtained.
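The text says only that the rotation angle α is obtained from the height difference and the depth distance; it does not give the formula. A minimal sketch, assuming the natural geometric reading that α is the angle subtended by the height difference at the depth distance, might look like this. The function names and the arctangent relation are assumptions made for illustration.

```python
import math


def rotation_angle(camera_height, eye_height, depth_distance):
    """Angle (radians) between the camera's line to the eyes and the horizontal.

    Assumption: alpha = atan(height difference / depth distance).
    """
    if depth_distance <= 0:
        raise ValueError("depth distance must be positive")
    return math.atan2(camera_height - eye_height, depth_distance)


def virtual_viewpoint_height(camera_height, depth_distance, alpha):
    """Height of the virtual viewpoint after displacing it by the amount implied by alpha."""
    return camera_height - depth_distance * math.tan(alpha)


# Upper camera at 2.0 m, eyes at 1.5 m, Mr. A standing 1.0 m away.
alpha = rotation_angle(2.0, 1.5, 1.0)
viewpoint = virtual_viewpoint_height(2.0, 1.0, alpha)
```

Under this assumption, displacing the viewpoint by the height corresponding to α brings it back to the eye height, which is the stated goal of the eye-height alignment process.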

When generating the 3D video piece of the lower body during the video piece generation step, Mr. B's home server 1 executes video rotation processing on the 3D video piece obtained from the real video captured by the lower camera 2. Specifically, the lower camera 2 captures video of Mr. A's lower body from an imaging direction that differs from the normal direction of the display screen of the display 5, which is the reference plane. Mr. B's home server 1 performs texture mapping using the real video captured by the lower camera 2 (that is, the video captured in that imaging direction) and the depth data for that real video, and generates a 3D video piece of Mr. A's lower body. The 3D video piece generated at this stage is a 3D video piece as viewed from the imaging direction of the lower camera 2.

Mr. B's home server 1 then executes the video rotation processing on the 3D video piece as viewed from the imaging direction of the lower camera 2. This processing converts that 3D video piece into the 3D video piece that would be seen if it were viewed virtually from the normal direction of the display screen of the display 5, which is the reference plane. Specifically, the degree to which the imaging direction of the lower camera 2 is tilted with respect to the normal direction is identified as an angle (tilt angle), and the 3D video piece is rotated by that tilt angle. As a result, the 3D video piece of Mr. A's lower body is obtained as viewed from the normal direction of the reference plane. The video rotation processing is implemented using known video processing.
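The text leaves the video rotation processing to known techniques. One common way to realize it, shown here only as a sketch, is to treat the 3D video piece as a set of 3D points and rotate them about the horizontal axis by the tilt angle. The coordinate convention (x to the right, y up, z along the depth direction), the sign of the rotation, and the function name `rotate_about_x` are assumptions.

```python
import math


def rotate_about_x(points, tilt_rad):
    """Rotate (x, y, z) points about the horizontal x-axis by tilt_rad radians."""
    c, s = math.cos(tilt_rad), math.sin(tilt_rad)
    return [(x, y * c - z * s, y * s + z * c) for x, y, z in points]


# A lower-body piece captured by a camera tilted 20 degrees from the normal
# direction is rotated by the same angle to bring it into the reference frame.
tilt = math.radians(20.0)
piece = [(0.0, 0.2, 1.0), (0.1, 0.8, 1.1)]
aligned = rotate_about_x(piece, tilt)
restored = rotate_about_x(aligned, -tilt)
```

Rotating by the opposite angle recovers the original points, which is a quick sanity check that the transform is a pure rotation and does not distort the piece.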

After acquiring the 3D video pieces of the upper body and the lower body, Mr. B's home server 1 performs a combining step that joins the 3D video pieces to generate Mr. A's 3D video (S044). In this step, the 3D video pieces of the upper body and the lower body are joined so that the video region common to both pieces (specifically, the region showing Mr. A's abdomen) overlaps. When the pieces are joined, the part of the upper-body 3D video piece below the abdomen is discarded, and the part of the lower-body 3D video piece above the abdomen is discarded.
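The trimming and joining in the combining step can be sketched as follows, assuming that both 3D video pieces have already been brought into a common coordinate frame in which y is the height, and that the abdomen is represented by a single cut height. The names `join_pieces` and `abdomen_y` are hypothetical, and a real implementation would also need to align the overlapping abdomen regions before cutting.

```python
def join_pieces(upper, lower, abdomen_y):
    """Join two point sets at the abdomen height.

    Keeps the upper piece at or above abdomen_y and the lower piece below it,
    so the overlapping abdomen region is represented only once.
    """
    upper_kept = [p for p in upper if p[1] >= abdomen_y]
    lower_kept = [p for p in lower if p[1] < abdomen_y]
    return upper_kept + lower_kept


upper = [(0.0, 1.6, 0.0), (0.0, 1.0, 0.0), (0.0, 0.9, 0.0)]  # head to waist
lower = [(0.0, 1.1, 0.0), (0.0, 0.5, 0.0), (0.0, 0.0, 0.0)]  # abdomen to feet
body = join_pieces(upper, lower, abdomen_y=1.0)
```

Here the upper-body point at height 0.9 and the lower-body point at height 1.1 are discarded, leaving one continuous figure from head to feet.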

Mr. A's 3D video is complete when the combining step finishes (S045). As shown in FIG. 18, this 3D video shows Mr. A as viewed from the front (in other words, from the normal direction of the reference plane). Mr. B's home server 1 (strictly speaking, the composite video display unit 22) then combines Mr. A's 3D video obtained by the above procedure with the 3D videos of the background and the foreground, and causes the display 5 to show the composite video. In Mr. A's 3D video, the area around the joint between the 3D video pieces (specifically, around the abdomen) is displayed without looking unnatural.

To put it plainly, suppose that the upper-body 3D video piece obtained by using the real video captured by the upper camera 2 and its depth data as they are, and the lower-body 3D video piece obtained by using the real video captured by the lower camera 2 and its depth data as they are, were simply joined. If the resulting 3D video of Mr. A were shown on the display 5, the area around the joint between the 3D video pieces would appear bent (that is, Mr. A would appear to be leaning slightly forward rather than standing upright). In the present modification, by contrast, the eye-height alignment process is performed when generating the 3D video piece of the upper body. When generating the 3D video piece of the lower body, the depth data is converted into data for video viewed from the normal direction of the reference plane, and the 3D video piece is generated based on the converted depth data. This suppresses the unnatural appearance of a bend around the joint in the 3D video of Mr. A obtained by joining the 3D video pieces.

In the present modification, a plurality of cameras 2 (specifically, two cameras 2) are arranged one above the other, but the arrangement is not limited to this. For example, two cameras 2 may be arranged side by side. In that case as well, 3D video pieces (specifically, 3D video pieces of the left half and the right half of the body) are generated by the same procedure as above and joined to generate Mr. A's 3D video.

<<Other embodiments>>
The above embodiment describes the video display system and video display method of the present invention using specific examples. However, the above embodiment is merely an example to facilitate understanding of the present invention and does not limit it. That is, the present invention may be changed and improved without departing from its gist, and the present invention naturally includes its equivalents.

The above embodiment describes, as an example, the case in which two users (Mr. A and Mr. B) have a face-to-face conversation through the present system S. However, the invention is not limited to this, and three or more people may take part in a face-to-face conversation at the same time.

In the above embodiment, the series of steps relating to video display, strictly speaking the steps of generating a 3D video for each of the user (for example, Mr. A), the background, and the foreground and combining those 3D videos, are performed by the home server 1 on the side of the second user (for example, Mr. B). However, the invention is not limited to this, and the series of steps may instead be performed by the home server 1 on the side of the user (Mr. A).

In the above embodiment, the background video is video of the space corresponding to the background, captured while no user is present in that space. However, the invention is not limited to this. For example, the person video and the background video may each be separated from video in which the camera 2 captured the user and the background at the same time (that is, from the real video), and the separated background video may be used. In that case, the part of the background video that overlapped the person video is missing and must be filled in. By contrast, if background video captured while no user is present is used, no such part is missing, so no video completion is needed and the background video can be acquired more easily.

In the above embodiment, the transition processing executed when movement of the second user's face is detected shifts both the display position of the user's 3D video in the composite video and the range of the background 3D video that is included in the composite video (the display range). However, the invention is not limited to this. Only one of the display position of the user's 3D video and the display range of the background 3D video may be shifted, with the other held fixed (not shifted).

In the above embodiment, the shift amounts in the transition processing increase in the order of the display position of the foreground 3D video, the display position of Mr. A's 3D video, and the display range of the background 3D video. However, the relative magnitudes of the shift amounts may differ from this. That is, the shift amounts may increase in the order of the display range of the background 3D video, the display position of Mr. A's 3D video, and the display position of the foreground video. More specifically, if Mr. B's face moves laterally while the composite video shown in FIG. 20(A) is initially displayed on Mr. B's display 5, second transition processing is executed, and the composite video gradually transitions to the state shown in FIG. 20(B). FIG. 20 is an explanatory diagram of the second transition processing, in which (A) shows the composite video before the second transition processing and (B) shows the composite video after it.

The transition processing described earlier (that is, the transition processing shown in FIG. 11) and the second transition processing shown in FIG. 20 differ in the direction of the gaze of Mr. B, who is looking at the display 5, or strictly speaking in the object at which the gaze is directed. To put it plainly, if Mr. B were actually facing Mr. A in person and Mr. B's face moved laterally while his or her gaze was directed at Mr. A, objects farther from Mr. B would appear displaced from their original positions by larger amounts. To reproduce this appearance, in the transition processing described earlier, that is, the transition processing shown in FIG. 11, the shift amounts increase in the order of the display position of the foreground 3D video, the display position of Mr. A's 3D video, and the display range of the background 3D video. In contrast, if Mr. B's face moves laterally while his or her gaze is directed at Mr. A's background, objects closer to Mr. B appear displaced from their original positions by larger amounts. To reproduce this appearance, in the second transition processing the shift amounts increase in the order of the display range of the background 3D video, the display position of Mr. A's 3D video, and the display position of the foreground video.

The execution mode of the transition processing may be made switchable between a mode in which the shift amount of the display range of the background 3D video is largest (corresponding to the transition processing described earlier) and a mode in which the shift amount of the display position of the foreground 3D video is largest (corresponding to the second transition processing). In that case, the transition processing is executed appropriately according to the direction of Mr. B's gaze at that time.

1 Home server
2 Camera (imaging device)
3 Microphone
4 Infrared sensor
4a Light emitting unit
4b Light receiving unit
5 Display
5a Touch panel
6 Speaker
11 Data transmitting unit
12 Data receiving unit
13 Background video storage unit
14 First depth data storage unit
15 Real video storage unit
16 Person video extraction unit
17 Skeleton model storage unit
18 Second depth data storage unit
19 Foreground video extraction unit
20 Height detection unit
21 3D video generation unit
22 Composite video display unit
23 Determination unit
24 Face movement detection unit
100 Communication unit
GN External communication network
S Present system

Claims

A video display system comprising:
a video acquisition unit that acquires video of a user captured by an imaging device;
a distance data acquisition unit that, when the video is divided into a predetermined number of video pieces, acquires for each video piece distance data indicating the distance from the imaging device to the object in that video piece;
a 3D video generation unit that generates a 3D video of the user by executing a rendering process using the video of the user and the distance data; and
a height detection unit that detects the height of the eyes of the user,
wherein, when the height at which the imaging device is installed differs from the eye height detected by the height detection unit, the 3D video generation unit executes, based on the difference between the two heights and on the distance between the imaging device and the user, the rendering process for acquiring the 3D video of the user as viewed from a virtual viewpoint at the eye height detected by the height detection unit.

The video display system according to claim 1, wherein
the video acquisition unit acquires both the video of the user captured by the imaging device and video of a background captured by the imaging device,
the distance data acquisition unit acquires the distance data for each of the video of the user and the video of the background,
the 3D video generation unit generates the 3D video of the user by executing the rendering process using the video of the user and the distance data acquired for the video of the user, and generates a 3D video of the background by executing the rendering process using the video of the background and the distance data acquired for the video of the background, and
the video display system further comprises a composite video display unit that combines the 3D video of the user with the 3D video of the background and causes a display to show a composite video in which the user is positioned in front of the background.

The video display system according to claim 2, wherein
the video acquisition unit further acquires video of a foreground captured by the imaging device,
the distance data acquisition unit further acquires the distance data for the video of the foreground,
the 3D video generation unit further generates 3D video of the foreground by executing the rendering process using the video of the foreground and the distance data acquired for the video of the foreground, and
the composite video display unit combines the 3D video of the user, the 3D video of the background, and the 3D video of the foreground, and displays on the display the composite video in which the user is positioned in front of the background and the foreground is positioned in front of the user.

The video display system according to claim 2, further comprising a determination unit that determines, based on the distance data, whether the distance between the imaging device and the user has changed,
wherein, when the determination unit determines that the distance between the imaging device and the user has changed while the imaging device is capturing the video of the user, the composite video display unit adjusts the display size of the video of the user to the display size before the distance between the imaging device and the user changed.
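
As an illustration of the size adjustment in the claim above, here is a short sketch that assumes a pinhole model in which apparent size is inversely proportional to distance. The function names and the `tolerance_m` threshold are hypothetical; the claim only requires that a change in distance is determined and that the earlier display size is restored.

```python
def distance_changed(before_m: float, after_m: float, tolerance_m: float = 0.05) -> bool:
    """Determination step: has the camera-to-user distance changed noticeably?

    tolerance_m is an assumed noise threshold, not a value from the patent.
    """
    return abs(after_m - before_m) > tolerance_m


def compensation_scale(before_m: float, after_m: float) -> float:
    """Scale to apply to the user's video so it keeps its earlier display size.

    Under a pinhole model apparent size is proportional to 1 / distance, so a
    user who moves from before_m to after_m appears scaled by before_m / after_m
    and is restored by the inverse factor.
    """
    if before_m <= 0 or after_m <= 0:
        raise ValueError("distances must be positive")
    return after_m / before_m
```

For example, a user who steps back from 2 m to 3 m appears at two thirds of the earlier size, so the sketch returns a compensating scale of 1.5.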

The video display system according to claim 2, further comprising a face movement detection unit that detects that the face of a second user viewing the composite video displayed on the display has moved in the width direction of the display,
wherein, when the face movement detection unit detects the movement of the face, the composite video display unit executes a transition process that transitions the composite video displayed on the display from its state before the face movement detection unit detected the movement of the face, and, in the transition process, transitions the composite video to a state in which one of the display position of the 3D video of the user in the composite video and the range of the 3D video of the background included in the composite video is shifted in the width direction by an amount larger than the shift amount of the other.
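
The transition in the claim above amounts to shifting two layers by unequal amounts when the viewer's face moves sideways. The sketch below assumes each shift is simply proportional to the detected face movement; `layer_shifts` and both gain values are hypothetical, and the claim leaves open which of the two layers receives the larger shift.

```python
def layer_shifts(face_dx: float, user_gain: float = 0.2,
                 background_gain: float = 0.6) -> tuple[float, float]:
    """Return (user_shift, background_shift) along the display width.

    face_dx is the detected sideways movement of the viewer's face. Both layers
    move, but by different amounts, which is what produces the motion parallax.
    The gains are illustrative assumptions, not values from the patent.
    """
    if user_gain == background_gain:
        raise ValueError("gains must differ, otherwise there is no parallax")
    return (user_gain * face_dx, background_gain * face_dx)
```

With the default gains the background range moves three times as far as the user's display position, so the user appears to stand in front of the background as the viewer moves.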

The video display system according to any one of claims 2 to 5, wherein
the video acquisition unit acquires, for each imaging device, the video of the user captured by a plurality of the imaging devices that capture the video of the user in mutually different imaging directions,
the distance data acquisition unit acquires, for each imaging device, the distance data for the video of the user,
the 3D video generation unit executes:
a video piece generation step of generating, for each imaging device, a 3D video piece of the user based on the video of the user acquired for that imaging device and the distance data acquired for that imaging device; and
a combining step of combining the 3D video pieces of the user generated for the respective imaging devices so that common video areas included in the respective 3D video pieces overlap each other, thereby generating the 3D video of the user, and
when generating, in the video piece generation step, the 3D video piece of the portion including the user's eyes, the 3D video generation unit executes, if the two heights differ, the rendering process for acquiring the 3D video piece as viewed from the virtual viewpoint, based on the difference between the two heights and the distance between the imaging device and the user.

The video display system according to claim 6, wherein, when the imaging direction differs from the normal direction of a reference plane, the 3D video generation unit converts, in the video piece generation step, the 3D video piece of the user generated based on the video captured in that imaging direction into the 3D video piece as viewed virtually from the normal direction.
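
One way to picture the conversion in the claim above is as a rotation of each 3D point from the camera frame into a frame aligned with the normal of the reference plane. The sketch below assumes the imaging direction differs from the normal only by a rotation about the vertical axis; `to_normal_view` and `yaw_rad` are hypothetical names, and a real system may also need to handle tilt and camera position.

```python
import math


def to_normal_view(point: tuple[float, float, float],
                   yaw_rad: float) -> tuple[float, float, float]:
    """Re-express a camera-frame point in a frame whose z axis is the normal
    of the reference plane.

    yaw_rad is the camera's rotation about the vertical (y) axis relative to
    that normal, positive toward +x. Only yaw is handled (an assumption).
    """
    x, y, z = point
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return (x * c + z * s, y, -x * s + z * c)
```

A point one metre straight ahead of a camera turned 90 degrees toward +x ends up one metre to the side in the normal-aligned frame, and a yaw of zero leaves every point unchanged.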

A video display method comprising:
acquiring, by a computer, video of a user captured by an imaging device;
acquiring, by the computer, when the video is divided into a predetermined number of video pieces, distance data indicating, for each video piece, the distance from the imaging device to the object in that video piece;
generating, by the computer, 3D video of the user by executing a rendering process using the video of the user and the distance data; and
detecting the height of the user's eyes,
wherein, when the height at which the imaging device is installed differs from the detected eye height, the computer executes, based on the difference between the two heights and the distance between the imaging device and the user, the rendering process for acquiring the 3D video of the user as viewed from a virtual viewpoint at the detected eye height.