WO2016159166A1 - Image display system and image display method - Google Patents

Image display system and image display method

Info

Publication number
WO2016159166A1
WO2016159166A1 (PCT/JP2016/060533)
Authority
WO
WIPO (PCT)
Prior art keywords
video
user
display
image
imaging device
Prior art date
Application number
PCT/JP2016/060533
Other languages
French (fr)
Japanese (ja)
Inventor
康夫 高橋
吏 中野
貴司 折目
雄一郎 竹内
暦本 純一
宮島 靖
Original Assignee
大和ハウス工業株式会社 (Daiwa House Industry Co., Ltd.)
ソニー株式会社 (Sony Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 大和ハウス工業株式会社 (Daiwa House Industry Co., Ltd.) and ソニー株式会社 (Sony Corporation)
Publication of WO2016159166A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/18 Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast

Definitions

  • The present invention relates to an image display system and an image display method, and more particularly to an image display system and an image display method for displaying the image of a conversation partner at a remote location on the display viewed by the person conversing with them.
  • a communication system (hereinafter referred to as a video display system) that enables users in remote spaces to interact while watching each other's video is already known.
  • In such a system, video data of the captured video is transmitted from one user's side, and the video data is received and decoded on the other user's side. As a result, the video of one user is displayed on the other user's display.
  • users watching each other's images on the display feel as if they are facing each other.
  • In some known systems, the installation position of the camera is determined so that the users' eye positions coincide, which fixes the eye height that the system can accommodate. That is, since the installation position of the imaging camera is fixed, the system is ill-suited to a person whose line of sight is at a height different from that installation position (specifically, the positions of the lines of sight do not match).
  • Alternatively, the position of the camera can be adjusted according to the eye height of the person viewing the display, but since an adjustment mechanism must be provided, the system becomes expensive to construct.
  • It is also desirable that the displayed video follow the movement of the person watching the display (especially movement of the face) and the corresponding change in how the person shown on the display should appear. In particular, when the face of the person watching the display moves sideways, it is desirable to switch the video on the display so that it reflects how the other person would appear if the viewer moved their face sideways while actually facing that conversation partner.
  • Furthermore, the farther a subject is from the camera, the smaller the display size of the subject's image on the display. In an actual face-to-face situation, however, when the other person moves slightly farther away, there is almost no change in that person's apparent size.
  • The present invention has been made in view of the above problems. A first object of the present invention is to provide a video display system and a video display method capable of improving the realism of a dialogue conducted while displaying the user's video on a display, even when the eye height of the user shown on the display differs from the installation height of the imaging device. A second object of the present invention is to change the video displayed on the display so that it reflects the actual appearance when the face of the second user watching the user's video on the display moves sideways. A third object of the present invention is to appropriately adjust the display size of the user's video on the display when the distance between the user and the imaging device changes.
  • The above problem is solved by a video display system comprising: (A) a video acquisition unit that acquires a video of a user captured by an imaging device; (B) a distance data acquisition unit that, for each video piece when the video is divided into a predetermined number of video pieces, acquires distance data indicating the distance between the imaging device and the object in that video piece; (C) a 3D video generation unit that generates a 3D video of the user by executing a rendering process using the user's video and the distance data; and (D) a height detection unit that detects the height of the user's eyes. (E) When the height at which the imaging device is installed differs from the eye height detected by the height detection unit, the 3D video generation unit executes the rendering process so as to acquire the 3D video of the user as viewed from a virtual viewpoint at the eye height detected by the height detection unit, based on the difference between the two heights and the distance between the imaging device and the user.
  • In the above video display system, the user's 3D video is generated by executing a rendering process using the user's video captured by the imaging device and the distance data acquired for that video. If the height of the user's eyes differs from the height at which the imaging device is installed, the rendering process is executed so as to obtain the user's 3D video as viewed from a virtual viewpoint at the same height as the user's eyes. Through such rendering, a standard 3DCG technique, a 3D video of the user as seen virtually from the user's own eye height is obtained, so that even when the two heights differ, the eyes of the person watching the display and the eyes of the person shown on the display can be made to meet. As a result, the realism of the dialogue conducted while displaying the user's video on the display can be improved.
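As a rough orientation, the following Python sketch wires the units listed above together. All class and method names are hypothetical, every body is a stub, and only the data flow (video, per-pixel distance data, detected eye height, virtual-viewpoint parameters) follows the description; it is not an implementation from the patent.

```python
import numpy as np

class VideoDisplaySystemSketch:
    """Toy skeleton of the units named above; every method body is a stub."""

    def acquire_video(self):
        # video acquisition unit: one RGB frame from the imaging device (stubbed)
        return np.zeros((480, 640, 3), dtype=np.uint8)

    def acquire_distance_data(self):
        # distance data acquisition unit: one distance per video piece (pixel), in metres
        return np.full((480, 640), 2.0, dtype=np.float32)

    def detect_eye_height(self, depth_map):
        # height detection unit: stubbed to a fixed eye height in metres
        return 1.6

    def generate_3d_video(self, rgb, depth_map, camera_height_m, eye_height_m):
        # 3D video generation unit: when the two heights differ, the rendering is
        # done from a virtual viewpoint raised (or lowered) to the eye height,
        # using the height difference and the camera-to-user distance
        user_distance_m = float(np.median(depth_map))
        viewpoint_offset_m = eye_height_m - camera_height_m
        # a real implementation would build a mesh from depth_map, texture it
        # with rgb, and render from the displaced viewpoint; here we only return
        # the parameters such a render would need
        return {"viewpoint_offset_m": viewpoint_offset_m, "user_distance_m": user_distance_m}

unit = VideoDisplaySystemSketch()
rgb, depth = unit.acquire_video(), unit.acquire_distance_data()
print(unit.generate_3d_video(rgb, depth, camera_height_m=1.0,
                             eye_height_m=unit.detect_eye_height(depth)))
```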
  • Preferably, the video acquisition unit acquires both the user's video and a background video captured by the imaging device; the distance data acquisition unit acquires the distance data for each of the user's video and the background video; the 3D video generation unit generates the user's 3D video by executing the rendering process using the user's video and the distance data acquired for it, and generates a 3D video of the background by executing the rendering process using the background video and the distance data acquired for it; and the system has a composite video display unit that combines the user's 3D video and the background 3D video and displays on the display a composite video in which the user is positioned in front of the background.
  • the 3D video of the user and the 3D video of the background are synthesized, and the synthesized video in which the user is positioned in front of the background is displayed.
  • With a composite video having such a sense of depth, the realism of the dialogue conducted while displaying the user's video on the display is further improved.
  • the video acquisition unit further acquires a foreground video captured by the imaging device
  • the distance data acquisition unit further acquires the distance data regarding the foreground video.
  • It is more preferable that the 3D video generation unit further generates a 3D video of the foreground by executing the rendering process using the foreground video and the distance data acquired for it, and that the composite video display unit combines the user's 3D video, the background 3D video, and the foreground 3D video and displays on the display a composite video in which the user is positioned in front of the background and the foreground is positioned in front of the user.
  • the 3D video of the foreground is further synthesized, and the synthesized video with the foreground positioned in front of the user is displayed.
  • a composite image having a greater sense of depth is displayed.
  • the realism of the dialogue performed while displaying the user's video on the display is further improved.
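The layering itself can be pictured with a minimal 2D compositing sketch. In the actual system the 3DCG renderer places the three 3D videos at different depths, as noted for the renderer later in the description; the back-to-front alpha blend below is only a simplified stand-in for that ordering, and all array contents are made up.

```python
import numpy as np

def composite(background, user, foreground):
    """Paint the layers back-to-front so the user appears in front of the
    background and the foreground appears in front of the user. Each layer is
    an RGBA float array in [0, 1]; alpha = 0 where the layer has no content."""
    out = background[..., :3].copy()
    for layer in (user, foreground):                  # back-to-front order
        alpha = layer[..., 3:4]
        out = alpha * layer[..., :3] + (1.0 - alpha) * out
    return out

h, w = 4, 6
bg  = np.concatenate([np.full((h, w, 3), 0.2), np.ones((h, w, 1))], axis=-1)
usr = np.zeros((h, w, 4)); usr[1:3, 2:4] = [0.8, 0.6, 0.5, 1.0]   # "person" patch
fg  = np.zeros((h, w, 4)); fg[2:4, 0:2]  = [0.1, 0.4, 0.1, 1.0]   # "foreground" patch
print(composite(bg, usr, fg).shape)
```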
  • The video display system may further include a determination unit that determines, based on the distance data, whether the distance between the imaging device and the user has changed while the imaging device is capturing the user's video. When the determination unit determines that the distance has changed, it is more preferable that the composite video display unit adjust the display size of the user's video in the composite video to the display size used before the distance between the imaging device and the user changed. With this configuration, even when the distance between the imaging device and the user, that is, the depth distance, changes, the display continues to show the user's 3D video at the display size used before the change.
  • As a result, the composite video after the change reflects how the user actually looks to someone facing them (that is, the user's size as perceived through one's own vision), since a slight change in distance hardly alters a person's apparent size in a real face-to-face situation, and the user's 3D video is displayed at that display size.
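A simple way to picture this size adjustment, assuming ordinary pinhole-camera geometry (an assumption, not a statement from the text): the captured image of the user shrinks roughly in proportion to 1/depth, so multiplying the display scale by the ratio of the new depth to the old one restores the pre-change display size.

```python
def compensated_scale(scale_before, depth_before_m, depth_now_m):
    """Keep the user's on-screen size at the value used before the depth change
    by cancelling the 1/depth shrink of the captured image (pinhole assumption)."""
    return scale_before * (depth_now_m / depth_before_m)

# displayed life-size (scale 1.0) at 1.5 m; the user then steps back to 2.0 m
print(round(compensated_scale(1.0, 1.5, 2.0), 2))   # 1.33, cancelling the apparent shrink
```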
  • the video display system may further include a face movement detection unit that detects that the face of a second user who views the composite video displayed on the display has moved in the width direction of the display.
  • When the face movement detection unit detects that the face has moved, the composite video display unit performs a transition process that transitions the composite video displayed on the display from the state it was in before the movement was detected. It is even more preferable that, in the transition process, the composite video transition to a state in which one of the display position of the user's 3D video in the composite video and the range of the background 3D video included in the composite video is shifted along the width direction by an amount larger than the shift amount of the other.
  • In a composite video obtained by combining the user's video and the background video, the display position, display size, and the like can be adjusted individually for each of the user's video and the background video. Taking advantage of this, the composite video is transitioned to a state in which one of the display position of the user's 3D video and the range of the background 3D video included in the composite video is shifted in the horizontal direction by an amount larger than the shift amount of the other. As a result, an image reproducing how the user would actually appear when seen from the new face position comes to be displayed.
  • the realism of the dialogue performed while displaying the user's video on the display is further improved.
  • More preferably, when a plurality of imaging devices capture the user's video from different imaging directions, the video acquisition unit acquires the user's video for each imaging device, and the distance data acquisition unit acquires the distance data regarding the user's video for each imaging device. The 3D video generation unit performs a video piece generation step of generating a 3D video piece of the user for each imaging device, based on the user's video and the distance data acquired for that imaging device, and then generates the user's 3D video by joining the 3D video pieces generated for the respective imaging devices. When generating the 3D video piece of the portion that includes the user's eyes, if the installation height of the imaging device differs from the detected eye height, the rendering process is executed so as to acquire that 3D video piece as viewed from the virtual viewpoint, based on the difference between the two heights and the distance between the imaging device and the user. With this configuration, when the user's video is captured by a plurality of imaging devices with different imaging directions, a 3D video piece is generated for each imaging device, and the pieces are finally joined to obtain the user's 3D video. Among the 3D video pieces generated for the respective imaging devices, the piece containing the user's eyes is rendered as seen from a virtual viewpoint at the height of the user's eyes. As a result, if the user's 3D video formed by joining the 3D video pieces is displayed on the display, the user's line of sight and the line of sight of the person watching the display can be made to match.
  • It is still more preferable that, when an imaging direction differs from the normal direction of a reference plane, the 3D video generation unit convert, in the video piece generation step, the user's 3D video piece generated from the video captured in that imaging direction into a 3D video piece as virtually viewed from the normal direction. With this configuration, when the user's video is captured in an imaging direction different from the normal direction of the reference plane and a 3D video piece is generated from that video, the 3D video piece generated from the captured video is converted into a 3D video piece viewed virtually from the normal direction, and the user's 3D video is obtained using the converted piece. The 3D video obtained in this way is a video as seen from the normal direction and is displayed appropriately on the display. More specifically, it becomes possible to prevent the vicinity of the joints between the 3D video pieces in the user's 3D video from appearing bent.
  • The above-described problem is also solved by a video display method in which (A) a computer acquires a video of a user captured by an imaging device, (B) the computer acquires, for each video piece when the video is divided into a predetermined number of video pieces, distance data indicating the distance between the imaging device and the object in that video piece, (C) the computer generates a 3D video of the user by executing a rendering process using the user's video and the distance data, and (D) the computer detects the height of the user's eyes; (E) when the height at which the imaging device is installed differs from the detected eye height, the computer executes the rendering process so as to acquire the user's 3D video as viewed from a virtual viewpoint at the detected eye height, based on the difference between the two heights and the distance between the imaging device and the user.
  • According to the video display system and the video display method of the present invention, even if the user's eye height differs from the installation height of the imaging device, the eyes of the person viewing the display (that is, the second user) and the line of sight of the person shown on the display (that is, the user) can be made to meet. In addition, when the face of the second user moves sideways, the video displayed on the display can transition to a video reproducing how the user would appear when actually faced and viewed from the new face position. Furthermore, when the distance (depth distance) between the imaging device and the user changes, the display size of the user's 3D video in the composite video shown on the display can be kept at the display size used before the depth distance changed.
  • Brief description of the drawings: FIGS. 3A and 3B are diagrams showing a configuration example of the display of the present invention; further figures are explanatory drawings of the video composition procedure and of the procedure for extracting a person video.
  • The video display system according to the present embodiment (hereinafter, system S) is used by users in rooms separated from each other to converse while watching each other's appearance (video). More specifically, a display serving as the video display device is installed in the room where each user is, and the other party's video is shown on this display. As a result, each user feels as if the display were a pane of glass (for example, window glass or door glass) and they were interacting with the other party through that glass.
  • The system S is assumed to be used while each user is at his or her home. That is, the system S is used so that each user can hold a conversation with a conversation partner (a pseudo face-to-face conversation, hereinafter simply referred to as a "face-to-face conversation") while at home.
  • However, the present system S is not limited to this; it may also be used when the user is somewhere other than at home, such as a meeting place, a commercial facility, a school classroom, a public facility such as a hospital, a company, or an office. The system S may also be used so that people in rooms apart from each other within the same building can have a face-to-face conversation.
  • In the following description, Mr. A corresponds to the "user" and Mr. B corresponds to the "second user". Note that "user" and "second user" are relative concepts that switch according to the relationship between the person shown in the video and the person viewing it; when Mr. A's viewpoint is taken as the reference, Mr. B corresponds to the "user" and Mr. A corresponds to the "second user".
  • This system S is used so that two users (namely, Mr. A and Mr. B) can hold a face-to-face conversation while watching each other's video; more specifically, for each user a life-size video of the conversation partner is displayed and the other party's voice is played back.
  • Each user has a communication unit 100. That is, this system S is composed of the communication units 100 that the individual users possess.
  • FIG. 1 is a diagram showing the configuration of the system S, more specifically, the configuration of each communication unit 100.
  • Each communication unit 100 includes a home server 1, a camera 2 as an imaging device, a microphone 3 as a sound collection device, an infrared sensor 4, a display 5 as a video display, and a speaker 6 as main components.
  • the camera 2, the microphone 3, the infrared sensor 4, the display 5, and the speaker 6 are arranged in a predetermined room at the home of each user (for example, a room used when performing a face-to-face conversation).
  • the home server 1 is a central device of the system S, and includes a computer, specifically, a server computer constituting a home gateway.
  • the configuration of the home server 1 is publicly known and includes a CPU, a memory such as a ROM and a RAM, a communication interface, a hard disk drive, and the like.
  • the home server 1 is installed with a program for executing data processing necessary for realizing the face-to-face conversation (hereinafter referred to as a dialogue program).
  • This interactive program incorporates a program for displaying 3D images.
  • This program is a program for constructing and displaying a 3D image by 3D computer graphics (hereinafter, 3DCG), and is a so-called renderer.
  • The 3DCG renderer has a function of combining a plurality of 3D videos. When a video formed by combining a plurality of 3D videos, that is, a composite video, is displayed on the display 5, the combined 3D videos appear to be arranged at different positions in the depth direction of the display 5.
  • the home server 1 is connected in a communicable state with a communication device via an external communication network GN such as the Internet. That is, the home server 1 belonging to the communication unit 100 owned by Mr. A communicates with the home server 1 belonging to the communication unit 100 owned by Mr. B via the external communication network GN, and transmits and receives various data between the two servers. I do.
  • the data transmitted and received by the home server 1 is data necessary for a face-to-face conversation, for example, video data indicating video of each user and audio data indicating audio.
  • the camera 2 is a known network camera and captures an image of a subject within an imaging range (angle of view).
  • In the present embodiment, the "video" is made up of a set of consecutive frame images (RGB images); in the following description, however, the term covers individual frame images as well as sets of frame images.
  • the imaging range of the camera 2 is fixed. For this reason, the camera 2 always captures an image of a predetermined area of the space in which the camera 2 is installed during its activation.
  • the camera 2 outputs a signal indicating the captured video (video signal) to the home server 1 belonging to the same unit as the communication unit 100 to which the camera 2 belongs.
  • the number of cameras 2 installed is not particularly limited, in the present embodiment, only one camera 2 is provided in each communication unit 100 in consideration of cost.
  • FIG. 2 is a diagram showing arrangement positions of various devices arranged in the rooms of Mr. A and Mr. B as the components of the system S. Note that the position of the camera 2 may be a position away from the display 5.
  • the camera 2 can capture the whole body image from the person's face to the foot.
  • the “whole body image” may be a whole body image in a standing posture or a whole body image in a sitting posture.
  • the “whole body image” includes an image in which a part of the body is hidden by an object placed in front.
  • In the present embodiment, the camera 2 is installed at a height of about 1 m above the floor. For this reason, when the height (strictly speaking, the eye height) of a person standing in front of the display 5 is greater than the installation position of the camera 2, the camera 2 images the face of that person, the subject, from below.
  • the height at which the camera 2 is installed (in other words, the position of the camera 2 in the vertical direction) is not particularly limited, and can be set to an arbitrary height.
  • the microphone 3 collects sound in the room in which the microphone 3 is installed, and the sound signal is sent to the home server 1 (strictly, the home server 1 belonging to the same unit as the communication unit 100 to which the microphone 3 belongs). Output.
  • the microphone 3 is installed at a position directly above the display 5 as shown in FIG.
  • the infrared sensor 4 is a so-called depth sensor, and is a sensor for measuring the depth of a measurement object (corresponding to the object) by an infrared method. Specifically, the infrared sensor 4 irradiates infrared rays from the light emitting unit 4a toward the measurement object, and measures the depth by receiving the reflected light at the light receiving unit 4b. More specifically, the light emitting unit 4 a and the light receiving unit 4 b of the infrared sensor 4 face the display screen forming surface of the display 5. On the other hand, a film capable of transmitting infrared light is attached to a portion of the touch panel 5a of the display 5 constituting the forming surface at a position immediately before the infrared sensor 4. The infrared light reflected from the measurement object after being irradiated from the light emitting unit 4a passes through the film and is received by the light receiving unit 4b.
  • The light receiving position of the light receiving unit 4b of the infrared sensor 4 is set so as to coincide with the surface position of the lens of the camera 2 in the depth direction of the display 5 (strictly, the normal direction of the display screen).
  • the depth measurement result is obtained for each pixel when the image captured by the camera 2 is divided into a predetermined number of image pieces (pixels).
  • In the present embodiment, the measurement result is handled as depth data (corresponding to the distance data). This depth data defines the measurement result of the infrared sensor 4, that is, the depth, for each pixel of the video (strictly, each frame image) captured by the camera 2. In other words, the depth data for a video is a depth map of that video: for the pixel group corresponding to an object in the video captured by the camera 2, the depth data specifies the distance (depth value) to that object. Specifically, as shown in the figure, pixels corresponding to objects at different depths are clearly distinguished: the black pixels correspond to the background video, the hatched pixels correspond to the video of an object in front of the background, and the white pixels correspond to the video of the person in front.
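A toy illustration of such a depth map, and of splitting it into background, foreground, and person regions, is sketched below. The depth band assumed for the person is made up for the example; the real system identifies the person's pixel group via the skeleton model described later.

```python
import numpy as np

# toy depth map in metres: person ~1.2 m, box in front ~0.8 m, wall ~3.0 m
depth = np.array([[3.0, 3.0, 1.2, 1.2],
                  [3.0, 0.8, 1.2, 3.0],
                  [0.8, 0.8, 1.2, 3.0]])

person_band = (1.0, 1.5)          # assumed depth band occupied by the person
foreground  = depth < person_band[0]
person      = (depth >= person_band[0]) & (depth <= person_band[1])
background  = depth > person_band[1]

print(person.astype(int))          # 1 where the pixel belongs to the person video
```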
  • the position of a person can be specified from the depth data.
  • a position detection sensor may be installed separately from the infrared sensor 4, and the position of the person may be specified from the detection result of the position detection sensor.
  • the speaker 6 emits sound (reproduced sound) that is reproduced when the home server 1 develops the sound data, and is configured by a known speaker.
  • In the present embodiment, a plurality of speakers 6 are installed at positions flanking the display 5 in the width direction of the display 5.
  • the display 5 forms a video display screen. More specifically, the display 5 has a panel made of transparent glass, and forms a display screen on the front surface of the panel.
  • the above-described panel is the touch panel 5a and receives an operation (touch operation) performed by the user.
  • the above panel has a size sufficient to display a whole body image of a person.
  • the whole body image of the conversation partner is displayed in a life-size size on the display screen formed on the front surface of the panel. That is, it is possible to display Mr. A's whole body image in a life-size size on the display 5 on the Mr. B side.
  • As a result, Mr. B, who is looking at the display screen, feels as if he were actually meeting Mr. A, in particular as if he were facing him through glass.
  • FIGS. 3A and 3B are diagrams showing a configuration example of the display 5 used in the present system S.
  • FIG. 3A shows the state during non-dialogue, and FIG. 3B shows the state during a face-to-face conversation.
  • In the present embodiment, the touch panel 5a of the display 5 forms part of a full-length mirror placed in the room where the face-to-face conversation is held, specifically its mirror surface. As shown in FIG. 3A, the touch panel 5a does not form a display screen during non-dialogue, that is, while no video is displayed; in other words, during non-dialogue the display 5 of the present system S looks like a full-length mirror. When a face-to-face conversation is started, on the other hand, the touch panel 5a forms a display screen on its front surface, and as shown in FIG. 3B the display 5 shows the conversation partner and the background video on the front surface of the touch panel 5a.
  • the home server 1 is to switch the display screen on and off according to the measurement result of the infrared sensor 4. More specifically, when the user stands at the front position of the display 5 in starting the face-to-face conversation, the camera 2 captures an image including the user (hereinafter referred to as an actual image) and the infrared sensor 4 measures the depth. . Thereby, the depth data about the actual video is acquired, and the home server 1 specifies the distance between the user and the camera 2, that is, the depth distance based on the depth data. When the depth distance is equal to or less than the predetermined distance, the home server 1 controls the display 5 to form a display screen on the front surface of the touch panel 5a.
  • As a result, the touch panel 5a of the display 5, which until then had functioned as a full-length mirror, functions as a screen for displaying video. Conversely, when the user is no longer in front of the display 5, the home server 1 controls the display 5 to turn off the display screen that had been formed, and the display 5 functions as a full-length mirror again. In this way, in the present system S, the display 5 serves as a full-length mirror during non-dialogue, which makes the presence of the display screen hard to notice while no conversation is taking place. During a conversation, on the other hand, the display screen is formed and the conversation partner's video is displayed, giving the user the visual effect of interacting with the partner through glass.
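The screen on/off behaviour described above reduces to a small predicate on the measured depth distance; the threshold value below is an arbitrary stand-in for the "predetermined distance" mentioned in the text.

```python
def display_screen_should_be_on(user_depth_m, near_threshold_m=1.5):
    """Form the display screen when a user stands within the threshold distance
    of the display (taken from the depth data); otherwise the touch panel keeps
    acting as a full-length mirror."""
    return user_depth_m is not None and user_depth_m <= near_threshold_m

print(display_screen_should_be_on(1.2))    # True: start face-to-face mode
print(display_screen_should_be_on(None))   # False: nobody in front, mirror again
```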
  • For the configuration in which a display screen appears on a mirror-like surface, a well-known structure can be used, such as the structure described in International Publication No. 2009/122716, for example. However, the display 5 is not limited to a configuration that also serves as a full-length mirror. The device used as the display 5 only needs to be large enough to display the whole-body image of the conversation partner. For example, the display 5 may also serve as a building material such as a glass door or a glass window, or as an item of furniture such as a full-length mirror; alternatively, an ordinary display that constantly forms a display screen while powered on may be used.
  • Mr. A's video and its background video are displayed on the Mr. B's display 5, and Mr. B's video and his background video are displayed on the Mr. A's display 5.
  • the person image and the background image displayed on each display 5 are not captured simultaneously by the camera 2 but are captured at different timings. That is, each display 5 displays a composite video obtained by synthesizing a human video and a background video captured at different timings. Further, in the present system S, in addition to the person video and the background video, a synthesized video obtained by further synthesizing the foreground video is displayed.
  • FIG. 4 is an explanatory diagram of the video composition procedure.
  • In the following, the case where Mr. A's video, background video, and foreground video are combined will be described as a specific example.
  • The background video (indicated by the symbol Pb in FIG. 4) is a video of the area within the imaging range of the camera 2 in the room that Mr. A uses for face-to-face conversations. In the present system S, the camera 2 captures the background video while Mr. A is not in the room. The timing at which the background video is captured can be set arbitrarily, as long as it falls within a period when Mr. A is not in the room.
  • The person video (specifically, Mr. A's video, indicated by the symbol Pu in FIG. 4) is taken from the video captured while Mr. A is in the room, strictly speaking within the imaging range of the camera 2, that is, from the real video. The real video includes a background video and a foreground video in addition to the person video, and in the present system S the person video is extracted from the real video and used.
  • a method for extracting a person image from a real image is not particularly limited, and an example is a method for extracting a person image using the above-described depth data.
  • FIG. 5 is an explanatory diagram of a procedure for extracting a person video from a real video.
  • Note that, for convenience of illustration, the pixels constituting the depth data in the figure are drawn coarser than the actual pixels.
  • the infrared sensor 4 measures the depth of the measurement object within the angle of view of the camera 2.
  • depth data about the actual video is obtained.
  • the depth data for the real video is obtained by defining the measurement result of the infrared sensor 4, that is, the depth for each pixel when the frame image constituting the real video is divided into a predetermined number of pixels.
  • In the depth data for the real video, as shown in FIG. 5, the depth of the pixels belonging to the person video (the white pixels in the figure) clearly differs from that of the pixels belonging to the other videos (the black pixels and the hatched pixels in the figure).
  • Mr. A's skeleton model is identified based on the depth data and the captured image of the camera 2 (strictly, information for identifying the position of the image of Mr. A's face in the captured image).
  • The skeleton model is a simplified model of positional information about Mr. A's skeleton (specifically, the head, shoulders, elbows, wrists, center of the upper body, waist, knees, and ankles), as shown in the figure. As the method for identifying the skeleton model, a known method can be used, for example the same method as employed in the inventions described in Japanese Patent Application Laid-Open No. 2014-155893 and Japanese Patent Application Laid-Open No. 2013-116311.
  • the person image is extracted from the actual image based on the skeleton model.
  • A detailed description of the technique for extracting a person video from a real video based on a skeleton model is omitted here; only the rough procedure is described. First, the pixel group belonging to Mr. A's person video is identified based on the skeleton model. Next, the area corresponding to the identified pixel group is extracted from the real video. The video extracted by this procedure corresponds to Mr. A's person video within the real video.
  • the foreground video (indicated by the symbol Pf in FIG. 4) is extracted from the actual video and used in the same manner as the person video.
  • The method for extracting the foreground video from the real video is not particularly limited either. For example, a method using the depth data, as in the case of the person video, is conceivable. More specifically, a pixel group whose depth is smaller than that of the pixels belonging to the person video is identified in the depth data of the real video, and the video corresponding to the identified pixel group is then extracted from the real video as the foreground video.
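A compact sketch of both extraction steps follows, using a depth band in place of the skeleton model and made-up array contents: the person's pixel group is taken from the depth data, the foreground is whatever lies closer than that group, and each mask is used to cut the corresponding video out of the real-video frame.

```python
import numpy as np

rgb   = np.random.randint(0, 256, (3, 4, 3), dtype=np.uint8)   # stand-in for a real-video frame
depth = np.array([[3.0, 1.2, 1.2, 3.0],
                  [0.8, 1.2, 1.2, 3.0],
                  [0.8, 1.2, 1.2, 3.0]])

person_mask     = (depth >= 1.0) & (depth <= 1.5)     # pixel group belonging to the person
foreground_mask = depth < depth[person_mask].min()    # anything closer than the person

def cut_out(frame, mask):
    """Return an RGBA layer holding only the masked part of the frame."""
    layer = np.zeros(frame.shape[:2] + (4,), dtype=np.uint8)
    layer[..., :3][mask] = frame[mask]
    layer[..., 3][mask] = 255
    return layer

person_video, foreground_video = cut_out(rgb, person_mask), cut_out(rgb, foreground_mask)
print(person_video.shape, int(foreground_video[..., 3].sum() / 255))
```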
  • In composing the video, the background video, the person video, and the foreground video are combined. More specifically, within the background video captured by the camera 2, the portion to be actually displayed on the display 5 (the range surrounded by a broken line in FIG. 4, hereinafter referred to as the display range) is set.
  • the display range corresponds to a portion included in the synthesized video in the background video captured by the camera 2.
  • the size of the display range is determined according to the size of the display 5.
  • the initial (default) display range is set at the center of the background video.
  • the initial display range is not particularly limited, and may be a portion other than the central portion of the background video.
  • the composite video is displayed as the display video on the display 5.
  • the display position, the display size, and the like can be individually adjusted for each of the human video, the background video, and the foreground video.
  • the display size of the video of Mr. A which is a person video, can be adjusted without changing the display size of the background video and the foreground video.
  • the display size of Mr. A's video is adjusted to coincide with Mr. A's actual size (life size).
  • Mr. A's video is displayed in a life-size size on the display 5 on the Mr. B side, and the realism of the face-to-face conversation using the system S is further improved.
  • the display size of the person video is not limited to the life size.
  • Here, the life-size size means the size at which the person video captured by the camera 2 is displayed as it is when the person is located at a position away from the camera 2 by a predetermined distance (specifically, the distance d1 in FIG. 10B described later, hereinafter referred to as the reference distance).
  • the reference distance d1 is set in advance and stored in the memory of the home server 1.
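One plausible calibration for life-size display, under assumptions not stated in the text (a display whose physical pixel pitch is known and a person of known height), is sketched below; in the described system the same effect follows from showing the image captured at the reference distance d1 as it is.

```python
def life_size_scale(person_height_px, person_height_m=1.7,
                    display_px_per_m=1080 / 2.0):
    """Scale factor that makes a captured person appear life-size on screen:
    their physical height divided by the on-screen height that the raw pixels
    would occupy. The 2 m-tall, 1080-pixel-high panel and the 1.7 m person
    are illustrative assumptions."""
    raw_on_screen_m = person_height_px / display_px_per_m
    return person_height_m / raw_on_screen_m

print(round(life_size_scale(person_height_px=700), 2))   # about 1.31
```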
  • In the present system S, a 3D video is displayed on the display 5. More specifically, as described in the previous section, the display 5 displays a composite video obtained by combining the background video, the person video, and the foreground video, and each of these videos is a three-dimensional video (3D video).
  • This 3D image is obtained by executing a rendering process by 3DCG using a 2D image (specifically, an image composed of frame images in RGB format) captured by the camera 2 and depth data about the image. It is obtained by.
  • the rendering process is a surface rendering video display process, which is a process for generating a 3D video when viewed from a virtually set viewpoint.
  • FIG. 6 is an explanatory diagram of a procedure for generating a 3D video. Note that the mesh model in the figure is coarser than the actual mesh size for convenience of illustration. In the following, a case where a 3D video of Mr. A is generated will be described as an example.
  • the video of Mr. A taken by the camera 2 (strictly speaking, the video of Mr. A extracted from the actual video) is a two-dimensional video and is used as a texture in texture mapping.
  • the depth data (that is, the depth map) acquired for the actual video including the video of Mr. A is used to construct a mesh model that forms the skeleton of the 3D video.
  • the mesh model represents a person (Mr. A) with a polygon mesh.
  • As the method of constructing a mesh model from depth data (a depth map), a known method can be used.
  • a stereoscopic image of Mr. A is obtained by pasting a two-dimensional image (specifically, an image of Mr. A) as a texture to the mesh model. That is, it becomes possible to generate a 3D image having a sense of depth.
  • a 3D image is generated by such texture mapping, and further, by performing processing such as movement and rotation, it is possible to acquire a 3D image when the viewpoint is changed. As a result, it is also possible to acquire a 3D image when the face of Mr. A is viewed from below and a 3D image when the face of Mr. A is viewed from the side.
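The mesh construction can be sketched as back-projecting each depth pixel into a 3D vertex and triangulating neighbouring pixels, keeping the pixel coordinates as texture (UV) coordinates. The camera intrinsics fx and fy below are assumed values, and an actual renderer would go on to rasterize this mesh with the RGB frame pasted on as the texture.

```python
import numpy as np

def depth_to_mesh(depth, fx=525.0, fy=525.0):
    """Back-project a depth map into a grid of 3D vertices and triangulate
    neighbouring pixels into a polygon mesh; each vertex keeps its pixel
    coordinate as a UV so the RGB frame can be pasted on as a texture."""
    h, w = depth.shape
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    x = (us - cx) * depth / fx
    y = (vs - cy) * depth / fy
    vertices = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    uvs = np.stack([us / (w - 1), vs / (h - 1)], axis=-1).reshape(-1, 2)
    faces = []
    for v in range(h - 1):                       # two triangles per pixel quad
        for u in range(w - 1):
            i = v * w + u
            faces.append([i, i + 1, i + w])
            faces.append([i + 1, i + w + 1, i + w])
    return vertices, uvs, np.array(faces)

verts, uvs, faces = depth_to_mesh(np.full((4, 5), 1.2))
print(verts.shape, faces.shape)    # (20, 3) (24, 3)
```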
  • a background three-dimensional image is acquired by executing a rendering process by texture mapping using the background image captured by the camera 2 and the depth data acquired for the background image.
  • Similarly, a foreground 3D video is acquired by executing a rendering process by texture mapping using the foreground video captured by the camera 2 (strictly, the foreground video extracted from the real video) and the depth data acquired for the foreground video (strictly, the depth data for the real video that includes the foreground video).
  • In the present embodiment, texture mapping is used; however, the rendering process for acquiring a 3D video is not limited to one using texture mapping, and a rendering process using bump mapping, for example, may also be used.
  • In the depth data, a defective portion, that is, a pixel for which a depth measurement result could not be obtained for some reason, may occur. Such a missing portion is particularly likely to occur near the boundary (edge) between the person video and the background video. For such a missing portion, the two-dimensional video serving as the texture may simply be pasted as it is in the texture mapping, or the surrounding video may be pasted instead. In the latter case, it is sufficient to select, in the texture mapping, a pixel group slightly larger than the pixel group in question and to extract and paste the corresponding two-dimensional video.
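One simple way to achieve a similar effect to pasting the surrounding video over such defective pixels is to copy neighbouring depth values into them before building the mesh; the routine below is an illustrative sketch, not the specific method named in the text.

```python
import numpy as np

def fill_missing_depth(depth, invalid=0.0, iterations=2):
    """Fill pixels with no depth measurement (marked `invalid`) with the mean of
    their valid 4-neighbours, repeated a few times, so the mesh has no holes and
    the texture around the person's edge can still be pasted."""
    d = depth.astype(float).copy()
    for _ in range(iterations):
        mask = d == invalid
        if not mask.any():
            break
        padded = np.pad(d, 1, mode="edge")
        # 4-neighbour values (up, down, left, right) for every pixel
        neighbours = np.stack([padded[:-2, 1:-1], padded[2:, 1:-1],
                               padded[1:-1, :-2], padded[1:-1, 2:]])
        filled = np.where(neighbours != invalid, neighbours, np.nan)
        mean = np.nanmean(filled, axis=0)
        ok = mask & ~np.isnan(mean)
        d[ok] = mean[ok]
    return d

dm = np.array([[1.2, 0.0, 1.2],
               [1.2, 1.2, 0.0],
               [3.0, 3.0, 3.0]])
print(fill_missing_depth(dm))
```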
  • The computer described above functions as the home server 1 when its CPU executes the dialogue program; specifically, it executes a series of data processing operations related to the face-to-face dialogue.
  • the configuration of the home server 1 will be described from the viewpoint of its function, particularly the video display function.
  • FIG. 7 is a diagram showing the configuration of the home server 1 in terms of functions.
  • The home server 1 includes a data transmission unit 11, a data reception unit 12, a background video storage unit 13, a first depth data storage unit 14, a real video storage unit 15, a person video extraction unit 16, a skeleton model storage unit 17, a second depth data storage unit 18, a foreground video extraction unit 19, a height detection unit 20, a 3D video generation unit 21, a composite video display unit 22, a determination unit 23, and a face movement detection unit 24.
  • Each of these data processing units is realized by the hardware devices of the home server 1 (specifically, the CPU, memory, communication interface, hard disk drive, and so on) cooperating with the dialogue program as software.
  • the data transmission unit 11 digitizes the video signal captured by the B-side camera 2 and transmits it to the A-side home server 1 as video data.
  • the types of video data transmitted by the data transmission unit 11 are classified into two types.
  • One is the video data of the background video, that is, data indicating the video of the area within the imaging range of the camera 2 in the room corresponding to the background, captured while Mr. B is not in that room.
  • The other is the video data of the real video, which is captured while Mr. B is in the room; more specifically, it is data indicating the videos of Mr. B, the background, and the foreground.
  • the data transmission unit 11 when transmitting the video data of the background video, the data transmission unit 11 generates depth data about the background video based on the measurement result of the infrared sensor 4 and transmits the depth data together with the video data of the background video. .
  • This depth data is used when executing a rendering process for acquiring a 3D image of the background, and also when specifying a distance (depth distance) between the background and the camera 2.
  • Similarly, when transmitting the video data of the real video, the data transmission unit 11 generates depth data for the real video based on the measurement result of the infrared sensor 4 and transmits it together with the video data of the real video. This depth data is used when extracting the person video (specifically, Mr. B's video) and the foreground video from the real video, and is also used in each of the rendering process for acquiring Mr. B's 3D video and the rendering process for acquiring the foreground 3D video. Furthermore, the depth data is also used when specifying the distance (depth distance) between Mr. B and the camera 2.
  • the data receiving unit 12 receives various data transmitted from the home server 1 on the A side.
  • the data received by the data receiving unit 12 includes video data of the background video, depth data about the background video, and video data of the real video and depth data about the real video.
  • the video data of the background video received by the data receiving unit 12 is data indicating the video of the same room captured when Mr. A is not in the room corresponding to the background.
  • In the present embodiment, the data receiving unit 12 receives the video data of the background video, thereby acquiring the background video captured by the camera 2 on Mr. A's side. In this sense, it can be said that the data receiving unit 12 corresponds to a video acquisition unit.
  • The depth data regarding the background video received by the data receiving unit 12 is used when executing the rendering process for acquiring the background 3D video, and is also used to specify the distance (depth distance) between the background and the camera 2.
  • Hereinafter, the depth data regarding the background video received by the data receiving unit 12 is referred to as "first depth data".
  • the video data of the actual video received by the data receiving unit 12 is data indicating the video of Mr. A, the background, and the foreground captured while Mr. A is in the room.
  • the depth data about the actual video received by the data receiving unit 12 is used when extracting the video of A and the foreground video from the actual video.
  • the depth data is used at the time of each of the rendering process for acquiring Mr. A's 3D video and the rendering process for acquiring the 3D video of the foreground. Further, the depth data is also used when specifying the distance between A and the camera 2 (depth distance) and the distance between the foreground and the camera 2 (depth distance).
  • Hereinafter, the depth data regarding the real video received by the data receiving unit 12 is referred to as "second depth data".
  • the data receiving unit 12 receives the first depth data and the second depth data from the home server 1 on the A side, so that the depth data for the background video, the depth data for the person video, and the foreground Obtain depth data for each video.
  • the data receiving unit 12 corresponds to a distance data acquiring unit that acquires depth data that is distance data.
  • the background video storage unit 13 stores the video data of the background video received by the data receiving unit 12.
  • the first depth data storage unit 14 stores depth data regarding the background video received by the data receiving unit 12, that is, first depth data.
  • the real video storage unit 15 stores the video data of the real video received by the data receiving unit 12.
  • the person video extracting unit 16 expands the video data of the real video received by the data receiving unit 12, and extracts the human video (that is, the video of Mr. A) from the real video.
  • the skeleton model storage unit 17 stores a skeleton model (specifically, Mr. A's skeleton model) used when the person video extraction unit 16 extracts a person video.
  • the second depth data storage unit 18 stores depth data on the actual video received by the data receiving unit 12, that is, second depth data.
  • the person video extraction unit 16 reads the real video from the real video storage unit 15 and the second depth data about the real video from the second depth data storage unit 18 in extracting the video of Mr. A from the real video. Then, the person video extraction unit 16 specifies Mr. A's skeleton model from the read second depth data and the captured video of the camera 2. The identified skeleton model of Mr. A is stored in the skeleton model storage unit 17. Thereafter, the person video extraction unit 16 reads out Mr. A's skeleton model from the skeleton model storage unit 17 and extracts a person video, that is, Mr. A's video from the actual video based on the skeleton model.
  • the person video extraction unit 16 extracts the person video from the actual video, thereby acquiring the video of Mr. A captured by the camera 2 on the A side. In this sense, it can be said that the person video extraction unit 16 corresponds to a video acquisition unit.
  • the foreground video extracting unit 19 develops the video data of the real video received by the data receiving unit 12 and extracts the foreground video from the real video. More specifically, the foreground video extraction unit 19 extracts the real video from the real video storage unit 15 and the second depth data about the real video from the second depth data storage unit 18 when extracting the foreground video from the real video. Are read out respectively. Then, the foreground video extraction unit 19 extracts a pixel group corresponding to the foreground video from the read second depth data.
  • the pixel group corresponding to the foreground image is a pixel group having a depth distance smaller than the pixel group extracted from the second depth data by the person image extraction unit 16 (that is, the pixel group corresponding to the person image). It is.
  • the foreground video extraction unit 19 extracts a video of a portion corresponding to the pixel group from the real video read from the real video storage unit 15 as a foreground video. In this way, the foreground video extraction unit 19 extracts the foreground video from the actual video, thereby acquiring the foreground video captured by the camera A's camera 2. In this sense, it can be said that the foreground video extraction unit 19 corresponds to a video acquisition unit.
  • In the present embodiment, the height detection unit 20 detects Mr. A's eye height based on the data received from the home server 1 on Mr. A's side. More specifically, the height detection unit 20 reads the second depth data from the second depth data storage unit 18 and extracts the pixel group corresponding to the person video from the read second depth data. The height detection unit 20 then identifies the pixel corresponding to the eyes within the extracted pixel group and calculates the eye height from the position of that pixel. The detection result regarding the eye height is passed to the 3D video generation unit 21, which generates the 3D video (in particular, the person's 3D video) in accordance with the detection result. This will be explained in detail in the next section.
  • the method for specifying the eye height is not particularly limited, and a known method can be used.
  • In the present embodiment, the eye height is detected based on the second depth data, but the present invention is not limited to this; for example, the eye height may be detected by analyzing the real video that includes Mr. A's video.
  • the 3D video generation unit 21 executes 3DCG rendering processing to acquire a 3D video. Specifically, the 3D video generation unit 21 uses the background video stored in the background video storage unit 13 and the first depth data regarding the background video stored in the first depth data storage unit 14. The 3D image of the background is generated by executing the rendering process. Note that the 3D video generation unit 21 uses the most recently acquired background video among the background videos stored in the background video storage unit 13 when generating the background 3D video. Similarly, the first depth data acquired most recently is used for the first depth data stored in the first depth data storage unit 14.
  • The 3D video generation unit 21 also executes the rendering process using the person video extracted by the person video extraction unit 16 (specifically, Mr. A's video) and the second depth data stored in the second depth data storage unit 18 (strictly, the data of the pixel group corresponding to the person video within the second depth data), thereby generating the 3D video of the person (Mr. A).
  • Furthermore, the 3D video generation unit 21 generates the foreground 3D video by executing the rendering process using the foreground video extracted by the foreground video extraction unit 19 and the second depth data stored in the second depth data storage unit 18 (strictly, the data of the pixel group corresponding to the foreground video within the second depth data). Note that, in the present system S, a process using texture mapping is executed as the rendering process, as described above.
  • the synthesized video display unit 22 synthesizes the 3D videos of the background, the person, and the foreground generated by the 3D video generation unit 21 and displays the synthesized video on the display 5 on the B side.
  • the composite video display unit 22 selects a video to be included in the composite video, that is, a display range, from the background 3D video generated by the 3D video generation unit 21. Then, the composite video display unit 22 displays the composite video in which Mr. A is positioned in front of the selected display range and the foreground is positioned in front of Mr. A on the B-side display 5.
  • In the present embodiment, during the period in which the composite video display unit 22 displays the composite video on the display 5 (in other words, the period during which the camera 2 on Mr. A's side captures Mr. A's video), the determination unit 23 determines whether the distance between that camera 2 and Mr. A (that is, Mr. A's depth distance) has changed. This determination is made based on the second depth data stored in the second depth data storage unit 18. When the determination unit 23 determines that the depth distance has changed, the determination result is passed to the composite video display unit 22, and the composite video display unit 22 displays on the display 5 a composite video corresponding to the determination result. This will be explained in detail in the next section.
  • Based on the measurement result of the infrared sensor 4, the face movement detection unit 24 generates depth data about the real video captured by the camera 2 on Mr. B's side and, from this depth data, detects whether Mr. B's face has moved sideways. More specifically, during the period in which the composite video display unit 22 displays the composite video on the display 5, the face movement detection unit 24 identifies the pixel group corresponding to Mr. B's video in the depth data and monitors changes in the position of that pixel group. When it recognizes a change in the position of the pixel group, the face movement detection unit 24 detects that Mr. B's face has moved sideways. Here, moving sideways means that Mr. B's face moves in the left-right direction relative to the display 5 on Mr. B's side (the width direction of the display 5).
  • the camera 2 is installed at a height of about 1 m from the floor. Therefore, depending on the height of Mr. A, the height of Mr. A's eyes differs from the height at which the camera 2 is installed. In such a case, the image of Mr. A displayed on the display 5 on the Mr. B side is different from the appearance (image) of Mr. A seen when actually facing Mr. A.
  • For example, when Mr. A's eye height is higher than the installation height of the camera 2, the camera 2 images Mr. A's face from below. Meanwhile, since Mr. A is looking at the display 5 in front of him on his own side, his eyes are facing forward. Under these circumstances, as shown in FIG. 8A, Mr. A's video displayed on the display 5 on Mr. B's side (strictly a 3D video, but simplified in FIG. 8A) becomes an image that appears to look up at Mr. A's face.
  • FIG. 8 is an explanatory diagram of the process for adjusting the eye height, and (A) in the figure shows an image of Mr. A taken from the actual camera position.
  • To address this, the home server 1 on Mr. B's side (strictly speaking, the 3D video generation unit 21 described above) executes a rendering process for acquiring a 3D video of Mr. A as viewed from a virtual viewpoint at the detected eye height.
  • FIG. 8B is a diagram showing the positional relationship between the camera 2 and the eyes of Mr. A.
  • Specifically, the home server 1 on Mr. B's side specifies the difference between the installation height of the camera 2 and Mr. A's eye height (indicated by the symbol H in FIG. 8B). In addition, the home server 1 on Mr. B's side specifies, based on the stored second depth data, the distance between Mr. A and the camera 2 on Mr. A's side (that is, Mr. A's depth distance, indicated by the symbol L in FIG. 8B). From these values, Mr. B's home server 1 determines the imaging direction of a virtual camera (shown by a broken line in FIG. 8B) placed at the height of Mr. A's eyes and calculates the corresponding angle θ. Mr. B's home server 1 then executes a rendering process for acquiring Mr. A's video (3D video) as captured from that virtual camera, using the calculation result of the angle θ.
  • Specifically, in the texture mapping using Mr. A's video captured by the camera 2 (strictly speaking, Mr. A's video extracted from the real video) and the second depth data, which is the depth data for the real video, video processing is performed to displace the viewpoint by an amount corresponding to the calculated angle θ. As a result, Mr. A's 3D video as captured from the virtual camera, in other words Mr. A's 3D video with his eyes facing the front as shown in FIG. 8C, is acquired.
  • FIG. 8C shows a video of Mr. A taken from a virtual camera position.
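The geometry of FIG. 8B suggests that the virtual viewpoint is obtained by tilting the view by an angle whose tangent is H/L; the formula below is that reading, stated as an assumption rather than quoted from the text.

```python
import numpy as np

def virtual_view_pitch(camera_height_m, eye_height_m, depth_distance_m):
    """Pitch angle between the real camera's view of the face and the view from
    a virtual camera placed at Mr. A's eye height: tan(theta) = H / L, with H
    the height difference and L the depth distance."""
    H = eye_height_m - camera_height_m
    return np.degrees(np.arctan2(H, depth_distance_m))

# camera at 1.0 m, eyes at 1.6 m, Mr. A standing 1.5 m from the camera
print(round(virtual_view_pitch(1.0, 1.6, 1.5), 1))   # about 21.8 degrees of tilt
```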
  • Mr. B's home server 1 (strictly speaking, the above-described synthesized video display unit 22) synthesizes the 3D video of Mr. A acquired by the above procedure and the 3D video of the background and foreground. Then, the synthesized video is displayed on the display 5 on the B side.
  • In a conventional video display system, when Mr. B's face moves sideways, the video displayed on the display 5 is switched so as to rotate about a vertical axis in accordance with the change in Mr. B's field of view (the image reflected in Mr. B's eyes).
  • FIG. 9 is a diagram illustrating a configuration example of a conventional video display system, and illustrates how the display video changes in conjunction with the movement of Mr. B who is looking at the display 5.
  • However, when Mr. B's face moves sideways in a scene where Mr. A and Mr. B are actually facing each other, the way Mr. A appears to Mr. B does not rotate as described above; it only shifts horizontally. Moreover, in the conventional system, both Mr. A's image and the background image are rotated by the same rotation amount (rotation angle). As a result, the picture of Mr. A displayed on the display 5 ends up differing from the way Mr. A looks when the two are actually facing each other.
  • FIG. 10A is a diagram schematically illustrating a situation in which Mr. B's face has moved laterally.
  • FIG. 10B is an explanatory diagram regarding the depth distance of each of Mr. A, the background, and the foreground.
  • FIG. 11 is an explanatory diagram showing changes in the composite video when the transition process described later is executed; (A) and (B) show the composite video before and after the transition process, respectively.
  • In the following, the "first direction" is one of the two mutually opposite directions along the width direction (that is, the left-right direction) of the display 5, and the "second direction" is the other direction.
  • Mr. B's home server 1 (strictly speaking, the aforementioned face movement detection unit 24) detects the presence or absence of movement of Mr. B's face while displaying the composite video on the display 5 of Mr. B.
  • Mr. B's home server 1 simultaneously detects the direction of movement and the amount of movement.
  • Mr. B's home server 1 (strictly speaking, the above-described composite video display unit 22) executes a transition process according to the detection result related to Mr. B's face movement.
  • The transition process is a process for transitioning the composite video displayed on the display 5 on Mr. B's side from the state it was in before the lateral movement of Mr. B's face was detected. Specifically, the composite video is shifted to a state in which the display positions of Mr. A's 3D video and the foreground 3D video in the composite video, and the range of the background 3D video included in the composite video (that is, its display range), are shifted in the horizontal direction. A shift amount is set individually for the display position of Mr. A's 3D video, the display position of the foreground 3D video, and the display range of the background 3D video.
  • Each shift amount is set according to the amount of movement x of Mr. B's face and the distance between the camera 2 and the respective subject (Mr. A, the background, or the foreground), that is, the respective depth distance.
  • the movement amount x of Mr. B's face is converted into a movement angle.
  • the movement angle is an angle indicating the amount of change in the line of sight of Mr. B.
  • the line of sight is a virtual straight line from the center position of Mr. B's eyes toward the center of the display 5.
  • the line illustrated by the one-dot chain line corresponds to the line of sight before the face of Mr. B moves
  • the line illustrated by the two-dot chain line corresponds to the line of sight after movement.
  • the acute angle formed by the two line-of-sight lines, that is, the angle ⁇ in FIG. 10A corresponds to the movement angle. It is assumed that the line of sight before Mr. B's face moves is a line along the normal direction of the display screen of the display 5 as shown in FIG. 10A.
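  • The exact conversion from the movement amount x to the movement angle is not reproduced in this excerpt, so the following is only a plausible sketch, assuming Mr. B's distance D from the display screen is known from the depth data:

```python
import math

def movement_angle(lateral_move_x_m, viewer_distance_m):
    """Movement angle theta between the line of sight before and after Mr. B's
    face moves sideways by x, assuming the initial line of sight lies along the
    display normal. theta = atan(x / D) is an assumed relation."""
    return math.degrees(math.atan2(lateral_move_x_m, viewer_distance_m))

# Example: face moves 0.2 m sideways while sitting 1.0 m from the display
print(movement_angle(0.2, 1.0))  # roughly 11.3 degrees
```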
  • Here, the depth distance of each of Mr. A, the background (for example, the wall), and the foreground (for example, the box in front of Mr. A) is assumed to be as shown in FIG. 10B.
  • the depth distance of Mr. A is maintained at a position separated from the camera 2 on the Mr. A side by the reference distance d1, as shown in FIG. 10B.
  • the depth distance of the wall of the room as the background is separated from the camera 2 on the A side by a distance dw.
  • this distance dw is longer than the reference distance d1 which is the depth distance of Mr. A.
  • the depth distance of the box placed in front of Mr. A, which is the foreground is separated from the camera 2 on the Mr. A side by a distance df, as shown in FIG. 10B.
  • this distance df is shorter than the reference distance d1 which is the depth distance of Mr. A.
  • When it is detected that Mr. B's face has moved in the first direction, the composite image transitions to a state in which the display position of Mr. A's 3D image is shifted by the shift amount t1, the display range of the background 3D image is shifted by the shift amount t2, and the display position of the foreground 3D video is shifted by the shift amount t3, each in the second direction. In other words, the display position of Mr. A's video, the display position of the foreground 3D video, and the display range of the background 3D video are all shifted in the second direction.
  • the shift amount t2 with respect to the display range of the background 3D video is larger than the shift amount t1 with respect to the display position of the 3D video of Mr. A.
  • the shift amount t3 with respect to the display position of the 3D video of the foreground is smaller than the shift amount t1 with respect to the display position of the 3D video of Mr. A.
  • This reflects how things look when Mr. B actually interacts with Mr. A face to face: when Mr. B's face moves sideways, objects closer to Mr. B appear to deviate less from their original positions, and objects farther away appear to deviate more.
  • Accordingly, in the transition process, the composite video is shifted so that the display position of Mr. A's 3D image, the display position of the foreground 3D image, and the display range of the background 3D image are each shifted by different shift amounts, with the shift amount t2 for the display range of the background 3D image larger than the shift amount t1 for the display position of Mr. A's 3D image. A minimal numerical sketch of how such shift amounts might be computed is given below.
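  • The equations (2) to (4) referred to later are not reproduced in this excerpt; the sketch below therefore assumes a simple proportional relation in which the shift grows with depth distance, which at least satisfies the stated ordering t3 < t1 < t2. The function name and scale factor are illustrative only.

```python
import math

def shift_amounts(movement_angle_deg, d1, dw, df, scale=1.0):
    """Hypothetical shift amounts for the transition process.
    d1: depth distance of Mr. A, dw: background (wall), df: foreground (box).
    Assumes each shift is proportional to depth * tan(theta); the patent's
    equations (2)-(4) are not shown here."""
    t = math.tan(math.radians(movement_angle_deg))
    t1 = scale * d1 * t   # shift of Mr. A's 3D video display position
    t2 = scale * dw * t   # shift of the background display range (largest)
    t3 = scale * df * t   # shift of the foreground display position (smallest)
    return t1, t2, t3

# Example: 11 degree movement angle, Mr. A at 1.5 m, wall at 3.0 m, box at 0.8 m
print(shift_amounts(11.0, 1.5, 3.0, 0.8))
```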
  • FIG. 12 is a diagram showing a configuration example of a conventional video display system, illustrating that the display size of the video of Mr. A displayed on the display 5 becomes smaller as the depth distance of Mr. A increases.
  • However, when Mr. A and Mr. B are actually facing each other, even if Mr. A moves slightly closer to or away from Mr. B, Mr. A's appearance (size) as seen by Mr. B looks almost unchanged. Therefore, in the present system S, a depth-distance change process is performed in order to reproduce the actual appearance when Mr. A's depth distance changes. As a result, the display size of Mr. A's video (strictly, a three-dimensional image) displayed on Mr. B's display 5 is maintained at life size even after Mr. A's depth distance has changed.
  • The depth-distance change process is performed when Mr. A's depth distance changes during the period in which the composite image is displayed on the display 5 on Mr. B's side (in other words, the period in which Mr. A's camera 2 is capturing the image of Mr. A).
  • Mr. B's home server 1 (strictly, the determination unit 23 described above) determines whether there is a change in the depth distance during the period. When it is determined that the depth distance has changed, Mr. B's home server 1 uses this as a trigger to start the process when the depth distance changes.
  • Mr. B's home server 1 executes an adjustment process for adjusting the display size of Mr. A's 3D video in the composite video.
  • In the adjustment process, first, the depth distance d2 after the change is specified. Thereafter, based on the specified depth distance d2, the display size of Mr. A's video is adjusted so that it returns to the display size it had before Mr. A's position in the depth direction changed, that is, so that it remains life size.
  • Specifically, the display size of Mr. A's video (strictly speaking, the vertical size and horizontal size of the video) is corrected by multiplying it by the ratio of the depth distances (d1 / d2).
  • Mr. B's home server 1 then synthesizes the size-corrected 3D video of Mr. A with the 3D videos of the background and foreground, and displays the synthesized video on the display 5. Accordingly, as shown in FIGS. 13A and 13B, even if Mr. A's depth distance changes, the video of Mr. A is displayed at the display size it had before the depth distance changed. Because the display size of Mr. A's 3D image is adjusted in this way to reflect the appearance when actually facing Mr. A, the realism of the face-to-face dialogue using this system S is further improved.
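  • The following is a literal, minimal sketch of the size correction described above. The text states that the display size is multiplied by the ratio (d1 / d2); whether that ratio is applied to the pre-change life-size dimensions or to the re-rendered dimensions after the change is not spelled out in this excerpt, so the function below simply applies the stated multiplication, and its variable names are illustrative.

```python
def corrected_display_size(width_px, height_px, d1, d2):
    """Apply the correction stated above: multiply the vertical and horizontal
    display size of Mr. A's video by the depth-distance ratio (d1 / d2), where
    d1 is the reference depth distance and d2 the depth distance after the change."""
    ratio = d1 / d2
    return width_px * ratio, height_px * ratio

# Example (illustrative numbers): Mr. A moves from d1 = 1.5 m to d2 = 2.0 m
print(corrected_display_size(520, 1700, 1.5, 2.0))
```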
  • FIG. 13 is an explanatory diagram of the execution result of the adjustment process.
  • FIG. 13A shows the composite image before the depth distance changes, and FIG. 13B shows the composite image after the depth distance has changed and the adjustment process has been performed. In FIG. 13B, for comparison of the display size, the video of Mr. A after the depth distance has changed but before the adjustment process is performed is indicated by a broken line.
  • The video display flow proceeds as the home server 1 on Mr. B's side, which is a computer, performs the steps shown in FIGS. 14 and 15. FIGS. 14 and 15 are diagrams showing the flow of the video display flow.
  • the Mr. B's home server 1 communicates with the Mr. A's home server 1 to receive the video data of the background video and the depth data (first depth data) about the background video. (S001).
  • Mr. B's home server 1 acquires a video of a room used when Mr. A performs a face-to-face conversation as a background video.
  • Mr. B's home server 1 acquires first depth data as data (distance data) indicating the distance between the background and the camera 2.
  • this step S001 is performed while the A-side camera 2 captures only the background video, that is, during a period when there is no A in the room where the face-to-face conversation is performed. Further, the acquired background video and first depth data are stored in the hard disk drive or the like of the home server 1 on the side of Mr. B.
  • Mr. B's home server 1 reads the most recently acquired background video and first depth data out of the stored background video and first depth data, and performs processing by texture mapping as rendering processing using them. Execute. Thereby, Mr. B's home server 1 acquires a background three-dimensional image (S002).
  • When Mr. A enters the room for the face-to-face conversation and starts the conversation, the camera 2 installed in the room captures an image including Mr. A, the background and the foreground, that is, a real image. Then, Mr. A's home server 1 transmits the video data of the actual video captured by the camera 2, and Mr. B's home server 1 receives the video data. Thereby, Mr. B's home server 1 acquires the above-mentioned actual video. In addition, the home server 1 on Mr. A's side transmits the depth data (second depth data) on the real video simultaneously with the transmission of the video data of the real video, and the home server 1 on Mr. B's side receives the second depth data. Thereby, Mr. B's home server 1 acquires the second depth data associated with the actual video (S003). The acquired actual video and second depth data are stored in the hard disk drive or the like of Mr. B's home server 1.
  • Mr. B's home server 1 extracts a person image, specifically Mr. A's image, from the acquired actual image (S004). To do so, Mr. B's home server 1 specifies Mr. A's skeleton model based on the second depth data acquired in the previous step S003 and the captured video of the camera 2, and then extracts Mr. A's video from the actual video based on that model.
  • Mr. B's home server 1 executes rendering processing using the video of Mr. A extracted in the previous step S004 and the second depth data, and specifically executes processing by texture mapping. Thereby, Mr. B's home server 1 acquires a 3D image of the person (Mr. A) (S005).
  • Mr. B's home server 1 extracts the foreground video from the actual video acquired in step S003 based on the second depth data (S006). Thereafter, Mr. B's home server 1 executes a rendering process by texture mapping using the extracted foreground video and the second depth data. Thereby, Mr. B's home server 1 acquires a 3D image of the foreground (S007).
  • After acquiring the 3D images of Mr. A and the foreground, the home server 1 on Mr. B's side synthesizes these 3D images with the portion of the background 3D image acquired in step S002 that falls within the predetermined range (display range) (S008). Then, Mr. B's home server 1 displays the synthesized video on Mr. B's display 5 (S009). Accordingly, on the display 5 on Mr. B's side, Mr. A's 3D image is displayed at life size in front of the background 3D image, and the foreground 3D image is displayed in front of Mr. A's 3D image. A compact sketch of this overall flow (S001 to S009) follows.
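  • The following is a skeleton of the overall flow S001 to S009, with placeholder functions standing in for the actual reception, extraction, and rendering operations; all function names and data structures are illustrative only and not taken from the patent.

```python
# Hypothetical step names; the patent only numbers the steps (S001-S009).
def receive_background(conn):            # S001: background video + first depth data
    return conn["background_video"], conn["first_depth"]

def render_3d(video, depth):             # rendering by texture mapping (S002/S005/S007)
    return {"video": video, "depth": depth}

def extract_person(real_video, depth):   # S004: extract Mr. A via a skeleton model
    return {"person": real_video, "depth": depth}

def extract_foreground(real_video, depth):  # S006
    return {"foreground": real_video, "depth": depth}

def video_display_flow(conn, display):
    bg_video, first_depth = receive_background(conn)                     # S001
    bg_3d = render_3d(bg_video, first_depth)                             # S002
    real_video, second_depth = conn["real_video"], conn["second_depth"]  # S003
    person = extract_person(real_video, second_depth)                    # S004
    person_3d = render_3d(person, second_depth)       # S005 (incl. eye-height adjustment)
    fg = extract_foreground(real_video, second_depth)                    # S006
    fg_3d = render_3d(fg, second_depth)                                  # S007
    composite = {"back": bg_3d, "middle": person_3d, "front": fg_3d}     # S008
    display.append(composite)                                            # S009
    return composite
```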
  • step S005 for acquiring a 3D video of a person will be described in more detail with reference to FIG.
  • FIG. 16 is a diagram illustrating a procedure for acquiring a 3D video of a person.
  • In step S005, first, texture mapping is performed using the video of Mr. A extracted in the previous step S004 and the second depth data (S011). As a result, an image viewed from the position where the camera 2 on Mr. A's side is installed is acquired as a 3D image of Mr. A.
  • Next, based on the first depth data stored in Mr. B's home server 1, the height of Mr. A's eyes is detected (S012). Thereafter, the home server 1 on Mr. B's side compares the detected height of Mr. A's eyes with the installation height of the camera 2 on Mr. A's side (S013). If the two heights differ, the home server 1 on Mr. B's side performs the process for adjusting the eye height (S014). In this process, Mr. B's home server 1 performs a rendering process for acquiring a three-dimensional image of Mr. A viewed from a virtual viewpoint at the same height as the detected eye height of Mr. A.
  • the 3D image acquired in step S011 is subjected to image processing for displacing the viewpoint by a height corresponding to the angle ⁇ calculated by the above-described equation (1).
  • Otherwise, the home server 1 on Mr. B's side uses the 3D video acquired in step S011 as it is in the subsequent steps.
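  • A hedged sketch of the sub-procedure S011 to S014 described above follows; texture_map() and displace_viewpoint() are placeholders for the rendering operations, not calls from any real library, and the arctangent used in place of equation (1) is an assumption.

```python
import math

def generate_person_3d(person_video, second_depth, camera_height_m,
                       eye_height_m, depth_distance_m):
    """Sketch of step S005 (S011-S014)."""
    person_3d = texture_map(person_video, second_depth)           # S011
    detected_eye_height = eye_height_m                             # S012 (assumed given)
    if abs(detected_eye_height - camera_height_m) > 1e-6:          # S013
        theta = math.atan2(detected_eye_height - camera_height_m,
                           depth_distance_m)                       # equation (1) assumed
        person_3d = displace_viewpoint(person_3d, theta)           # S014
    return person_3d

def texture_map(video, depth):
    # placeholder for rendering by texture mapping
    return {"video": video, "depth": depth}

def displace_viewpoint(model, theta_rad):
    # placeholder for the viewpoint-displacement video processing
    return {**model, "viewpoint_angle": theta_rad}
```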
  • Meanwhile, Mr. B's home server 1 acquires the actual image (the image of Mr. B, the background and the foreground) captured by Mr. B's camera 2, and also acquires the depth data (second depth data) of that actual video based on the measurement result from the infrared sensor 4. Based on this depth data, Mr. B's home server 1 determines whether Mr. B's face has moved laterally during the period in which the composite video is displayed on Mr. B's display 5 (S021). When it determines that Mr. B's face has moved sideways, the home server 1 on Mr. B's side specifies the direction and amount of movement of the face based on the depth data before and after the movement (S022).
  • Next, Mr. B's home server 1 specifies the depth distance of each of Mr. A, the background, and the foreground based on the second depth data acquired in step S003 (S023). Thereafter, the home server 1 on Mr. B's side calculates the shift amounts used in the transition process executed in the next step S025, based on the values specified in steps S022 and S023 (S024). More specifically, in step S024, the shift amount t1 for the display position of Mr. A's 3D video in the composite video, the shift amount t2 for the range (display range) of the background 3D video included in the composite video, and the shift amount t3 for the display position of the foreground 3D video in the composite video are each calculated according to the equations (2) to (4) described above.
  • Mr. B's home server 1 executes the transition process after calculating the deviation amount (S025).
  • In the transition process, the composite image displayed on the display 5 transitions from the state it was in before the lateral movement of Mr. B's face was detected. Specifically, when it is detected that Mr. B's face has moved sideways in the first direction, Mr. B's home server 1 shifts the composite video to a state in which the display position of Mr. A's 3D image, the display position of the foreground 3D video, and the display range of the background 3D video are shifted in the second direction by the shift amounts calculated in the previous step S024.
  • Here, the shift amount for the display range of the background 3D video is larger than the shift amount for the display position of Mr. A's 3D video. Further, the shift amount for the display position of the foreground 3D video is smaller than the shift amount for the display position of Mr. A's 3D video.
  • Mr. B's home server 1 then displays, on the display 5, the composite video after the transition processing, that is, the composite image in which the display position of Mr. A's 3D video, the display position of the foreground 3D video, and the display range of the background 3D video have all been shifted from the initial state (S026).
  • the display 5 displays an image that reproduces the appearance when seen from the position of the face of Mr. B after the lateral movement.
  • In particular, the shift amount for the display range of the background 3D video is larger than the shift amount for the display position of Mr. A's 3D video. For this reason, by moving his or her face to the left and right, Mr. B can peek at portions of the background 3D image that were not initially displayed on the display 5. A compact sketch of the runtime sequence S021 to S026 follows.
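  • The following is a compact, illustrative sketch of steps S021 to S026 that reuses the hypothetical helpers detect_lateral_move(), movement_angle(), and shift_amounts() from the earlier sketches in this section; the pixel-to-meter conversion and the sign convention for the shift direction are assumptions, not taken from the patent.

```python
PIXEL_TO_METER = 0.001  # hypothetical conversion from sensor pixels to meters

def transition_step(prev_depth, curr_depth, depths, composite, scale=1.0):
    """Sketch of S021-S026. depths holds the depth distances of Mr. A ("person"),
    the wall ("background"), the box ("foreground"), and Mr. B's distance from
    the display ("viewer"). composite uses the back/middle/front layout from the
    earlier flow sketch; the layers are shifted opposite to the face movement."""
    moved, direction, dx_px = detect_lateral_move(prev_depth, curr_depth)    # S021/S022
    if not moved:
        return composite
    d1, dw, df = depths["person"], depths["background"], depths["foreground"]  # S023
    theta = movement_angle(abs(dx_px) * PIXEL_TO_METER, depths["viewer"])
    t1, t2, t3 = shift_amounts(theta, d1, dw, df, scale)                        # S024
    composite["middle"]["offset_x"] = -direction * t1                           # S025
    composite["back"]["offset_x"] = -direction * t2
    composite["front"]["offset_x"] = -direction * t3
    return composite                                                             # S026 (display)
```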
  • Further, based on the second depth data acquired in step S003, the home server 1 on Mr. B's side determines whether Mr. A's depth distance has changed during the period in which the composite video is displayed on the display 5 on Mr. B's side (S027).
  • the home server 1 on the side of Mr. B identifies the depth distance after the change based on the second depth data after the change (S028).
  • the home server 1 on the B side adjusts the display size of the 3D image of the A in accordance with the identified depth distance after the change (S029).
  • Specifically, Mr. B's home server 1 adjusts the display size so that the three-dimensional image of Mr. A continues to be displayed at life size.
  • Mr. B's home server 1 synthesizes the 3D video of Mr. A after the size adjustment and the 3D video of the background and foreground, and displays the synthesized video on the display 5. (S030). Thereby, even after Mr. A's depth distance changes, Mr. A's three-dimensional image displayed on the display 5 continues to be displayed in a life-size size.
  • <Variation of the video display system> In the configuration of the system S described above, one camera 2 is provided to capture each user's video. That is, in the above embodiment, a single camera 2 captures a user's video, and the display 5 displays a 3D video based on the video captured by that single camera 2. On the other hand, a user's image may instead be captured by a plurality of cameras 2, as in the following modification.
  • FIG. 17 is a diagram schematically illustrating a state in which the image of Mr. A is captured by the two upper and lower cameras 2. Further, the upper and lower two cameras 2 respectively capture the image of Mr. A at different positions. Specifically, the upper camera 2 is installed at a position somewhat higher than Mr. A's height, and the lower camera 2 is installed slightly above the floor surface.
  • In this modification, the video display screen of the display 5 (strictly, the front surface of the touch panel 5a) is used as a reference plane, and the imaging directions of the two upper and lower cameras 2 are tilted with respect to the normal direction of the reference plane.
  • the imaging direction is the optical axis direction of the lens of the camera 2, and the imaging direction of the upper camera 2 is set to a direction that descends as approaching Mr. A. That is, the upper camera 2 images Mr. A's body from above.
  • the imaging direction of the lower camera 2 is set to a direction that rises as it approaches Mr. A. That is, the lower camera 2 images Mr. A's body from below.
  • Mr. A stands at a position separated from the reference position by the reference distance d1.
  • In this state, the upper camera 2 captures an image from Mr. A's head to his waist (hereinafter, the upper body video), and the lower camera 2 captures an image from Mr. A's feet to his abdomen (hereinafter, the lower body video).
  • an infrared sensor 4 is provided for each camera 2. Thereby, it is possible to individually acquire the depth data (strictly, the second depth data) for the video (actual video) captured by each of the upper and lower cameras 2.
  • In the modification, Mr. B's home server 1 acquires Mr. A's video for each camera 2. More specifically, the home server 1 on Mr. A's side transmits the video data of the real video including the upper body video captured by the upper camera 2 and the video data of the real video including the lower body video captured by the lower camera 2. Mr. B's home server 1 acquires these video data and extracts the video of Mr. A, specifically the upper body video and the lower body video, from the actual video indicated by each video data.
  • the Mr. B side home server 1 receives the depth data about the actual image captured by each camera 2 from the Mr. A side home server 1 for each camera. That is, in the modified example, the home server 1 on the side of Mr. B acquires the depth data for the real video including the upper body video and the lower body video of Mr. A for each camera. Further, in the modified example, Mr. B's home server 1 (strictly, the 3D video generation unit 21) generates a 3D video piece for each camera based on the actual video and depth data acquired for each camera. A process, that is, a video piece generation process is performed.
  • In the video piece generation process, the home server 1 on Mr. B's side performs rendering processing using the upper body video of Mr. A obtained from the actual video captured by the upper camera 2 and the depth data about that actual video. As a result, a 3D image piece viewed from the imaging direction of the upper camera 2, specifically the 3D image piece of Mr. A's upper body shown in FIG. 18, is acquired.
  • Likewise, the home server 1 on Mr. B's side performs rendering processing using the lower body video obtained from the real video captured by the lower camera 2 and the depth data about that real video. As a result, a 3D image piece viewed from the imaging direction of the lower camera 2, specifically the 3D image piece of Mr. A's lower body shown in FIG. 18, is acquired.
  • FIG. 18 is a diagram illustrating a 3D image piece generated for each camera and a 3D image of Mr. A generated in a combining step described later.
  • When the installation height of the camera 2 that captures the part including Mr. A's eyes, that is, the upper camera 2, differs from the height of Mr. A's eyes, a process for matching the eye height is performed. Specifically, a rendering process is executed to acquire a 3D image piece of the upper body as viewed from a virtual viewpoint at the height of Mr. A's eyes.
  • FIG. 19 is a diagram showing a procedure for acquiring the 3D video of Mr. A in the modification.
  • Mr. B's home server 1 first performs a video piece generation process (S041) in generating a 3D video of Mr. A.
  • Mr. B's home server 1 generates a 3D video piece for each of the upper body and the lower body of Mr. A by executing a rendering process using texture mapping (S042, S043).
  • Mr. B's home server 1 generates a 3D image piece of the upper body viewed from the upper camera 2 when generating the 3D image piece of the upper body during the image piece generating process.
  • Mr. B's home server 1 then specifies the difference between the installation height of the upper camera 2 and the height of Mr. A's eyes.
  • Mr. B's home server 1 also identifies the distance (depth distance) between Mr. A and the upper camera 2. Furthermore, Mr. B's home server 1 obtains the rotation angle θ used in the subsequent video processing based on these identification results. Then, Mr. B's home server 1 performs video processing on the 3D video piece of the upper body generated in the previous step to displace the viewpoint by a height corresponding to the rotation angle θ. As a result, a 3D image piece as viewed from a virtual viewpoint at the height of Mr. A's eyes is acquired as the 3D image piece of Mr. A's upper body. That is, a 3D image piece of Mr. A's upper body with the line of sight facing the front is acquired.
  • When generating the 3D image piece of the lower body during the image piece generation process, the home server 1 on Mr. B's side performs a video rotation process on the 3D image piece obtained from the actual image captured by the lower camera 2.
  • the lower camera 2 captures an image of the lower body of Mr. A from an imaging direction different from the normal direction of the display screen of the display 5 that is the reference plane.
  • In this case, Mr. B's home server 1 performs texture mapping using the actual video captured by the lower camera 2 (that is, the video captured in the above imaging direction) and the depth data of that actual video, to generate a 3D image piece of Mr. A's lower body.
  • the 3D image piece generated at this stage is a 3D image piece when viewed from the imaging direction of the lower camera 2.
  • Mr. B's home server 1 executes a video rotation process on the 3D video fragment when viewed from the imaging direction of the lower camera 2.
  • This process converts the three-dimensional image piece as viewed from the imaging direction of the lower camera 2 into the three-dimensional image piece as it would be viewed virtually from the normal direction of the display screen of the display 5 serving as the reference plane.
  • the inclination degree of the imaging direction of the lower camera 2 with respect to the normal direction is specified by an angle (inclination angle), and the 3D image piece is rotated by the inclination angle.
  • the 3D image piece when viewed from the normal direction of the reference plane is acquired as the 3D image piece of the lower half of Mr. A.
  • the above video rotation processing is realized by known video processing.
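  • The rotation itself is left to known video processing; as an illustration only, a point-cloud representation of the lower-body piece could be rotated about the horizontal axis by the inclination angle as follows (the axis convention is an assumption):

```python
import numpy as np

def rotate_piece_to_reference_plane(points_xyz, inclination_deg):
    """Rotate a 3D image piece (here an N x 3 array of points) about the
    horizontal axis by the inclination angle of the lower camera 2, so that the
    piece appears as if viewed from the normal direction of the reference plane.
    Axis convention (x: display width, y: height, z: depth) is an assumption."""
    a = np.radians(inclination_deg)
    rot_x = np.array([[1, 0, 0],
                      [0, np.cos(a), -np.sin(a)],
                      [0, np.sin(a),  np.cos(a)]])
    return points_xyz @ rot_x.T

# Example: rotate a few sample points by a 20 degree inclination angle
sample = np.array([[0.0, 1.0, 2.0], [0.1, 0.5, 2.1]])
print(rotate_piece_to_reference_plane(sample, 20.0))
```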
  • After acquiring the 3D image pieces of the upper and lower bodies, Mr. B's home server 1 performs a combining step of joining the 3D image pieces to generate the 3D image of Mr. A (S044).
  • In the combining step, the 3D image pieces of the upper body and the lower body are joined so that the common image area included in each piece (specifically, the area showing Mr. A's abdomen) overlaps. By joining the pieces in this way, the 3D image of Mr. A is completed (S045).
  • Such a 3D image is a 3D image of Mr. A as viewed from the front (in other words, from the normal direction of the reference plane), as shown in FIG. 18.
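  • A rough sketch of the combining step follows, under the assumption that the pieces are point clouds already aligned in a common coordinate system and that the overlapping abdomen band is identified by a height range; the real system's merge logic is not described in this excerpt.

```python
import numpy as np

def combine_pieces(upper_points, lower_points, overlap_y_min, overlap_y_max):
    """Join the upper-body and lower-body 3D image pieces (N x 3 point arrays)
    so that the common image area (the band around Mr. A's abdomen, given here
    by a y-range) is not duplicated: the overlap band is kept from the upper
    piece only. This is an illustrative merge, not the patent's algorithm."""
    keep_lower = lower_points[(lower_points[:, 1] < overlap_y_min) |
                              (lower_points[:, 1] > overlap_y_max)]
    # drop the lower piece's copy of the overlap band, then concatenate
    return np.vstack([upper_points, keep_lower])
```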
  • Mr. B's home server 1 (strictly speaking, the synthesized video display unit 22) synthesizes the 3D video of Mr. A obtained by the above procedure and the 3D video of the background and the foreground.
  • the synthesized video is displayed on the display 5.
  • Thereby, in Mr. A's 3D image, the area in the vicinity of the joined portion of the 3D image pieces (specifically, near the abdomen) is displayed without a sense of incongruity.
  • In contrast, if the acquired 3D image piece of the lower body were simply combined without the video rotation process, then when the resulting 3D image of Mr. A is displayed on the display 5, the vicinity of the portion where the 3D image pieces are joined would appear to be bent (that is, slightly bent forward relative to an upright posture).
  • a process for adjusting the eye height is performed when generating a 3D image piece of the upper body.
  • the depth data is converted into data about an image viewed from the normal direction of the reference plane, and a 3D image piece is generated based on the converted depth data.
  • In Mr. A's 3D video acquired by joining the 3D video pieces, it is therefore possible to suppress the sense of incongruity in which the vicinity of the joined portion appears to be bent.
  • In the above modification, the plurality of cameras 2 are arranged one above the other, but the present invention is not limited to this; for example, two cameras 2 may be arranged side by side in the left-right direction. In that case, a 3D image piece (specifically, a 3D image piece for each of the left and right halves of the body) is generated in the same procedure as described above, and the 3D image pieces are combined with each other to generate Mr. A's 3D image.
  • In the above embodiment, the series of steps relating to video display, strictly speaking, the steps of generating a 3D video for each of the user (for example, Mr. A), the background, and the foreground and synthesizing those 3D videos, is performed by the home server 1 on the side of the second user (for example, Mr. B). However, the present invention is not limited to this, and the series of steps described above may be performed by the home server 1 on the user (Mr. A) side.
  • In the above embodiment, the image of the space captured when no user is present in the space corresponding to the background is used as the background image, but the present invention is not limited to this. The camera 2 may capture the user and the background at the same time, the person video and the background video may then be separated from the actual video, and the separated background video may be used. However, when the background video captured when there is no user is used, there is no omission in the video as described above, so no video complementation is necessary, and the background video can therefore be acquired more easily.
  • In the above embodiment, both the display position of the user's 3D video in the composite video and the range (display range) of the background 3D video included in the composite video are shifted in the transition process. However, the present invention is not limited to this; only one of the display position of the user's 3D video and the display range of the background 3D video may be shifted, and the other may be fixed (not shifted).
  • the shift amount increases in the order of the display position of the foreground 3D image, the display position of Mr. A's 3D image, and the display range of the background 3D image.
  • However, the magnitude relationship of the shift amounts may be different from the above. That is, the shift amount may increase in the order of the display range of the background 3D video, the display position of Mr. A's 3D video, and the display position of the foreground 3D video. More specifically, when Mr. B's face moves sideways while the composite video shown in FIG. 20A is initially displayed on the display 5 on Mr. B's side, the second transition process is performed, and as a result the synthesized video gradually transitions to the state shown in FIG. 20B. FIGS. 20A and 20B are explanatory diagrams relating to the second transition process: FIG. 20A shows the composite video before the second transition process, and FIG. 20B shows the composite video after it.
  • Furthermore, it may be possible to switch between a mode in which the shift amount of the display range of the background 3D video is the largest (corresponding to the transition process described above) and a mode in which the shift amount of the display position of the foreground 3D video is the largest (corresponding to the second transition process). In such a case, the appropriate transition process is executed according to the direction of Mr. B's line of sight at that time.

Abstract

In a case where the installation height of a camera differs from the eye height of a user displayed on a display, the present invention improves the realistic sensation of a conversation to be carried out while an image of the user is being displayed on the display. This image display system obtains an image of the user captured by a camera, divides the image into a prescribed number of image pieces, obtains distance data indicating the distance between the camera and an object in an image piece for each of the image pieces, and executes a rendering process using the image of the user and the distance data to create a three-dimensional image of the user. The eye height of the user is detected, and when the detected eye height differs from the installation height of the camera, the rendering process which is for obtaining the three-dimensional image of the user when viewed from a virtual visual point located at the detected eye height is executed on the basis of the difference between the two heights and the distance between the camera and the user.

Description

映像表示システム及び映像表示方法Video display system and video display method
 本発明は、映像表示システム及び映像表示方法に係り、特に、遠隔地に居る対話相手の映像を対話者側のディスプレイに表示させる映像表示システム及び映像表示方法に関する。 The present invention relates to an image display system and an image display method, and more particularly, to an image display system and an image display method for displaying an image of a conversation partner in a remote place on a display of a conversation person.
 互いに離れた空間に居るユーザ同士がお互いの映像を見ながら対話することを実現する通信システム(以下、映像表示システム)は、既に知られている。同システムでは、一方のユーザ側から映像の映像データが送信され、他方のユーザ側で当該映像データを受信して展開する。これにより、一方のユーザの映像が他方のユーザ側のディスプレイに表示されるようになる。この結果、ディスプレイにてお互いの映像を見ているユーザ同士は、あたかも相手と対面しているかのように感じるようになる。 A communication system (hereinafter referred to as a video display system) that enables users in remote spaces to interact while watching each other's video is already known. In this system, video data of video is transmitted from one user side, and the video data is received and expanded on the other user side. Thereby, the video of one user comes to be displayed on the display of the other user. As a result, users watching each other's images on the display feel as if they are facing each other.
 また、上記の映像表示システムの中には、テクスチャマッピング等を利用して撮像映像を三次元化して表示するシステムが存在する(例えば、特許文献1参照)。このように三次元化された映像(以下、三次元映像)を表示することで、ディスプレイに相手の映像を表示しながら行う対話の臨場感を一層向上させることが可能となる。 Also, among the video display systems described above, there is a system that displays captured video in three dimensions using texture mapping or the like (see, for example, Patent Document 1). By displaying the three-dimensional video (hereinafter, three-dimensional video) in this way, it is possible to further improve the realism of the dialogue performed while displaying the other party's video on the display.
 さらに、上記の映像表示システムの中には、対話の臨場感をより一層高める目的から、ディスプレイを見ている者の目線とディスプレイに映し出された者の目線とを一致させることが可能なシステムが存在する(例えば、特許文献1乃至3参照)。具体的に説明すると、特許文献1及び2に記載のシステムでは、目線の位置が一致するようにカメラの設置位置が予め適当に決められている。また、特許文献3に記載のシステムでは、ディスプレイに映し出される者の映像を撮像するカメラの位置を、ディスプレイを見ている者の目の高さに応じて上下動させることで両者の目線を一致させる。 Furthermore, among the above video display systems, there is a system that can match the eyes of the person who is looking at the display and the eyes of the person shown on the display for the purpose of further enhancing the realism of the dialogue. Exists (for example, see Patent Documents 1 to 3). Specifically, in the systems described in Patent Documents 1 and 2, the installation position of the camera is appropriately determined in advance so that the positions of the line of sight coincide. Moreover, in the system described in Patent Document 3, the position of the camera that captures the image of the person shown on the display is moved up and down according to the height of the eyes of the person who is looking at the display, thereby matching the eyes of both. Let
特開2014-86774号公報JP 2014-86774 A 特開2000-32420号公報JP 2000-32420 A 特表2014-522622号公報Special table 2014-522622 gazette
 しかしながら、特許文献1及び2に記載のシステムでは、目線の位置が一致するようにカメラの設置位置を決めるので、目線の高さが制限されてしまうことになる。つまり、撮像カメラの設置位置が固定されているため、その設置位置とは異なる高さに目線がある者にとっては利用し難いシステムとなる(具体的には、目線の位置が一致しなくなる)。一方、特許文献3に記載のシステムでは、ディスプレイを見ている者の目の高さに応じてカメラの位置を調整可能であるため、様々な目の高さに対応し得るものの、カメラ位置の調整機構を設ける必要があるため、システム構築コストが割高となってしまう。 However, in the systems described in Patent Documents 1 and 2, the installation position of the camera is determined so that the positions of the eyes coincide with each other, so that the height of the eyes is limited. That is, since the installation position of the imaging camera is fixed, it becomes a system that is difficult for a person who has a line of sight at a height different from the installation position (specifically, the position of the line of sight does not match). On the other hand, in the system described in Patent Document 3, the position of the camera can be adjusted according to the height of the eyes of the person viewing the display. Since it is necessary to provide an adjustment mechanism, the system construction cost is expensive.
 また、映像表示システムを用いた対話の臨場感について更なる向上を図るためには、ディスプレイを見ている者の動き(特に顔の動き)やディスプレイに映っている者の位置の変化に追従させるように、ディスプレイの映像を切り替える必要がある。具体的に説明すると、ディスプレイを見ている者の顔が横移動したとき、その者が対話相手と実際に対面している場面で顔を横に動かしたときの見え方、を反映して表示映像を変えるのが望ましい。 In addition, in order to further improve the realism of dialogue using the video display system, the movement of the person watching the display (especially the movement of the face) and the change in the position of the person shown on the display are followed. Thus, it is necessary to switch the video on the display. Specifically, when the face of the person watching the display moves sideways, it reflects the appearance of the person moving the face sideways in a situation where the person is actually facing the conversation partner. It is desirable to change the image.
 また、カメラの被写体がカメラから離れるほど、ディスプレイに写る当該被写体の映像の表示サイズは、より小さくなってしまう。ところが、実際に対面しながら対話を行っている場面において、その当事者のうちの一方の者に対して他方の者が多少離れたときの当該他方の者の姿(大きさ)は、上記一方の者の見え方(見た目)では殆ど変化しないように見える。このような見え方を考慮し、被写体とカメラとの間の距離、すなわち奥行距離が変化したときには当該被写体の映像の表示サイズを調整するのが望ましい。 Also, the farther the camera subject is from the camera, the smaller the display size of the subject image on the display. However, in a situation where the conversation is actually conducted while facing each other, the figure (size) of the other person when the other person is slightly separated from one of the parties is It seems that there is almost no change in how people look (look). In consideration of such appearance, it is desirable to adjust the display size of the image of the subject when the distance between the subject and the camera, that is, the depth distance changes.
 そこで、本発明は、上記の課題に鑑みてなされたものであり、その目的とするところは、ディスプレイに映し出されるユーザの目の高さと撮像装置の設置高さとが異なる場合において、ディスプレイに上記ユーザの映像を表示させながら行われる対話の臨場感を向上させることが可能な映像表示システム及び映像表示方法を提供することである。
 また、本発明の他の目的は、ディスプレイに映し出されるユーザの映像を見ている第二のユーザの顔が横移動したときに、実際の見え方を反映してディスプレイの表示映像を変化させることである。さらに、本発明の第三の目的は、ディスプレイに映し出されるユーザの映像の表示サイズを、当該ユーザと撮像装置との間の距離が変化した際に適切に調整することである。
Therefore, the present invention has been made in view of the above problems, and the object of the present invention is to display the user on the display when the height of the user's eyes projected on the display is different from the installation height of the imaging device. It is an object to provide a video display system and a video display method capable of improving the realistic sensation of a dialogue performed while displaying the video.
Another object of the present invention is to change the display image on the display to reflect the actual appearance when the face of the second user who is watching the user image displayed on the display moves sideways. It is. A third object of the present invention is to appropriately adjust the display size of the user's video displayed on the display when the distance between the user and the imaging device changes.
 前記課題は、本発明の映像表示システムによれば、(A)撮像装置により撮像されたユーザの映像を取得する映像取得部と、(B)前記映像を所定数の映像片に分割した際の該映像片毎に、前記撮像装置から前記映像片中の対象物との間の距離を示した距離データを取得する距離データ取得部と、(C)前記ユーザの映像及び前記距離データを用いたレンダリング処理を実行することによって前記ユーザの三次元映像を生成する三次元映像生成部と、(D)前記ユーザの目の高さを検知する高さ検知部と、を有し、(E)前記撮像装置が設置されている高さ及び前記高さ検知部が検知した前記目の高さの双方が異なるとき、前記三次元映像生成部は、前記双方の差及び前記撮像装置と前記ユーザとの間の距離に基づいて、前記高さ検知部が検知した前記目の高さにある仮想的な視点から見たときの前記ユーザの前記三次元映像を取得するための前記レンダリング処理を実行することにより解決される。 According to the video display system of the present invention, (A) a video acquisition unit that acquires a video of a user captured by an imaging device, and (B) when the video is divided into a predetermined number of video pieces. For each video piece, a distance data acquisition unit that acquires distance data indicating a distance between the imaging device and the object in the video piece, and (C) the user's video and the distance data are used. A 3D image generation unit that generates a 3D image of the user by executing a rendering process; and (D) a height detection unit that detects the height of the user's eyes, and (E) When both the height at which the imaging device is installed and the height of the eyes detected by the height detection unit are different, the 3D image generation unit determines the difference between the two and the imaging device and the user. Based on the distance between, the height detector It is solved by executing the rendering process for acquiring the three-dimensional image of the user when viewed from a virtual viewpoint in the eye of the height and knowledge.
 上記の構成によれば、撮像装置により撮像されたユーザの映像、及び、当該ユーザの映像について取得した距離データを用いたレンダリング処理を実行することでユーザの三次元映像を生成する。また、ユーザの目の高さと、撮像装置が設置されている高さと、が異なる場合には、ユーザの目の高さと同じ高さにある仮想的な視点から見たときのユーザの三次元映像を取得するように、レンダリング処理を実行する。このように3DCG技術としてのレンダリング処理によって、上記ユーザの目の高さと同じ高さから仮想的に見たユーザの三次元映像を得ることで、双方の高さが異なる場合にも、ディスプレイを見ている者の目線とディスプレイに映し出される者の目線とを合わせることが可能となる。これにより、ディスプレイにユーザの映像を表示させながら行われる対話の臨場感を向上させることが可能となる。 According to the above configuration, the user's 3D video is generated by executing the rendering process using the user's video captured by the imaging device and the distance data acquired for the user's video. If the height of the user's eyes is different from the height at which the imaging device is installed, the user's three-dimensional image when viewed from a virtual viewpoint at the same height as the user's eyes Execute the rendering process so that In this way, by rendering processing as 3DCG technology, a 3D image of a user viewed virtually from the same height as the user's eyes is obtained, so that even if the heights of both are different, the display can be viewed. It is possible to match the eyes of the person who is present and the eyes of the person shown on the display. As a result, it is possible to improve the realism of the dialogue performed while displaying the user's video on the display.
 また、上記の映像表示システムにおいて、前記映像取得部は、前記撮像装置により撮像された前記ユーザの映像、及び、前記撮像装置により撮像された背景の映像をそれぞれ取得し、前記距離データ取得部は、前記ユーザの映像及び前記背景の映像のそれぞれについて、前記距離データを取得し、前記三次元映像生成部は、前記ユーザの映像及び当該ユーザの映像について取得された前記距離データを用いた前記レンダリング処理を実行することによって前記ユーザの前記三次元映像を生成すると共に、前記背景の映像及び当該背景の映像について取得された前記距離データを用いた前記レンダリング処理を実行することによって前記背景の前記三次元映像を生成し、前記ユーザの前記三次元映像と前記背景の前記三次元映像とを合成し、前記背景の手前に前記ユーザが位置した合成映像をディスプレイに表示させる合成映像表示部を有すると、好適である。
 上記の構成では、ユーザの三次元映像及び背景の三次元映像を合成し、背景の手前にユーザが位置した合成映像を表示する。このような奥行感を有する合成映像が表示されることで、ディスプレイにユーザの映像を表示させながら行われる対話の臨場感がより向上することになる。
In the video display system, the video acquisition unit acquires the user's video captured by the imaging device and the background video captured by the imaging device, and the distance data acquisition unit includes: The distance data is acquired for each of the user image and the background image, and the 3D image generation unit uses the distance data acquired for the user image and the user image. Generating the 3D image of the user by executing a process, and executing the rendering process using the background data and the distance data acquired for the background image, Generating an original video, combining the 3D video of the user and the 3D video of the background, When having a combined image display unit for displaying a combined image in which the user is positioned in front of the background to the display, which is preferable.
In the above configuration, the 3D video of the user and the 3D video of the background are synthesized, and the synthesized video in which the user is positioned in front of the background is displayed. By displaying the composite video having such a feeling of depth, the realism of the dialogue performed while displaying the user's video on the display is further improved.
 また、上記の映像表示システムにおいて、前記映像取得部は、前記撮像装置により撮像された前景の映像を更に取得し、前記距離データ取得部は、前記前景の映像についての前記距離データを更に取得し、前記三次元映像生成部は、前記前景の映像及び当該前景の映像について取得された前記距離データを用いた前記レンダリング処理を実行することによって前記前景の前記三次元映像を更に生成し、前記合成映像表示部は、前記ユーザの前記三次元映像と前記背景の前記三次元映像と前記前景の前記三次元映像とを合成し、前記背景の手前に前記ユーザが位置し、かつ、前記ユーザの手前に前記前景が位置している前記合成映像を前記ディスプレイに表示させると、より好適である。
 上記の構成では、ユーザの三次元映像及び背景の三次元映像に加えて、前景の三次元映像を更に合成し、ユーザの手前に前景が位置した合成映像を表示する。これにより、より一層奥行感を有する合成映像が表示されるようになる。この結果、ディスプレイにユーザの映像を表示させながら行われる対話の臨場感が一段と向上することになる。
In the video display system, the video acquisition unit further acquires a foreground video captured by the imaging device, and the distance data acquisition unit further acquires the distance data regarding the foreground video. The 3D image generation unit further generates the 3D image of the foreground by executing the rendering process using the distance data acquired for the foreground image and the foreground image, and the composition The video display unit synthesizes the 3D video of the user, the 3D video of the background, and the 3D video of the foreground, the user is positioned in front of the background, and It is more preferable to display the composite image in which the foreground is located on the display.
In the above configuration, in addition to the 3D video of the user and the 3D video of the background, the 3D video of the foreground is further synthesized, and the synthesized video with the foreground positioned in front of the user is displayed. As a result, a composite image having a greater sense of depth is displayed. As a result, the realism of the dialogue performed while displaying the user's video on the display is further improved.
 また、上記の映像表示システムにおいて、前記距離データに基づいて、前記撮像装置と前記ユーザとの間の距離が変化したかどうかを判定する判定部を備え、前記撮像装置が前記ユーザの映像を撮像している間に、前記撮像装置と前記ユーザとの間の距離が変化したと前記判定部が判定したとき、前記合成映像表示部は、前記合成映像における前記ユーザの映像の表示サイズを、前記撮像装置と前記ユーザとの間の距離が変化する前の前記表示サイズとなるように調整すると、更に好適である。
 上記の構成によれば、撮像装置とユーザとの間の距離、すなわち奥行距離が変化したとしても、ディスプレイには、変化前の表示サイズのままでユーザの三次元映像が表示されることになる。すなわち、ユーザの奥行距離が変化した場合、変化後の合成映像は、実際にユーザと対面して当該ユーザを見たときの見え方(すなわち、自らの視覚を通じて認識したユーザの大きさ)を反映した表示サイズにてユーザの三次元映像を表示したものとなる。この結果、ディスプレイにユーザの映像を表示させながら行われる対話の臨場感がより一層向上することになる。
The video display system may further include a determination unit that determines whether the distance between the imaging device and the user has changed based on the distance data, and the imaging device captures an image of the user. When the determination unit determines that the distance between the imaging device and the user has changed, the composite video display unit displays the display size of the user video in the composite video, It is more preferable to adjust the display size before the distance between the imaging device and the user is changed.
According to the above configuration, even when the distance between the imaging device and the user, that is, the depth distance changes, the display displays the user's three-dimensional video with the display size before the change. . That is, when the user's depth distance changes, the composite image after the change reflects how it looks when the user actually sees the user (ie, the size of the user recognized through his / her own vision). The 3D image of the user is displayed at the display size. As a result, the realism of the dialogue performed while displaying the user's video on the display is further improved.
 また、上記の映像表示システムにおいて、前記ディスプレイに表示された前記合成映像を見る第二のユーザの顔が前記ディスプレイの幅方向に移動したことを検知する顔移動検知部を有し、該顔移動検知部が前記顔の移動を検知したとき、前記合成映像表示部は、前記ディスプレイに表示されている前記合成映像を、前記顔移動検知部が前記顔の移動を検知する前の状態から遷移させる遷移処理を実行し、該遷移処理では、前記合成映像における前記ユーザの前記三次元映像の表示位置、及び、前記背景の前記三次元映像の中で前記合成映像中に含まれる範囲のうちの一方を、他方のずれ量よりも大きいずれ量だけ前記幅方向に沿ってずらした状態へ前記合成映像を遷移させると、より一層好適である。
 上記の構成によれば、ユーザの映像及び背景の映像を合成して得られる合成映像において、ユーザの映像及び背景の映像のそれぞれの表示位置や表示サイズ等を個別に調整することが可能である。そして、第二のユーザの顔が横移動したときには、ユーザの三次元映像の表示位置、及び、背景の三次元映像の中で合成映像中に含まれる範囲のうちの一方を、他方のずれ量よりも大きいずれ量だけ横方向にずらした状態へ合成映像を遷移させることとしている。これにより、第二のユーザの顔が横移動した後のディスプレイには、移動後の顔の位置から実際にユーザと対面して当該ユーザを見たときの見え方、を再現した映像が表示されるようになる。この結果、ディスプレイにユーザの映像を表示させながら行われる対話の臨場感が、更に向上することとなる。
The video display system may further include a face movement detection unit that detects that the face of a second user who views the composite video displayed on the display has moved in the width direction of the display. When the detection unit detects the movement of the face, the composite video display unit transitions the composite video displayed on the display from a state before the face movement detection unit detects the movement of the face. A transition process is performed, and in the transition process, one of a display position of the 3D video of the user in the composite video and a range included in the composite video in the background 3D video It is even more preferable that the composite video is shifted to a state in which it is shifted along the width direction by any amount larger than the other shift amount.
According to the above configuration, in the synthesized video obtained by synthesizing the user video and the background video, it is possible to individually adjust the display position, the display size, and the like of each of the user video and the background video. . Then, when the second user's face moves laterally, one of the display position of the user's 3D video and the range included in the synthesized video in the background 3D video is changed to the other shift amount. The synthesized video is transitioned to a state that is shifted in the horizontal direction by an amount larger than that. As a result, on the display after the second user's face has moved sideways, an image reproducing the appearance when the user is actually seen from the position of the face after moving is displayed. Become so. As a result, the realism of the dialogue performed while displaying the user's video on the display is further improved.
 また、上記の映像表示システムにおいて、前記映像取得部は、互いに異なる撮像方向にて前記ユーザの映像を撮像する複数の前記撮像装置により撮像された前記ユーザの映像を、前記撮像装置別に取得し、前記距離データ取得部は、前記ユーザの映像についての前記距離データを前記撮像装置別に取得し、前記三次元映像生成部は、前記撮像装置別に取得された前記ユーザの映像と、前記撮像装置別に取得された前記距離データと、に基づいて、前記撮像装置別の前記ユーザの三次元映像片を生成する映像片生成工程と、前記ユーザの前記三次元映像を生成するために、前記撮像装置別の前記ユーザの前記三次元映像片の各々を、当該各々に含まれる共通の映像領域同士が重なり合うように結合する結合工程と、を行い、前記映像片生成工程において前記ユーザの目を含む部分の前記三次元映像片を生成する際、前記双方が異なるときには、前記双方の差及び前記撮像装置と前記ユーザとの間の距離に基づいて、前記仮想的な視点から見たときの前記三次元映像片を取得するための前記レンダリング処理を実行すると、尚好適である。
 上記の構成によれば、互いに撮像方向が異なる複数の撮像装置によってユーザの映像を撮像する場合に、撮像装置別に三次元映像片を生成し、最終的に三次元映像片同士を結合してユーザの三次元映像を取得する。一方、撮像装置別に生成される三次元映像片のうち、ユーザの目を含む部分の三次元映像片を生成する際には、ユーザの目の高さにある仮想的な視点から見たときの三次元映像を取得するためのレンダリング処理を実行する。これにより、三次元映像片同士を結合してなるユーザの三次元映像をディスプレイに表示すれば、当該ユーザの目線とディスプレイを見ている者の目線とを合わせることが可能となる。
In the video display system, the video acquisition unit acquires the video of the user captured by the plurality of imaging devices that capture the video of the user in different imaging directions, for each imaging device, The distance data acquisition unit acquires the distance data regarding the video of the user for each imaging device, and the 3D video generation unit acquires the video of the user acquired for the imaging device and the imaging device. An image piece generating step for generating the user's 3D image piece for each image pickup device based on the distance data, and for generating the 3D image of the user for each image pickup device. A step of combining each of the user's three-dimensional video pieces so that common video regions included in the respective pieces overlap each other, and the video piece generation step When the three-dimensional image piece of the portion including the user's eyes is generated, if the two are different, the virtual image is based on the difference between the two and the distance between the imaging device and the user. It is more preferable that the rendering process for acquiring the 3D video piece when viewed from the viewpoint is executed.
According to the above configuration, when a user's video is captured by a plurality of imaging devices having different imaging directions, a 3D video piece is generated for each imaging device, and finally the 3D video pieces are connected to each other. Get 3D video. On the other hand, among the 3D image pieces generated for each imaging device, when generating a 3D image piece of a part including the user's eyes, it is as seen from a virtual viewpoint at the height of the user's eyes. A rendering process for acquiring a 3D image is executed. As a result, if a 3D video of a user formed by combining 3D video pieces is displayed on the display, the user's line of sight and the line of sight of the person watching the display can be matched.
 また、上記の映像表示システムにおいて、前記撮像方向が基準面の法線方向と異なるとき、前記三次元映像生成部は、前記映像片生成工程において、前記撮像方向にて撮像した映像に基づいて生成した前記ユーザの前記三次元映像片を、前記法線方向から仮想的に見た場合の前記三次元映像片へ変換すると、益々好適である。
 上記の構成では、基準面の法線方向と異なる撮像方向にてユーザの映像を撮像し、その映像から三次元映像片を生成する場合に、上記の撮像方向にて撮像した映像に基づいて生成したユーザの三次元映像片を、上記の法線方向から仮想的に見た場合の三次元映像片へ変換する。そして、変換後の三次元映像片を用いてユーザの三次元映像を取得する。このようにして得られた三次元映像は、上記の法線方向から見たときの映像となっており、ディスプレイに表示した際に適切に表示されるようになる。具体的に説明すると、ユーザの三次元映像中、三次元映像片同士を結合した部分付近が屈曲しているかのように見えてしまうのを抑制することが可能となる。
In the video display system, when the imaging direction is different from the normal direction of the reference plane, the 3D video generation unit generates the video piece based on the video captured in the imaging direction in the video piece generation step. It is more preferable to convert the 3D image piece of the user into the 3D image piece when virtually viewed from the normal direction.
In the above configuration, when a user's video is captured in an imaging direction different from the normal direction of the reference plane, and a 3D video piece is generated from the video, it is generated based on the video captured in the imaging direction. The user's 3D video piece is converted into a 3D video piece viewed virtually from the normal direction. And a user's 3D image | video is acquired using the converted 3D image piece. The three-dimensional image thus obtained is an image when viewed from the normal direction, and is appropriately displayed when displayed on the display. More specifically, it becomes possible to suppress the vicinity of the portion where the 3D video pieces are joined in the 3D video of the user from appearing to be bent.
 また、前述した課題は、本発明の映像表示方法によれば、(A)コンピュータが、撮像装置により撮像されたユーザの映像を取得することと、(B)コンピュータが、前記映像を所定数の映像片に分割した際の該映像片毎に、前記撮像装置から前記映像片中の対象物との間の距離を示した距離データを取得することと、(C)コンピュータが、前記ユーザの映像及び前記距離データを用いたレンダリング処理を実行することによって前記ユーザの三次元映像を生成することと、(D)コンピュータが、前記ユーザの目の高さを検知することと、を有し、(E)前記撮像装置が設置されている高さ及び検知した前記目の高さの双方が異なるとき、コンピュータは、前記双方の差及び前記撮像装置と前記ユーザとの間の距離に基づいて、検知した前記目の高さにある仮想的な視点から見たときの前記ユーザの前記三次元映像を取得するための前記レンダリング処理を実行することにより解決される。
 上記の方法によれば、ユーザの目の高さと撮像装置の設置高さとが異なっていても、ディスプレイを見ている者の目線とディスプレイに映し出される者(すなわち、ユーザ)の目線とを合わせることが可能となる。これにより、ディスプレイにユーザの映像を表示させながら行われる対話の臨場感を向上させることが可能となる。
In addition, according to the video display method of the present invention, the above-described problem is that (A) the computer acquires the video of the user captured by the imaging device, and (B) the computer displays the video for a predetermined number of times. Obtaining distance data indicating a distance from the object in the video piece from the imaging device for each video piece when divided into video pieces; And generating a 3D video of the user by executing a rendering process using the distance data, and (D) a computer detecting the eye height of the user, E) When both the height at which the imaging device is installed and the detected eye height are different, the computer detects based on the difference between the two and the distance between the imaging device and the user. did It is solved by executing the rendering process for acquiring the three-dimensional image of the user when viewed from a virtual viewpoint at the height of the serial eyes.
According to the above method, even if the eye height of the user is different from the installation height of the imaging device, the line of sight of the person watching the display and the line of sight of the person shown on the display (that is, the user) are matched. Is possible. As a result, it is possible to improve the realism of the dialogue performed while displaying the user's video on the display.
 According to the video display system and the video display method of the present invention, even if the eye height of the user differs from the installation height of the imaging device, the line of sight of the person viewing the display (that is, the second user) can be aligned with the line of sight of the person shown on the display (that is, the user). In addition, when the face of the second user moves sideways, the video displayed on the display can be transitioned to a video that reproduces how the user would appear when actually viewed face to face from the new position of the face. Furthermore, when the distance between the imaging device and the user (the depth distance) changes, the display size of the user's three-dimensional video in the composite video displayed on the display is adjusted so as to match the display size before the depth distance changed. As a result, the user's three-dimensional video can be displayed at the size perceived when actually facing that conversation partner (that is, the size of the conversation partner as recognized through the viewer's own vision).
 Through the above operations, the video display system and the video display method of the present invention make it possible to enhance the sense of presence (reality) of a conversation conducted while the user's video is displayed on the display.
A diagram showing the configuration of a video display system according to one embodiment of the present invention.
A diagram showing the arrangement positions of the system constituent devices installed in each user's room.
(A) and (B) of Fig. 3 are diagrams showing an example of the display of the present invention.
An explanatory diagram of the video composition procedure.
An explanatory diagram of the procedure for extracting a person video from a real video.
An explanatory diagram of the procedure for generating a three-dimensional video.
A diagram showing the configuration of the home server owned by each user in terms of its functions.
An explanatory diagram of the procedure for matching the eye height in the user's three-dimensional video, in which (A) shows the video captured from the actual camera position, (B) shows the positional relationship between the camera and the user's line of sight, and (C) shows the video as captured from a virtual camera position.
A diagram showing a configuration example of a conventional video display system, illustrating how the displayed video changes in conjunction with the movement of the person viewing the display.
A diagram schematically showing a situation in which the second user's face has moved sideways.
An explanatory diagram of the respective depth distances of the user, the background, and the foreground.
An explanatory diagram showing the change in the composite video when the transition processing is executed, in which (A) shows the composite video before the transition processing and (B) shows the composite video after the transition processing.
A diagram showing a configuration example of a conventional video display system, illustrating how the display size of the user's video changes according to the user's depth distance.
An explanatory diagram of the adjustment of the video display size, in which (A) shows the composite video before the user's depth distance changes and (B) shows the composite video at the stage where size adjustment has been performed after the depth distance changed.
A diagram showing the flow of the video display process (part 1).
A diagram showing the flow of the video display process (part 2).
A diagram showing the procedure for acquiring a three-dimensional video of a person.
A diagram schematically showing how a user's video is captured by a plurality of cameras.
A diagram showing three-dimensional video pieces generated for each camera and a three-dimensional video formed by joining the three-dimensional video pieces.
A diagram showing the procedure for acquiring a three-dimensional video of a person in a modified example.
An explanatory diagram of the second transition processing, in which (A) shows the composite video before the second transition processing and (B) shows the composite video after the second transition processing.
 Hereinafter, an embodiment of the present invention (hereinafter, the present embodiment) will be described with reference to the drawings. The video display system according to the present embodiment (hereinafter, the system S) is used so that users in rooms remote from each other can converse while seeing each other's appearance (video). More specifically, a display serving as a video display device is installed in the room where each user is present, and the other party's video is shown on this display. As a result, each user regards the display as a pane of glass (for example, a window pane or a door pane) and feels as if conversing with the other party face to face through the glass.
 The system S is intended to be used when each user is at his or her own home. That is, the system S is used so that each user can converse with a conversation partner (a pseudo face-to-face conversation, hereinafter simply referred to as a "face-to-face conversation") while staying at home. However, the system S is not limited to this, and may also be used when a user is somewhere other than home, for example at a meeting hall, a commercial facility, a public facility such as a school classroom, a cram school, or a hospital, or at a company or an office. The system S may also be used so that persons in rooms apart from each other within the same building can hold a face-to-face conversation.
 Hereinafter, in order to explain the system S in an easy-to-understand manner, a case where two users hold a face-to-face conversation using the system S is taken as an example; one user is referred to as Mr. A and the other as Mr. B. In the following, the configuration of the system S is described from Mr. B's point of view, that is, from the standpoint of viewing Mr. A's video. In other words, Mr. A corresponds to the "user" and Mr. B corresponds to the "second user". However, "user" and "second user" are relative concepts that switch according to who is viewing and who is being viewed; when Mr. A's point of view is taken as the reference, Mr. B corresponds to the "user" and Mr. A corresponds to the "second user".
 <<Basic Configuration of the System>>
 First, the basic configuration of the system S will be described. The system S is used so that two users (namely, Mr. A and Mr. B) can hold a face-to-face conversation while viewing each other's video; more specifically, it displays a life-size video of the conversation partner to each user and reproduces the conversation partner's voice. To obtain this audiovisual effect, each user owns a communication unit 100. In other words, the system S is made up of the communication units 100 owned by the respective users.
 Next, the configuration of the communication unit 100 will be described with reference to Fig. 1. Fig. 1 is a diagram showing the configuration of the system S, more specifically the configuration of each communication unit 100. Each communication unit 100 has, as its main constituent devices, a home server 1, a camera 2 as an imaging device, a microphone 3 as a sound collecting device, an infrared sensor 4, a display 5 as a video display device, and speakers 6. Among these devices, the camera 2, the microphone 3, the infrared sensor 4, the display 5, and the speakers 6 are arranged in a predetermined room in each user's home (for example, the room used when holding a face-to-face conversation).
 The home server 1 is the central device of the system S and consists of a computer, specifically a server computer constituting a home gateway. The configuration of the home server 1 is publicly known and includes a CPU, memory such as ROM and RAM, a communication interface, a hard disk drive, and the like.
 A program for executing the data processing necessary to realize the face-to-face conversation (hereinafter, the conversation program) is installed on the home server 1. A program for displaying three-dimensional video is incorporated into this conversation program. This program constructs and displays three-dimensional video by three-dimensional computer graphics (hereinafter, 3DCG) and is a so-called renderer. The 3DCG renderer also has a function of compositing a plurality of three-dimensional videos. When a video formed by compositing a plurality of three-dimensional videos, that is, a composite video, is displayed on the display 5, the individual composited three-dimensional videos appear to be arranged at mutually different positions in the depth direction of the display 5.
 The home server 1 is also connected so as to be able to communicate with communication devices via an external communication network GN such as the Internet. That is, the home server 1 belonging to the communication unit 100 owned by Mr. A communicates via the external communication network GN with the home server 1 belonging to the communication unit 100 owned by Mr. B, and the two servers exchange various kinds of data. The data transmitted and received by the home servers 1 is data necessary for the face-to-face conversation, for example video data representing each user's video and audio data representing each user's voice.
 The camera 2 is a known network camera and captures video of subjects within its imaging range (angle of view). Here, a "video" is composed of a set of consecutive frame images (RGB images); in the following description, however, the term covers both the set of frame images and the individual frame images. In the present embodiment, the imaging range of the camera 2 is fixed. Therefore, while it is running, the camera 2 always captures video of a predetermined area of the space in which it is installed.
 The camera 2 outputs a signal representing the captured video (a video signal) to the home server 1 belonging to the same communication unit 100 as the camera 2. The number of installed cameras 2 is not particularly limited, but in the present embodiment only one camera 2 is provided in each communication unit 100 in consideration of cost.
 The lens of the camera 2 faces the surface of the display 5 on which the display screen is formed. The panel of the display 5 constituting this surface (strictly speaking, the touch panel 5a, corresponding to the mirror portion) is made of transparent glass. Therefore, as shown in Fig. 2, the camera 2 captures, through the panel, video of a person located in front of the panel. Fig. 2 is a diagram showing the arrangement positions of the various devices constituting the system S in Mr. A's and Mr. B's rooms. The camera 2 may also be placed at a position away from the display 5.
 When the person who is the subject stands in front of the display 5 at a predetermined distance from it, the camera 2 can capture a full-body image of the person from the face to the feet. The "full-body image" may be a full-body image in a standing posture or a full-body image in a seated posture. The "full-body video" also includes video in which part of the body is hidden by an object placed in front of the person.
 In the system S, the camera 2 is installed at a height of about 1 m above the floor. Therefore, when the height of the person standing in front of the display 5 (strictly speaking, the height of the eyes) is greater than the installation position of the camera 2, the camera 2 captures the face of the subject from below. The height at which the camera 2 is installed (in other words, the position of the camera 2 in the vertical direction) is not particularly limited and can be set to any height.
 The microphone 3 collects the sound in the room in which it is installed and outputs the sound signal to the home server 1 (strictly speaking, the home server 1 belonging to the same communication unit 100 as the microphone 3). In the present embodiment, the microphone 3 is installed directly above the display 5 as shown in Fig. 2.
 The infrared sensor 4 is a so-called depth sensor and measures the depth of a measurement object (corresponding to the "object") by an infrared method. Specifically, the infrared sensor 4 irradiates the measurement object with infrared light from a light emitting unit 4a and measures the depth by receiving the reflected light at a light receiving unit 4b. More specifically, the light emitting unit 4a and the light receiving unit 4b of the infrared sensor 4 face the surface of the display 5 on which the display screen is formed. A film through which infrared light can pass is attached to the portion of the touch panel 5a of the display 5 located directly in front of the infrared sensor 4. The infrared light emitted from the light emitting unit 4a and reflected by the measurement object passes through this film and is then received by the light receiving unit 4b.
 In the system S, the "depth" measured is the distance from the camera 2 (strictly speaking, the lens surface of the camera 2) to the measurement object, that is, the depth distance. For this reason, in the system S, the light receiving position of the light receiving unit 4b of the infrared sensor 4 is set to be the same position as the lens surface of the camera 2 in the depth direction of the display 5 (strictly speaking, the normal direction of the display screen).
 In the system S, a depth measurement result is obtained for each pixel when the video captured by the camera 2 is divided into a predetermined number of video pieces (pixels). When the depth measurement results obtained for each pixel are assembled per video, depth data (corresponding to the "distance data") for that video is obtained. This depth data specifies the measurement result of the infrared sensor 4, that is, the depth, for each pixel of the video captured by the camera 2 (strictly speaking, of each frame image). In other words, the depth data for a video is a depth map of that video, and the pixel group corresponding to the video of an object appearing in the video captured by the camera 2 holds the depth distance (depth value) of that object. Specifically, as shown in Fig. 5 described later, the background video and the video in front of it have different depth distances, so the pixels corresponding to each clearly differ as shown in the figure. In Fig. 5, the black pixels correspond to the background video, the hatched pixels correspond to the video of an object in front of the background, and the white pixels correspond to the video of a person still further in front.
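As a simple illustration of how such per-pixel distance data can be organized and used, the following sketch labels each pixel of a depth map as background, mid-ground object, or person according to its depth distance, in the spirit of Fig. 5. The array shape, depth values, and thresholds are illustrative assumptions, not values from the specification.

```python
# A minimal sketch of a per-pixel depth map (depth distance in metres) and a
# coarse segmentation by depth; all numbers are illustrative assumptions.
import numpy as np

H, W = 480, 640
depth_map = np.full((H, W), 3.5)       # background wall, farthest from the camera
depth_map[300:480, 60:200] = 2.0       # an object in front of the background
depth_map[120:460, 260:420] = 1.2      # a person still further in front

# Coarse labels: 0 = background, 1 = object in front of the background, 2 = person.
labels = np.zeros((H, W), dtype=np.uint8)
labels[depth_map <= 2.5] = 1           # closer than the background
labels[depth_map <= 1.5] = 2           # nearest region, taken to be the person
```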
 By using the depth data described above, it is possible to extract the video of a person from the video. A method of extracting a person video using the depth data will be described later. In the system S, the position of a person can also be specified from the depth data. However, this is not a limitation; for example, a position detection sensor may be installed separately from the infrared sensor 4, and the position of the person may be specified from the detection result of that position detection sensor.
 The speakers 6 emit the sound (reproduced sound) reproduced when the home server 1 expands the audio data, and are constituted by known speakers. In the present embodiment, as shown in Fig. 2, a plurality of speakers 6 (four in Fig. 2) are installed at positions sandwiching the display 5 in the width direction of the display 5.
 The display 5 forms the video display screen. Specifically, the display 5 has a panel made of transparent glass and forms the display screen on the front surface of the panel. In the system S, this panel is the touch panel 5a and accepts operations (touch operations) performed by the user.
 Furthermore, the panel has a size sufficient to display a full-body video of a person. In the face-to-face conversation provided by the system S, the full-body video of the conversation partner is displayed at life size on the display screen formed on the front surface of the panel. That is, Mr. A's full-body video can be displayed at life size on the display 5 on Mr. B's side. As a result, Mr. B, looking at the display screen, feels as if he were actually meeting Mr. A, in particular as if facing him through a pane of glass.
 Furthermore, the display 5 of the system S normally functions as a piece of furniture arranged in the room, specifically as a full-length mirror, and forms a display screen only during a face-to-face conversation. The configuration of the display 5 is described in detail below with reference to (A) and (B) of Fig. 3. (A) and (B) of Fig. 3 show a configuration example of the display 5 used in the system S, with (A) showing the state when no conversation is taking place and (B) showing the state during a face-to-face conversation.
 The touch panel 5a of the display 5 constitutes part of the full-length mirror placed in the room where the face-to-face conversation is held, specifically its mirror portion. As shown in (A) of Fig. 3, the touch panel 5a does not form a display screen when no conversation is taking place, that is, while no video is being displayed. In other words, the display 5 of the system S presents the appearance of a full-length mirror when not in use for conversation. On the other hand, when a face-to-face conversation is started, the touch panel 5a forms a display screen on its front surface. As a result, as shown in (B) of Fig. 3, the display 5 displays the video of the conversation partner and the background on the front surface of the touch panel 5a.
 Incidentally, the home server 1 switches the display screen on and off according to the measurement result of the infrared sensor 4. More specifically, when the user stands in front of the display 5 to start a face-to-face conversation, the camera 2 captures video including the user (hereinafter, the real video) and the infrared sensor 4 measures the depth. Depth data for the real video is thereby acquired, and the home server 1 specifies the distance between the user and the camera 2, that is, the depth distance, based on that depth data. When this depth distance is equal to or less than a predetermined distance, the home server 1 controls the display 5 to form a display screen on the front surface of the touch panel 5a. As a result, the touch panel 5a of the display 5, which until then had functioned as a full-length mirror, comes to function as a screen for displaying video. Conversely, when the depth distance becomes equal to or greater than the predetermined distance, the home server 1 controls the display 5 to turn off the display screen that had been formed. The display 5 thereby functions as a full-length mirror again.
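A minimal sketch of this switching logic might look like the following. The threshold value, the small hysteresis margin (added so the screen does not flicker around the boundary), and the function name are assumptions for illustration, not part of the specification.

```python
# A minimal sketch of switching the display screen on and off from the measured
# depth distance; the threshold and margin values are illustrative assumptions.
ON_THRESHOLD_M = 2.0    # form the display screen when the user is this close
OFF_MARGIN_M = 0.3      # small hysteresis so the screen does not flicker

def update_screen_state(depth_distance_m: float, screen_on: bool) -> bool:
    """Return the new on/off state of the display screen."""
    if not screen_on and depth_distance_m <= ON_THRESHOLD_M:
        return True                       # user stepped in front: show video
    if screen_on and depth_distance_m >= ON_THRESHOLD_M + OFF_MARGIN_M:
        return False                      # user walked away: back to mirror mode
    return screen_on
```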
 As described above, in the system S the display 5 is used as a full-length mirror when no conversation is taking place, which makes the presence of the display screen hard to notice at such times. During a face-to-face conversation, on the other hand, the display screen is formed and the video of the conversation partner is displayed, so that the user obtains the visual effect of conversing with the partner through a pane of glass. As a configuration serving both as a video display screen and as a mirror, a known configuration such as the one described in International Publication No. WO 2009/122716 can be used. The display 5 is not limited to a configuration also used as a full-length mirror; any device used as the display 5 need only have a size sufficient to display the full-body video of the conversation partner. From the viewpoint of making the presence of the display screen hard to notice when no conversation is taking place, furniture or building materials installed in the room used for face-to-face conversation and having a mirror-like surface are suitable; for example, a door (glass door) or a window (glass window) may be used as the display 5. The display 5 is also not limited to devices doubling as building materials such as doors and windows or as furniture such as a full-length mirror, and may be an ordinary display that always forms a display screen while it is running.
 <<Video Composition>>
 In a face-to-face conversation using the system S, Mr. A's video and its background video are displayed on Mr. B's display 5, and Mr. B's video and its background video are displayed on Mr. A's display 5. Here, the person video and the background video displayed on each display 5 are not captured by the camera 2 at the same time but at different timings. That is, each display 5 displays a composite video in which a person video and a background video captured at different timings are composited. In the system S, the composite video further composites a foreground video in addition to the person video and the background video.
 The video composition procedure is outlined below with reference to Fig. 4. Fig. 4 is an explanatory diagram of the video composition procedure. In the following description, the case of compositing Mr. A's video, the background video, and the foreground video is taken as a specific example.
 Among the videos to be composited, the background video (denoted by the symbol Pb in Fig. 4) is the video of the area within the imaging range of the camera 2 in the room Mr. A uses for face-to-face conversation. In the present embodiment, the camera 2 captures the background video when Mr. A is not in that room; that is, the background video is captured on its own. The timing of capturing the background video can be set arbitrarily as long as it falls within a period when Mr. A is not in the room.
 On the other hand, the person video (specifically Mr. A's video, denoted by the symbol Pu in Fig. 4) is captured while Mr. A is in the room, strictly speaking within the imaging range of the camera 2. Here, the video captured by the camera 2 (that is, the real video) contains the background video and the foreground video in addition to the person video. In the system S, the person video is extracted from the real video and then used. The method of extracting the person video from the real video is not particularly limited; one example is a method of extracting the person video using the depth data described above. A method of extracting a person video using depth data is described below with reference to Fig. 5. Fig. 5 is an explanatory diagram of the procedure for extracting a person video from a real video. In Fig. 5, for convenience of illustration, the pixels constituting the depth data are drawn coarser than the actual pixels.
 While the camera 2 is capturing video, the infrared sensor 4 measures the depth of the measurement objects within the angle of view of the camera 2. As a result, depth data for the real video is obtained. The depth data for the real video specifies the measurement result of the infrared sensor 4, that is, the depth, for each pixel when a frame image constituting the real video is divided into a predetermined number of pixels. In the depth data for the real video, as shown in Fig. 5, the depth clearly differs between the pixels belonging to the person video (the white pixels in the figure) and the pixels belonging to the other videos (the black pixels and hatched pixels in the figure).
 Then, Mr. A's skeleton model is specified based on the depth data and the video captured by the camera 2 (strictly speaking, information for specifying the position of Mr. A's face in the captured video). As shown in Fig. 5, the skeleton model is a simplified model of positional information about Mr. A's skeleton (specifically the head, shoulders, elbows, wrists, center of the upper body, hips, knees, and ankles). A known method can be used to acquire the skeleton model; for example, a method similar to those adopted in the inventions described in JP 2014-155693 A and JP 2013-116311 A may be used.
 After the skeleton model is specified, the person video is extracted from the real video based on the skeleton model. A detailed description of the technique for extracting a person video from a real video based on a skeleton model is omitted in this specification, but the rough procedure is as follows: the pixel group belonging to Mr. A's person video is specified in the depth data based on the specified skeleton model, and the region corresponding to the specified pixel group is then extracted from the real video. The video extracted by this procedure corresponds to Mr. A's person video in the real video.
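As an illustration of the final step only (cutting out the image region corresponding to the person's pixel group), the following sketch assumes the pixel group has already been identified, for example via the skeleton model, and simply masks the real-video frame with it. The names and array shapes are illustrative assumptions.

```python
# A minimal sketch of cutting the person region out of a real-video frame once
# the pixel group belonging to the person has been identified; illustrative only.
import numpy as np

def extract_person(frame_rgb: np.ndarray, person_mask: np.ndarray) -> np.ndarray:
    """Return an RGBA image containing only the person pixels.

    frame_rgb   : (H, W, 3) uint8 frame captured by the camera.
    person_mask : (H, W) bool array, True where the depth data says "person".
    """
    h, w, _ = frame_rgb.shape
    rgba = np.zeros((h, w, 4), dtype=np.uint8)
    rgba[..., :3] = frame_rgb
    rgba[..., 3] = np.where(person_mask, 255, 0)   # transparent outside the person
    return rgba
```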
 In the system S, the foreground video (denoted by the symbol Pf in Fig. 4) is also extracted from the real video and used, in the same way as the person video. The method of extracting the foreground video from the real video is not particularly limited, but one example is a method of extracting the foreground video using the depth data, as with the person video. Specifically, in the depth data for the real video, a pixel group whose depth distance is smaller than that of the pixels belonging to the person video is specified, and the portion of the real video corresponding to the specified pixel group is extracted as the foreground video.
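A minimal sketch of this foreground selection, assuming the person pixels have already been identified, compares each remaining pixel's depth distance with the nearest depth found among the person pixels; the margin value is an illustrative assumption.

```python
# A minimal sketch of picking foreground pixels as those whose depth distance is
# smaller than the person's; the margin value is an illustrative assumption.
import numpy as np

def foreground_mask(depth_map: np.ndarray, person_mask: np.ndarray,
                    margin_m: float = 0.05) -> np.ndarray:
    """Return a bool mask of pixels closer to the camera than the person.

    Assumes person_mask contains at least one True pixel.
    """
    person_nearest = depth_map[person_mask].min()   # nearest depth on the person
    return (~person_mask) & (depth_map < person_nearest - margin_m)
```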
 After the person video and the foreground video have been extracted from the real video by the procedure described above, the background video, the person video, and the foreground video are composited. Specifically, within the background video captured by the camera 2, the portion actually displayed on the display 5 (the range surrounded by the broken line in Fig. 4, hereinafter the display range) is set. The display range corresponds to the portion of the background video captured by the camera 2 that is included in the composite video. The size of the display range is determined according to the size of the display 5. In the present embodiment, the initial (default) display range is set to the central portion of the background video; however, the initial display range is not particularly limited and may be a portion other than the central portion of the background video.
 The display range of the background video, the extracted person video, and the extracted foreground video are then composited to obtain a composite video (denoted by the symbol Pm in Fig. 4). As a result, as shown in Fig. 4, the display 5 on Mr. B's side displays a video in which Mr. A is positioned in front of the background and the foreground is positioned in front of Mr. A.
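The layering itself can be illustrated by painting the three videos back to front. The following sketch assumes RGBA inputs already cropped and scaled to the display range, and is only meant to show the depth ordering background → person → foreground.

```python
# A minimal sketch of compositing the three layers back to front; all inputs are
# assumed to be (H, W, 4) uint8 RGBA images of the same size (the display range).
import numpy as np

def composite(background: np.ndarray, person: np.ndarray,
              foreground: np.ndarray) -> np.ndarray:
    """Alpha-composite person over background, then foreground over the result."""
    out = background.astype(np.float32).copy()
    for layer in (person, foreground):               # nearer layers painted last
        alpha = layer[..., 3:4].astype(np.float32) / 255.0
        out[..., :3] = layer[..., :3] * alpha + out[..., :3] * (1.0 - alpha)
    return out.astype(np.uint8)
```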
 As described above, in the system S a composite video is displayed as the display video on the display 5. In a configuration that displays a composite video, the display position, display size, and so on can be adjusted individually for each of the person video, the background video, and the foreground video. For example, the display size of Mr. A's video, which is the person video, can be adjusted without changing the display sizes of the background video and the foreground video.
 In the system S, the display size of Mr. A's video is adjusted so as to match Mr. A's actual size (life size). As a result, Mr. A's video is displayed at life size on the display 5 on Mr. B's side, and the sense of presence of the face-to-face conversation using the system S is further improved. However, the display size of the person video is not limited to life size. Here, life size means the size obtained when the person video, captured while the person is in front of the camera 2 and away from it by a predetermined distance (specifically, the distance d1 in Fig. 10B described later, hereinafter the reference distance), is displayed as it is. The reference distance d1 is set in advance and stored in the memory of the home server 1.
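One simple way to keep the displayed person at a constant apparent size is to scale the person video by the ratio of the current depth distance to the reference distance, since the image of a subject captured by a pinhole camera shrinks roughly in proportion to its distance. The sketch below is an illustration under that assumption, not the specification's exact adjustment procedure.

```python
# A minimal sketch of keeping the person's display size constant: a subject at
# depth d appears roughly (d_reference / d) times its size at the reference
# distance, so scaling by d / d_reference undoes the change.  Illustrative only.
def display_scale(depth_distance_m: float, reference_distance_m: float) -> float:
    """Scale factor applied to the extracted person video before compositing."""
    return depth_distance_m / reference_distance_m

# Example: if the reference distance is 1.5 m and the person steps back to 2.0 m,
# the person video is enlarged by 2.0 / 1.5 so the displayed size stays life size.
```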
 <<Generation of Three-Dimensional Video>>
 In the system S, three-dimensional video is displayed on the display 5. More specifically, as described in the previous section, a composite video combining the background video, the person video, and the foreground video is displayed on the display 5, and each of the videos to be composited is a three-dimensionalized video (three-dimensional video). This three-dimensional video is obtained by executing 3DCG rendering processing using the two-dimensional video captured by the camera 2 (specifically, video consisting of frame images in RGB format) and the depth data for that video. Here, the rendering processing is, strictly speaking, surface-rendering video display processing, that is, processing for generating the three-dimensional video as seen from a virtually set viewpoint.
 In the system S, processing that adopts texture mapping is executed as the rendering processing. The procedure for generating a three-dimensional video is described below with reference to Fig. 6. Fig. 6 is an explanatory diagram of the procedure for generating a three-dimensional video. The mesh model in the figure is drawn coarser than the actual mesh size for convenience of illustration. In the following, the case of generating a three-dimensional video of Mr. A is taken as an example.
 Mr. A's video captured by the camera 2 (strictly speaking, Mr. A's video extracted from the real video) is a two-dimensional video and is used as the texture in texture mapping. On the other hand, the depth data (that is, the depth map) acquired for the real video containing Mr. A's video is used to construct the mesh model that forms the framework of the three-dimensional video. Here, the mesh model represents the person (Mr. A) as a polygon mesh. A known method can be used to construct a mesh model from depth data (a depth map).
 After the mesh model is obtained, as shown in Fig. 6, a three-dimensional video of Mr. A, that is, a video with a sense of depth, can be generated by pasting the two-dimensional video serving as the texture (specifically, Mr. A's video) onto the mesh model. A three-dimensional video is generated by such texture mapping, and by further performing processing such as translation and rotation, a three-dimensional video as seen from a different viewpoint can be acquired. This also makes it possible to acquire a three-dimensional video of Mr. A's face as seen from below or from the side.
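The overall idea can be sketched as follows: each pixel of the depth map is back-projected into 3-D space to form a grid of vertices, neighbouring pixels are joined into triangles (the polygon mesh), and the RGB frame supplies per-vertex texture coordinates for the texture mapping. The camera intrinsics used here are illustrative assumptions, and a real renderer would then rasterize this mesh from the chosen viewpoint.

```python
# A minimal sketch of building a textured mesh from a depth map; the intrinsics
# (fx, fy, cx, cy) are illustrative assumptions.
import numpy as np

def depth_to_textured_mesh(depth_map, fx=525.0, fy=525.0, cx=None, cy=None):
    """Return (vertices, faces, uvs) built from an (H, W) depth map in metres."""
    h, w = depth_map.shape
    cx = (w - 1) / 2.0 if cx is None else cx
    cy = (h - 1) / 2.0 if cy is None else cy

    # Back-project every pixel (u, v, depth) into camera coordinates (x, y, z).
    v, u = np.mgrid[0:h, 0:w]
    z = depth_map
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    vertices = np.stack([x, y, z], axis=-1).reshape(-1, 3)

    # Texture coordinates: each vertex samples its own pixel in the RGB frame.
    uvs = np.stack([u / (w - 1), v / (h - 1)], axis=-1).reshape(-1, 2)

    # Two triangles per grid cell, connecting neighbouring pixels into a mesh.
    idx = np.arange(h * w).reshape(h, w)
    a, b = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    c, d = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    faces = np.concatenate([np.stack([a, b, c], 1), np.stack([b, d, c], 1)])
    return vertices, faces, uvs
```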
 For the background and the foreground as well, three-dimensional video can be generated by the same procedure as for a person. That is, a three-dimensional video of the background is acquired by executing rendering processing by texture mapping using the background video captured by the camera 2 and the depth data acquired for the background video. Likewise, a three-dimensional video of the foreground is acquired by executing rendering processing by texture mapping using the foreground video captured by the camera 2 (strictly speaking, the foreground video extracted from the real video) and the depth data acquired for the foreground video (strictly speaking, the depth data for the real video containing the foreground video).
 Although texture mapping is used in the system S, the rendering processing for acquiring a three-dimensional video is not limited to processing that uses texture mapping; for example, it may be rendering processing that uses bump mapping.
 In the depth data, there is a possibility that missing portions will occur, that is, pixels for which a depth measurement result cannot be obtained for some reason. Missing portions are particularly likely to occur near the boundary (edge) between the person video and the background video. When such a missing portion occurs and its position can be specified, the two-dimensional video serving as the texture may simply be pasted onto the missing portion in the texture mapping, or the surrounding video may be pasted onto it. Also, when a missing portion occurs near the edge of the pixel group corresponding to the person video in the depth data, a pixel group slightly larger than that pixel group may be extracted in the texture mapping and the two-dimensional video corresponding to that larger pixel group pasted.
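As a rough illustration of one way to patch such missing depth pixels from their surroundings (related to the fallbacks described above, though not the specification's exact method), the following sketch repeatedly fills each invalid pixel with the average of its valid 4-neighbours; the invalid marker value and the neighbourhood choice are assumptions.

```python
# A minimal sketch of patching missing depth pixels (marked 0 here) from their
# valid neighbours by repeated 4-neighbour propagation; purely illustrative.
import numpy as np

def fill_depth_holes(depth_map: np.ndarray, invalid_value: float = 0.0,
                     max_iterations: int = 20) -> np.ndarray:
    depth = depth_map.astype(np.float32).copy()
    for _ in range(max_iterations):
        holes = depth == invalid_value
        if not holes.any():
            break
        # Shift the map one pixel in each direction to collect the 4-neighbours.
        padded = np.pad(depth, 1, mode="edge")
        neighbours = np.stack([padded[:-2, 1:-1], padded[2:, 1:-1],
                               padded[1:-1, :-2], padded[1:-1, 2:]])
        valid = neighbours != invalid_value
        counts = valid.sum(axis=0)
        sums = np.where(valid, neighbours, 0.0).sum(axis=0)
        # Average of the valid neighbours; pixels with no valid neighbour wait
        # for a later iteration.
        fill = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
        depth[holes & (counts > 0)] = fill[holes & (counts > 0)]
    return depth
```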
 <<Functions of the Home Server>>
 Next, the functions of the home server 1, in particular the functions related to video display processing, will be described. Mr. A's home server 1 and Mr. B's home server 1 both have the same functions and perform the same data processing while communicating bidirectionally when a face-to-face conversation is held. For this reason, only the functions of one home server 1 (for example, Mr. B's home server 1) are described below.
 The home server 1 fulfils its functions as the home server 1 when its CPU executes the conversation program; specifically, it executes a series of data processing related to the face-to-face conversation. The configuration of the home server 1 is described here from the functional standpoint, in particular from the standpoint of the video display function, with reference to Fig. 7. Fig. 7 is a diagram showing the configuration of the home server 1 in terms of its functions.
 As shown in Fig. 7, the home server 1 includes a data transmission unit 11, a data reception unit 12, a background video storage unit 13, a first depth data storage unit 14, a real video storage unit 15, a person video extraction unit 16, a skeleton model storage unit 17, a second depth data storage unit 18, a foreground video extraction unit 19, a height detection unit 20, a three-dimensional video generation unit 21, a composite video display unit 22, a determination unit 23, and a face movement detection unit 24. Each of these data processing units is realized by the hardware devices of the home server 1 (specifically, the CPU, memory, communication interface, hard disk drive, and the like) cooperating with the conversation program as software. Each data processing unit is described below.
 The data transmission unit 11 digitizes the signal of the video captured by the camera 2 on Mr. B's side and transmits it as video data to the home server 1 on Mr. A's side. The video data transmitted by the data transmission unit 11 is classified into two types. One is video data of the background video, specifically data representing the video of the room corresponding to the background, captured when Mr. B is not in it (strictly speaking, the video of the area within the imaging range of the camera 2). The other is video data of the real video, that is, data representing the video captured while Mr. B is in the room, more specifically the video of Mr. B together with the background and foreground.
 When transmitting the video data of the background video, the data transmission unit 11 also generates depth data for the background video based on the measurement result of the infrared sensor 4 and transmits that depth data together with the video data of the background video. This depth data is used when executing the rendering processing for acquiring the three-dimensional video of the background, and also when specifying the distance (depth distance) between the background and the camera 2. Similarly, when transmitting the video data of the real video, the data transmission unit 11 generates depth data for the real video based on the measurement result of the infrared sensor 4 and transmits that depth data together with the video data of the real video. This depth data is used when extracting the person video (specifically, Mr. B's video) and the foreground video from the real video. It is also used when executing the rendering processing for acquiring Mr. B's three-dimensional video and the rendering processing for acquiring the three-dimensional video of the foreground. Furthermore, it is used when specifying the distance (depth distance) between Mr. B and the camera 2.
 The data reception unit 12 receives various data transmitted from the home server 1 on Mr. A's side. The data received by the data reception unit 12 includes the video data of the background video and the depth data for the background video, as well as the video data of the real video and the depth data for the real video. The video data of the background video received by the data reception unit 12 is data representing the video of the room corresponding to the background, captured when Mr. A is not in it. By receiving the video data of the background video in this way, the data reception unit 12 acquires the background video captured by the camera 2 on Mr. A's side. In this sense, the data reception unit 12 corresponds to a video acquisition unit.
 The depth data for the background video received by the data reception unit 12 is used when executing the rendering processing for acquiring the three-dimensional video of the background, and also when specifying the distance (depth distance) between the background and the camera 2. In the following, the depth data for the background video received by the data reception unit 12 is referred to as the "first depth data".
 The video data of the real video received by the data reception unit 12 is data representing the video of Mr. A, the background, and the foreground captured while Mr. A is in the room. The depth data for the real video received by the data reception unit 12 is used when extracting Mr. A's video and the foreground video from the real video. It is also used when executing the rendering processing for acquiring Mr. A's three-dimensional video and the rendering processing for acquiring the three-dimensional video of the foreground. Furthermore, it is used when specifying the distance (depth distance) between Mr. A and the camera 2 and the distance (depth distance) between the foreground and the camera 2. In the following, the depth data for the real video received by the data reception unit 12 is referred to as the "second depth data".
 As described above, by receiving the first depth data and the second depth data from the home server 1 on Mr. A's side, the data reception unit 12 acquires the depth data for the background video, the depth data for the person video, and the depth data for the foreground video. In this sense, the data reception unit 12 corresponds to a distance data acquisition unit that acquires depth data as the distance data.
 The background video storage unit 13 stores the video data of the background video received by the data reception unit 12. The first depth data storage unit 14 stores the depth data for the background video received by the data reception unit 12, that is, the first depth data. The real video storage unit 15 stores the video data of the real video received by the data reception unit 12.
 The person video extraction unit 16 expands the video data of the real video received by the data reception unit 12 and extracts the person video (that is, Mr. A's video) from the real video. The skeleton model storage unit 17 stores the skeleton model (specifically, Mr. A's skeleton model) used when the person video extraction unit 16 extracts the person video. The second depth data storage unit 18 stores the depth data for the real video received by the data reception unit 12, that is, the second depth data.
 When extracting Mr. A's video from the real video, the person video extraction unit 16 reads the real video from the real video storage unit 15 and the second depth data for that real video from the second depth data storage unit 18. The person video extraction unit 16 then specifies Mr. A's skeleton model from the read second depth data and the video captured by the camera 2. The specified skeleton model of Mr. A is stored in the skeleton model storage unit 17. Thereafter, the person video extraction unit 16 reads Mr. A's skeleton model from the skeleton model storage unit 17 and extracts the person video, that is, Mr. A's video, from the real video based on the skeleton model. By extracting the person video from the real video in this way, the person video extraction unit 16 acquires Mr. A's video captured by the camera 2 on Mr. A's side. In this sense, the person video extraction unit 16 corresponds to a video acquisition unit.
 The foreground video extraction unit 19 expands the video data of the real video received by the data reception unit 12 and extracts the foreground video from the real video. Specifically, when extracting the foreground video from the real video, the foreground video extraction unit 19 reads the real video from the real video storage unit 15 and the second depth data for that real video from the second depth data storage unit 18. The foreground video extraction unit 19 then extracts, from the read second depth data, the pixel group corresponding to the foreground video. Here, the pixel group corresponding to the foreground video is a pixel group whose depth distance is smaller than that of the pixel group extracted from the second depth data by the person video extraction unit 16 (that is, the pixel group corresponding to the person video). The foreground video extraction unit 19 then extracts, as the foreground video, the portion of the real video read from the real video storage unit 15 that corresponds to that pixel group. By extracting the foreground video from the real video in this way, the foreground video extraction unit 19 acquires the foreground video captured by the camera 2 on Mr. A's side. In this sense, the foreground video extraction unit 19 corresponds to a video acquisition unit.
 The height detection unit 20 detects the height of Mr. A's eyes based on the data received from the home server 1 on Mr. A's side. Specifically, the height detection unit 20 reads the second depth data from the second depth data storage unit 18 and extracts the pixel group corresponding to the person video from the read second depth data. The height detection unit 20 then specifies the pixels corresponding to the eyes from the extracted pixel group and derives the eye height from the positions of the specified pixels. The detection result regarding the eye height is passed to the three-dimensional video generation unit 21, which generates a three-dimensional video (in particular, a three-dimensional video of the person) according to that detection result. This is described in detail in the next section.
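How the eye pixels are turned into a physical height is not spelled out here; the following sketch assumes the eye pixels have already been located (for example via the skeleton model or a face detector) and converts their image rows and depth distances into a height above the floor using simple pinhole geometry. The camera intrinsics, the installation height, and the assumption of a horizontally facing camera are all illustrative.

```python
# A minimal sketch of deriving an eye height from the depth data once the eye
# pixels have been located; intrinsics and camera height are illustrative.
import numpy as np

def eye_height_from_depth(eye_rows, eye_depths_m, camera_height_m=1.0,
                          fy=525.0, cy=239.5):
    """Estimate eye height above the floor [m].

    eye_rows     : image row indices of the pixels identified as the eyes.
    eye_depths_m : depth distances of those pixels [m].
    """
    eye_rows = np.asarray(eye_rows, dtype=np.float32)
    eye_depths_m = np.asarray(eye_depths_m, dtype=np.float32)
    # Vertical offset of each eye pixel from the camera's optical axis
    # (image rows increase downwards, hence the sign flip).
    dy = (cy - eye_rows) * eye_depths_m / fy
    return float(camera_height_m + dy.mean())
```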
 なお、目の高さを特定する方法については、特に制限されるものではなく、公知の方法を利用することが可能である。具体的に説明すると、本システムSでは第2深度データに基づいて目の高さを検知することとしたが、これに限定されず、例えば、Aさんの映像を含む実映像を解析して目の高さを検知してもよい。 It should be noted that the method for specifying the eye height is not particularly limited, and a known method can be used. Specifically, in the present system S, the eye height is detected based on the second depth data. However, the present invention is not limited to this. For example, the system S analyzes the actual video including the video of Mr. May be detected.
 三次元映像生成部21は、3DCGのレンダリング処理を実行して三次元映像を取得する。具体的に説明すると、三次元映像生成部21は、背景映像記憶部13に記憶された背景映像と、第1深度データ記憶部14に記憶された背景映像についての第1深度データと、を用いたレンダリング処理を実行して背景の三次元映像を生成する。なお、三次元映像生成部21は、背景の三次元映像を生成する際、背景映像記憶部13に記憶された背景映像のうち、直近で取得された背景映像を用いることになっている。同様に、第1深度データ記憶部14に記憶された第1深度データについても、直近で取得された第1深度データを用いることになっている。 The 3D video generation unit 21 executes 3DCG rendering processing to acquire a 3D video. Specifically, the 3D video generation unit 21 uses the background video stored in the background video storage unit 13 and the first depth data regarding the background video stored in the first depth data storage unit 14. The 3D image of the background is generated by executing the rendering process. Note that the 3D video generation unit 21 uses the most recently acquired background video among the background videos stored in the background video storage unit 13 when generating the background 3D video. Similarly, the first depth data acquired most recently is used for the first depth data stored in the first depth data storage unit 14.
 また、三次元映像生成部21は、人物映像抽出部16が抽出した人物映像(具体的にはAさんの映像)と、第2深度データ記憶部18に記憶された第2深度データ(厳密には、第2深度データ中、人物映像と対応する画素群のデータ)とを用いたレンダリング処理を実行して人物(Aさん)の三次元映像を生成する。同様に、三次元映像生成部21は、前景映像抽出部19が抽出した前景映像と、第2深度データ記憶部18に記憶された第2深度データ(厳密には、第2深度データ中、前景映像に相当する画素群のデータ)とを用いたレンダリング処理を実行して前景の三次元映像を生成する。なお、本システムSでは、上述したように、レンダリング処理としてテクスチャマッピングを採用した処理を実行する。 In addition, the 3D video generation unit 21 extracts the person video extracted by the person video extraction unit 16 (specifically, the video of Mr. A) and the second depth data (strictly, stored in the second depth data storage unit 18). Performs the rendering process using the pixel data corresponding to the person image in the second depth data to generate a 3D image of the person (Mr. A). Similarly, the 3D video generation unit 21 uses the foreground video extracted by the foreground video extraction unit 19 and the second depth data stored in the second depth data storage unit 18 (strictly speaking, the foreground in the second depth data). A foreground 3D image is generated by executing a rendering process using pixel group data corresponding to the image). Note that, in the present system S, as described above, processing using texture mapping is executed as rendering processing.
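 As a hedged sketch of how a textured 3D representation can be built from a video frame and its depth data prior to rendering, the following back-projects each masked pixel into a colored 3D point; the pinhole intrinsic parameters fx, fy, cx and cy are assumptions not stated in the text, and the resulting point set would then be rendered (texture-mapped) from the desired viewpoint.

    import numpy as np

    def backproject_textured_points(color_frame, depth_frame, fx, fy, cx, cy, mask=None):
        # Back-project masked pixels into colored 3D points using the depth data.
        # fx, fy, cx, cy are assumed pinhole intrinsics of camera 2 (not given in the text).
        h, w = depth_frame.shape
        if mask is None:
            mask = np.ones((h, w), dtype=bool)
        v, u = np.nonzero(mask)                      # pixel rows and columns
        z = depth_frame[v, u]                        # depth distance per pixel
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        points = np.stack([x, y, z], axis=1)         # Nx3 positions in the camera frame
        colors = color_frame[v, u]                   # Nx3 colors used as the texture
        return points, colors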
 The composite video display unit 22 combines the 3D videos of the background, the person, and the foreground generated by the 3D video generation unit 21, and displays the composite video on the display 5 on Mr. B's side. The composite video display unit 22 also selects, from the background 3D video generated by the 3D video generation unit 21, the video to be included in the composite video, that is, the display range. The composite video display unit 22 then displays on Mr. B's display 5 a composite video in which Mr. A is positioned in front of the selected display range and the foreground is positioned in front of Mr. A.
 The determination unit 23 determines whether the distance between the camera 2 on Mr. A's side and Mr. A (that is, Mr. A's depth distance) has changed during the period in which the composite video display unit 22 is displaying the composite video on the display 5 (in other words, during the period in which the camera 2 on Mr. A's side is capturing Mr. A's video). This determination is made based on the second depth data stored in the second depth data storage unit 18. When the determination unit 23 determines that the depth distance has changed, the determination result is handed over to the composite video display unit 22, and the composite video display unit 22 displays on the display 5 a composite video corresponding to that determination result. This is described in detail in the next section.
 The face movement detection unit 24 generates depth data about the real video captured by the camera 2 on Mr. B's side based on the measurement results of the infrared sensor 4, and detects from that depth data whether Mr. B's face has moved laterally. More specifically, during the period in which the composite video is displayed on the display 5 by the composite video display unit 22, the face movement detection unit 24 identifies the pixel group corresponding to Mr. B's video from the above depth data and monitors changes in the position of that pixel group. When the face movement detection unit 24 recognizes a change in the position of that pixel group, it detects that Mr. B's face has moved laterally. Here, lateral movement means that Mr. B's face moves in the left-right direction with respect to the display 5 on Mr. B's side (the width direction of the display 5).
 The detection result indicating that Mr. B's face has moved laterally is handed over to the composite video display unit 22, and the composite video display unit 22 displays on the display 5 a composite video corresponding to that detection result. This is described in detail in the next section.
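 The following is a minimal sketch, not the system's actual implementation, of detecting the lateral movement of the viewer's face from two depth frames by monitoring the horizontal centroid of the viewer's pixel group; the depth band and the movement threshold are assumed values.

    import numpy as np

    def detect_lateral_move(prev_depth, curr_depth, depth_band=(0.5, 2.5), threshold_px=15):
        # Detect left-right movement of the viewer's face by tracking the horizontal
        # centroid of the viewer's pixel group in two depth frames.
        # depth_band (meters) and threshold_px are assumed values.
        def centroid_x(depth):
            near, far = depth_band
            mask = (depth > near) & (depth < far)    # pixel group of the viewer
            cols = np.nonzero(mask)[1]
            return cols.mean() if cols.size else None
        x0, x1 = centroid_x(prev_depth), centroid_x(curr_depth)
        if x0 is None or x1 is None:
            return False, 0, 0.0
        shift = x1 - x0
        return abs(shift) >= threshold_px, int(np.sign(shift)), float(shift)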
 <<About the processes for improving the sense of realism of the face-to-face conversation>>
 In the present system S, in order to improve the sense of realism of the face-to-face conversation using the system, the video displayed on the display 5 and its display size are adjusted and changed in accordance with each user's line of sight and face position. Specifically, the following video display processes (R1) to (R3) are performed.
  (R1) Process for aligning the eye-line height
  (R2) Process performed when the face moves
  (R3) Process performed when the depth distance changes
 Each of the above three video display processes is described individually below. In the following description, the case in which a composite video including Mr. A's 3D video is displayed on the display 5 on Mr. B's side is taken as an example.
 <About the process for aligning the eye-line height>
 In the present system S, as described above, the camera 2 is installed at a height of about 1 m from the floor. Therefore, depending on Mr. A's height, the height of Mr. A's eyes may differ from the height at which the camera 2 is installed. In such a case, the video of Mr. A displayed on the display 5 on Mr. B's side differs from the appearance (image) of Mr. A that would be seen when actually facing Mr. A.
 More specifically, when the height of Mr. A's eyes is higher than the installation height of the camera 2, that camera 2 captures the video of Mr. A's face from below. During this time, since Mr. A is viewing the display 5 on Mr. A's side from the front, Mr. A's line of sight is directed toward the front. Under these circumstances, as shown in FIG. 8(A), the video of Mr. A displayed on the display 5 on Mr. B's side (strictly speaking a 3D video, shown in simplified form in FIG. 8(A)) becomes a video that looks up at Mr. A's face. FIG. 8 is an explanatory diagram of the process for aligning the eye-line height, and (A) in the figure shows the video of Mr. A captured from the actual camera position.
 When a video that looks up at Mr. A's face is displayed on the display 5 as described above, Mr. A's face appears in the displayed video with the line of sight directed somewhat upward rather than toward the front, as shown in FIG. 8(A). In such a case, it becomes difficult to match the line of sight of Mr. A displayed on the display 5 with the line of sight of Mr. B looking at the display 5, and the sense of realism of the face-to-face conversation may be impaired.
 Therefore, in the present system S, when the height of Mr. A's eyes differs from the installation height of the camera 2, the process for aligning the eye-line height is performed in order to match the line of sight of Mr. A displayed on the display 5 with the line of sight of Mr. B looking at the display 5. In this process, 3DCG rendering processing is executed to acquire a 3D video of Mr. A as seen from a virtual viewpoint located at the same height as Mr. A's eyes. More specifically, in performing the process for aligning the eye-line height, the home server 1 on Mr. B's side (strictly speaking, the above-described height detection unit 20) detects the height of Mr. A's eyes. Meanwhile, the home server 1 on Mr. B's side stores information on the height at which the camera 2 on Mr. A's side is installed. When the height of Mr. A's eyes and the installation height of the camera 2 on Mr. A's side differ from each other, the home server 1 on Mr. B's side (strictly speaking, the above-described 3D video generation unit 21) executes rendering processing for acquiring the 3D video of Mr. A as seen from a virtual viewpoint located at the detected eye height.
 The above rendering processing is described with reference to FIG. 8(B). FIG. 8(B) is a diagram showing the positional relationship between the camera 2 and Mr. A's line of sight. When the height of Mr. A's eyes and the installation height of the camera 2 on Mr. A's side differ, the home server 1 on Mr. B's side determines the difference between the two (denoted by the symbol H in FIG. 8(B)). In addition, based on the stored second depth data, the home server 1 on Mr. B's side determines the distance between Mr. A and the camera 2 on Mr. A's side (that is, Mr. A's depth distance, denoted by the symbol L in FIG. 8(B)). The home server 1 on Mr. B's side then determines the difference between the imaging direction of a virtual camera installed at the same height as the detected height of Mr. A's eyes (shown by the broken line in FIG. 8(B)) and the imaging direction of the actual camera 2. Specifically, the angle α obtained by the following equation (1) is calculated as this difference.
    α = arctan(H/L)            (1)
 The home server 1 on Mr. B's side then uses the calculated angle α to execute rendering processing for acquiring the video (3D video) of Mr. A as captured from the above virtual camera. Specifically, texture mapping is performed using the video of Mr. A captured by the camera 2 (strictly speaking, the video of Mr. A extracted from the real video) and the second depth data, which is the depth data about the real video, and video processing is further performed to displace the viewpoint by an amount corresponding to the calculated angle α. As a result, the 3D video of Mr. A as captured from the virtual camera, in other words, the 3D video of Mr. A with the line of sight directed toward the front as shown in FIG. 8(C), is acquired. FIG. 8(C) shows the video of Mr. A captured from the virtual camera position.
 Thereafter, the home server 1 on Mr. B's side (strictly speaking, the above-described composite video display unit 22) combines the 3D video of Mr. A acquired by the above procedure with the 3D videos of the background and the foreground, and displays the composite video on the display 5 on Mr. B's side.
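 A minimal numerical sketch of equation (1) and the viewpoint displacement follows; the rotation of the reconstructed 3D points about the horizontal axis is one possible way to approximate the virtual-camera view and is an assumption, not necessarily the exact rendering used by the system.

    import numpy as np

    def eye_line_angle(eye_height, camera_height, depth_distance):
        # Equation (1): angle between the real and virtual imaging directions.
        H = eye_height - camera_height               # height difference H
        L = depth_distance                           # depth distance L of Mr. A
        return np.arctan2(H, L)                      # alpha, in radians

    def view_from_virtual_camera(points, alpha):
        # Rotate reconstructed 3D points by alpha about the horizontal (x) axis,
        # approximating the view from a camera raised to the subject's eye height.
        c, s = np.cos(alpha), np.sin(alpha)
        rot_x = np.array([[1.0, 0.0, 0.0],
                          [0.0,   c,  -s],
                          [0.0,   s,   c]])
        return points @ rot_x.T

    # Example: eyes at 1.6 m, camera at 1.0 m, subject 1.2 m from the camera.
    alpha = eye_line_angle(1.6, 1.0, 1.2)            # about 0.46 rad (roughly 27 degrees)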
 <About the process performed when the face moves>
 When Mr. B's face moves laterally in a situation where Mr. A and Mr. B are actually facing each other, Mr. B's view (the image reflected in Mr. B's eyes) changes along with the movement of the face. To reproduce such a change in appearance accompanying face movement in a video display system, the video displayed on the display 5 needs to be changed in conjunction with the movement of the face of the person viewing the display 5. For this reason, in the conventional video display system, as shown in FIG. 9, when the face of the person viewing the display 5 (for example, Mr. B) moves laterally, the video displayed on the display 5 switches so as to rotate about the vertical axis. Specifically, as shown in the figure, a video whose depth distance differs between its left part and its right part is displayed on the display 5 as the display video. FIG. 9 is a diagram showing a configuration example of a conventional video display system, and illustrates how the display video changes in conjunction with the movement of Mr. B, who is viewing the display 5.
 However, when Mr. B's face moves laterally in a situation where Mr. A and Mr. B are actually facing each other and conversing, the figure of Mr. A that Mr. B sees does not rotate as described above, but merely moves horizontally. Moreover, in the video display system illustrated in FIG. 9, when Mr. B's face moves laterally, both the video of Mr. A and the background video are rotated by the same amount of rotation (rotation angle). For this reason, in the video display system illustrated in FIG. 9, when Mr. B's face moves laterally, the video of Mr. A displayed on the display 5 differs from how Mr. A would appear if they were actually facing each other.
 In contrast, in the present system S, when Mr. B's face moves laterally, the process performed when the face moves is carried out, and the video (composite video) displayed on the display 5 is transitioned so as to accurately reflect how Mr. A would appear when Mr. B actually faces and looks at Mr. A. The process performed when the face moves is described below with reference to FIGS. 10A, 10B and 11. FIG. 10A is a diagram schematically showing a situation in which Mr. B's face has moved laterally. FIG. 10B is an explanatory diagram of the depth distances of Mr. A, the background and the foreground. FIG. 11 is an explanatory diagram showing the change in the composite video when the transition processing described later is executed; (A) shows the composite video before the transition processing, and (B) shows the composite video after the transition processing.
 In the following, the case in which Mr. B, who was initially standing at approximately the center of the display 5, moves laterally is described as an example. In the following description, one of the two mutually opposite directions in the width direction (that is, the left-right direction) of the display 5 is called the "first direction" and the other is called the "second direction". The relationship between the first direction and the second direction is relative: when one direction in the left-right direction is taken as the first direction, the other direction becomes the second direction. Therefore, when the leftward direction as seen from the front of the display 5 is taken as the first direction, the rightward direction becomes the second direction; conversely, when the rightward direction is taken as the first direction, the leftward direction becomes the second direction.
 While the composite video is displayed on the display 5 on Mr. B's side, the home server 1 on Mr. B's side (strictly speaking, the above-described face movement detection unit 24) detects whether Mr. B's face has moved. When the lateral movement of Mr. B's face is detected, the home server 1 on Mr. B's side simultaneously detects the direction and amount of the movement. Furthermore, the home server 1 on Mr. B's side (strictly speaking, the above-described composite video display unit 22) executes transition processing in accordance with the detection result concerning the movement of Mr. B's face. The transition processing is processing that transitions the composite video displayed on the display 5 on Mr. B's side from its state before the lateral movement of Mr. B's face was detected. Specifically, the composite video is transitioned to a state in which both the display positions of Mr. A's 3D video and the foreground 3D video in the composite video, and the range of the background 3D video included in the composite video (that is, the display range), are shifted in the left-right direction.
 Describing the transition processing in detail, in this processing a shift amount is first set for each of the display position of Mr. A's 3D video in the composite video, the display position of the foreground 3D video, and the display range of the background 3D video. Assuming that Mr. B's face has moved in the first direction by the movement amount x, each shift amount is set in accordance with the movement amount x of Mr. B's face and the distance between the camera 2 and its subject (Mr. A, the background and the foreground), that is, the depth distance. In the present system S, in setting the shift amounts, the movement amount x of Mr. B's face is converted into a movement angle. The movement angle expresses, as an angle, the amount of change in Mr. B's line-of-sight line. The line-of-sight line is a virtual straight line extending from the center position between Mr. B's eyes toward the center of the display 5.
 Referring to FIG. 10A, the line illustrated by the one-dot chain line corresponds to the line-of-sight line before Mr. B's face moves, and the line illustrated by the two-dot chain line corresponds to the line-of-sight line after the movement. The acute angle formed by the two line-of-sight lines, that is, the angle θ in FIG. 10A, corresponds to the movement angle. The line-of-sight line before Mr. B's face moves is assumed to be a line along the normal direction of the display screen of the display 5, as shown in FIG. 10A.
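 Assuming that the center position between Mr. B's eyes before and after the movement and the center of the display are available as 3D coordinates, the movement angle θ can be computed as sketched below (illustrative names, not the system's actual implementation).

    import numpy as np

    def movement_angle(face_before, face_after, display_center):
        # Acute angle between the line-of-sight lines before and after the face moves.
        # Each line runs from the center position between the viewer's eyes to the
        # center of the display; all arguments are 3D coordinates.
        v0 = np.asarray(display_center, dtype=float) - np.asarray(face_before, dtype=float)
        v1 = np.asarray(display_center, dtype=float) - np.asarray(face_after, dtype=float)
        cos_t = np.dot(v0, v1) / (np.linalg.norm(v0) * np.linalg.norm(v1))
        return float(np.arccos(np.clip(cos_t, -1.0, 1.0)))   # theta, in radians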
 In setting the shift amounts, the depth distances of Mr. A, the background (for example, the wall), and the foreground (for example, the box in front of Mr. A) are each determined. Here, during the face-to-face conversation, Mr. A's depth distance is assumed to be maintained at a position separated from the camera 2 on Mr. A's side by the reference distance d1, as shown in FIG. 10B. Meanwhile, the depth distance of the wall of the room, which is the background, is separated from the camera 2 on Mr. A's side by the distance dw, as shown in FIG. 10B. This distance dw is naturally longer than the reference distance d1, which is Mr. A's depth distance. The depth distance of the box placed in front of Mr. A, which is the foreground, is separated from the camera 2 on Mr. A's side by the distance df, as shown in FIG. 10B. This distance df is naturally shorter than the reference distance d1, which is Mr. A's depth distance.
 After the movement angle θ and the depth distances d1, dw and df of Mr. A, the background and the foreground have been determined, shift amounts are set for the display position of Mr. A's 3D video in the composite video, the display position of the foreground 3D video, and the display range of the background 3D video. Specifically, letting t1 be the shift amount for the display position of Mr. A's 3D video, the shift amount t1 is calculated by the following equation (2).
    t1 = d1 × sinθ               (2)
 Letting t2 be the shift amount for the display range of the background 3D video, the shift amount t2 is calculated by the following equation (3).
    t2 = dw × sinθ               (3)
 Letting t3 be the shift amount for the display position of the foreground 3D video, the shift amount t3 is calculated by the following equation (4).
    t3 = df × sinθ               (4)
 After the shift amounts t1, t2 and t3 are set, the composite video is transitioned to a state in which the display position of Mr. A's 3D video is shifted in the second direction by the shift amount t1, the display range of the background 3D video by the shift amount t2, and the display position of the foreground 3D video by the shift amount t3. As a result, although the composite video shown in FIG. 11(A) was initially displayed on the display 5 on Mr. B's side, the composite video gradually transitions to the state shown in FIG. 11(B) in conjunction with the lateral movement of Mr. B's face.
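 A minimal sketch of equations (2) to (4) follows, assuming the movement angle θ has already been obtained from the detected face movement; the numerical example is illustrative only.

    import math

    def shift_amounts(theta, d1, dw, df):
        # Equations (2) to (4): shift amounts for the person, background and foreground.
        t1 = d1 * math.sin(theta)    # display position of the person's 3D video
        t2 = dw * math.sin(theta)    # display range of the background 3D video
        t3 = df * math.sin(theta)    # display position of the foreground 3D video
        return t1, t2, t3

    # Example: theta = 5 degrees, person at 1.2 m, wall at 3.0 m, box at 0.8 m.
    t1, t2, t3 = shift_amounts(math.radians(5), 1.2, 3.0, 0.8)
    # Since dw > d1 > df, t2 > t1 > t3: the background shifts the most and the
    # foreground the least, which is what allows the "peeking" described below.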
 As described above, in the present system S, when Mr. B's face moves in the first direction during the period in which the composite video is displayed on the display 5 on Mr. B's side, the display position of Mr. A's 3D video in the composite video, the display position of the foreground 3D video, and the display range of the background 3D video all shift in the second direction. Furthermore, the shift amount t2 for the display range of the background 3D video is larger than the shift amount t1 for the display position of Mr. A's 3D video, and the shift amount t3 for the display position of the foreground 3D video is smaller than the shift amount t1 for the display position of Mr. A's 3D video. By transitioning the composite video to a state in which the display position of Mr. A's 3D video, the display position of the foreground 3D video, and the display range of the background 3D video are each shifted by mutually different shift amounts in this way, the display 5 on Mr. B's side comes to display a video that reflects how Mr. A would actually appear to Mr. B from the position of the face after the movement.
 To put it simply, if Mr. B were actually facing and conversing with Mr. A, then when Mr. B's face moves laterally, the things visible from Mr. B's position after the movement appear displaced from their original positions. Here, things closer to Mr. B appear displaced from their original positions by smaller amounts, and things farther away appear displaced by larger amounts. In the present system S, in order to reproduce this appearance, when it is detected that Mr. B's face has moved laterally, the composite video is transitioned so that the display position of Mr. A's 3D video, the display position of the foreground 3D video, and the display range of the background video are each shifted by different shift amounts. At this time, the shift amount t2 for the display range of the background 3D video is larger than the shift amount t1 for the display position of Mr. A's 3D video. As a result, in the composite video after the transition processing, it becomes possible to see a range of the background that was not displayed in the original composite video (the composite video before Mr. B's face moved); in other words, so-called peeking becomes possible.
 <About the process performed when the depth distance changes>
 When the face-to-face conversation is being carried out, Mr. A usually stands at a position separated from the camera 2 on Mr. A's side by the reference distance d1. At this time, when the video of Mr. A captured by the camera 2 is displayed on the display 5, the video is displayed at life size, as shown in FIG. 12. On the other hand, when Mr. A moves farther back than the above position, if the video captured by the camera 2 is displayed on the display 5 at its original size, the video is displayed at a size somewhat smaller than life size, as shown in FIG. 12. Such a change in display size arises unavoidably from the optical characteristics of the lens of the camera 2. FIG. 12 is a diagram showing a configuration example of a conventional video display system, and illustrates how the display size of Mr. A's video displayed on the display 5 becomes smaller as Mr. A's depth distance increases.
 However, in a situation where Mr. B and Mr. A are actually facing each other, even if Mr. A moves somewhat closer to or farther from Mr. B, the figure (size) of Mr. A appears to change hardly at all as seen from Mr. B's point of view. Therefore, in the present system S, the process performed when the depth distance changes is carried out in order to reproduce the actual appearance when Mr. A's depth distance changes. As a result, the display size of Mr. A's video (strictly speaking, the 3D video) displayed on the display 5 on Mr. B's side is maintained at life size even after Mr. A's depth distance has changed.
 The process performed when the depth distance changes is described below. In the following description, the case in which Mr. A's depth distance changes from the reference distance d1 to a distance d2 that is larger than the reference distance d1 is assumed. The process performed when the depth distance changes is carried out when Mr. A's depth distance changes during the period in which the composite video is displayed on the display 5 on Mr. B's side (in other words, the period in which the camera 2 on Mr. A's side is capturing Mr. A's video). Specifically, the home server 1 on Mr. B's side (strictly speaking, the above-described determination unit 23) determines during this period whether the depth distance has changed. When it is determined that the depth distance has changed, the home server 1 on Mr. B's side uses this as a trigger to start the process performed when the depth distance changes.
 In the process performed when the depth distance changes, the home server 1 on Mr. B's side (strictly speaking, the composite video display unit 22) executes adjustment processing for adjusting the display size of Mr. A's 3D video in the composite video. In the adjustment processing, the depth distance d2 after the change is first determined. Thereafter, based on the determined post-change depth distance d2, the display size of Mr. A's video is adjusted so that it becomes the display size before Mr. A's position in the depth direction changed, that is, life size. Specifically, when Mr. A's depth distance changes from d1 to d2, the adjustment processing corrects the display size of Mr. A's video (strictly speaking, each of the vertical and horizontal sizes of the video) by multiplying it by the depth-distance ratio (d1/d2).
 Thereafter, the home server 1 on Mr. B's side combines the size-corrected 3D video of Mr. A with the 3D videos of the background and the foreground, and displays the composite video on the display 5. As a result, as shown in FIGS. 13(A) and (B), even if Mr. A's depth distance changes, Mr. A's video is displayed at the display size it had before the depth distance changed. By adjusting the display size of Mr. A's 3D video in this way, so as to reflect the appearance when actually facing Mr. A, the sense of realism of the face-to-face conversation using the present system S is further improved. FIG. 13 is an explanatory diagram of the result of executing the adjustment processing; (A) in the figure shows the composite video before the depth distance changes, and (B) shows the composite video after the adjustment processing has been performed following the change in the depth distance. In FIG. 13(B), for comparison of the display size, the video of Mr. A after the depth distance has changed but before the adjustment processing is performed is shown by a broken line.
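 As a rough illustration of this process, the following sketch shows a depth-change check corresponding to the determination made by the determination unit 23 and a size correction by the depth-distance ratio (d1/d2) as described in the text; the tolerance value and the function names are assumptions.

    import numpy as np

    def depth_changed(depth_frame, person_mask, d_reference, tolerance=0.05):
        # Check whether the person's depth distance has moved away from the
        # reference distance by more than an assumed tolerance (in meters).
        d_current = float(np.median(depth_frame[person_mask]))
        return abs(d_current - d_reference) > tolerance, d_current

    def adjust_display_size(width_px, height_px, d1, d2):
        # Adjustment processing: multiply the vertical and horizontal display size
        # of the person's 3D video by the depth-distance ratio (d1/d2).
        ratio = d1 / d2
        return width_px * ratio, height_px * ratio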
 <<About the video display flow>>
 Next, of the face-to-face conversation using the present system S, the flow of a series of data processing relating to video display, that is, the video display flow, is described. The video display method of the present invention is applied in the video display flow described below. That is, in the following, the flow of a video display flow to which the video display method is applied is described as an explanation of the video display method of the present invention. In other words, each step in the video display flow described below corresponds to a component of the video display method of the present invention.
 In the following, the case in which a composite video including Mr. A's 3D video is displayed on the display 5 on Mr. B's side is described as an example. Incidentally, the procedure for displaying a composite video including Mr. B's 3D video on the display 5 on Mr. A's side is substantially the same as the procedure below.
 The video display flow proceeds as the home server 1 on Mr. B's side, which is a computer, carries out the steps shown in FIGS. 14 and 15. FIGS. 14 and 15 are diagrams showing the flow of the video display flow. Specifically, first, the home server 1 on Mr. B's side communicates with the home server 1 on Mr. A's side to receive the video data of the background video and the depth data (first depth data) about the background video (S001). Thereby, the home server 1 on Mr. B's side acquires, as the background video, the video of the room that Mr. A uses when carrying out the face-to-face conversation. At the same time, the home server 1 on Mr. B's side acquires the first depth data as data (distance data) indicating the distance between the background and the camera 2. This step S001 is performed while the camera 2 on Mr. A's side is capturing only the background video, that is, during the period when Mr. A is not in the room where the face-to-face conversation is held. The acquired background video and first depth data are stored in the hard disk drive or the like of the home server 1 on Mr. B's side.
 The home server 1 on Mr. B's side then reads, from the stored background videos and first depth data, the most recently acquired background video and first depth data, and executes texture-mapping-based processing as rendering processing using them. Thereby, the home server 1 on Mr. B's side acquires the 3D video of the background (S002).
 Meanwhile, when Mr. A enters the room for the face-to-face conversation and starts the conversation, the camera 2 installed in that room captures a video including Mr. A, the background and the foreground, that is, the real video. The home server 1 on Mr. A's side then transmits the video data of the real video captured by the camera 2, and the home server 1 on Mr. B's side receives that video data. Thereby, the home server 1 on Mr. B's side acquires the above real video. In addition, simultaneously with the transmission of the video data of the real video, the home server 1 on Mr. A's side transmits the depth data (second depth data) about the real video, and the home server 1 on Mr. B's side receives that second depth data. Thereby, the home server 1 on Mr. B's side acquires the above second depth data as a set with the real video (S003). The acquired real video and second depth data are stored in the hard disk drive or the like of the home server 1 on Mr. B's side.
 Thereafter, the home server 1 on Mr. B's side extracts the person video, specifically Mr. A's video, from the acquired real video (S004). More specifically, the home server 1 on Mr. B's side identifies Mr. A's skeleton model based on the second depth data acquired in the preceding step S003 and the video captured by the camera 2, and then extracts Mr. A's video from the real video based on that skeleton model.
 The home server 1 on Mr. B's side then executes rendering processing using the video of Mr. A extracted in the preceding step S004 and the second depth data, specifically, processing by texture mapping. Thereby, the home server 1 on Mr. B's side acquires the 3D video of the person (Mr. A) (S005).
 The home server 1 on Mr. B's side also extracts the foreground video from the real video acquired in step S003, based on the second depth data (S006). Thereafter, the home server 1 on Mr. B's side executes texture-mapping-based rendering processing using the extracted foreground video and the second depth data. Thereby, the home server 1 on Mr. B's side acquires the 3D video of the foreground (S007).
 After acquiring the 3D videos of Mr. A and of the foreground, the home server 1 on Mr. B's side combines these 3D videos with the video within a predetermined range (the display range) of the background 3D video acquired in step S002 (S008). The home server 1 on Mr. B's side then displays the composite video on the display 5 on Mr. B's side (S009). Thereby, on the display 5 on Mr. B's side, Mr. A's 3D video is displayed at life size in front of the background 3D video, and the foreground 3D video is displayed in front of Mr. A's 3D video.
 Here, of the series of steps in the video display flow described above, step S005 for acquiring the 3D video of the person is described in more detail with reference to FIG. 16. FIG. 16 is a diagram showing the procedure for acquiring the 3D video of the person. In step S005, texture mapping is first performed using the video of Mr. A extracted in the preceding step S004 and the second depth data (S011). Thereby, a video viewed from the position at which the camera 2 on Mr. A's side is installed is acquired as the 3D video of Mr. A.
 Next, the height of Mr. A's eyes is detected based on the second depth data stored by the home server 1 on Mr. B's side (S012). Thereafter, the home server 1 on Mr. B's side compares the detected height of Mr. A's eyes with the installation height of the camera 2 on Mr. A's side (S013). When the two heights differ, the home server 1 on Mr. B's side performs the process for aligning the eye-line height (S014). In this process, the home server 1 on Mr. B's side performs rendering processing for acquiring the 3D video of Mr. A as seen from a virtual viewpoint at the same height as the detected height of Mr. A's eyes. Strictly speaking, the 3D video acquired in step S011 is subjected to video processing that displaces the viewpoint by an amount corresponding to the angle α calculated by the above-described equation (1). Thereby, the 3D video of Mr. A as seen from the above virtual viewpoint, that is, the 3D video of Mr. A with the line of sight directed toward the front, can be acquired (S015).
 On the other hand, when the detected height of Mr. A's eyes matches the installation height of the camera 2 on Mr. A's side, the home server 1 on Mr. B's side uses the 3D video acquired in step S011 as it is in the subsequent steps.
 Meanwhile, in the video display flow, the home server 1 on Mr. B's side acquires the real video captured by the camera 2 on Mr. B's side (the video of Mr. B, the background and the foreground) and also acquires the depth data (second depth data) of that real video based on the measurement results from the infrared sensor 4. Based on this depth data, the home server 1 on Mr. B's side determines whether Mr. B's face has moved laterally during the period in which the composite video is displayed on the display 5 on Mr. B's side (S021). When it is determined that Mr. B's face has moved laterally, the home server 1 on Mr. B's side determines the direction and amount of the movement of the face based on the depth data before the movement and the depth data after the movement (S022).
 Furthermore, the home server 1 on Mr. B's side determines the depth distances of Mr. A, the background and the foreground based on the second depth data acquired in step S003 (S023). Thereafter, the home server 1 on Mr. B's side calculates, based on the values determined in steps S022 and S023, the shift amounts used in the transition processing executed in the next step S025 (S024). More specifically, in step S024, the shift amount t1 for the display position of Mr. A's 3D video in the composite video, the shift amount t2 for the range of the background 3D video included in the composite video (the display range), and the shift amount t3 for the display position of the foreground 3D video in the composite video are calculated in accordance with the above-described equations (2) to (4), respectively.
 The home server 1 on Mr. B's side then executes the transition processing after calculating the shift amounts (S025). By executing this transition processing, the composite video displayed on the display 5 is transitioned from its state before the lateral movement of Mr. B's face was detected. Specifically, when it is detected that Mr. B's face has moved laterally in the first direction, the home server 1 on Mr. B's side transitions the composite video, in the transition processing, to a state in which the display position of Mr. A's 3D video in the composite video, the display position of the foreground 3D video, and the display range of the background 3D video are each shifted in the second direction by the shift amounts calculated in the preceding step S024. At this time, the shift amount for the display range of the background 3D video is larger than the shift amount for the display position of Mr. A's 3D video, and the shift amount for the display position of the foreground 3D video is smaller than the shift amount for the display position of Mr. A's 3D video.
 When the transition processing is completed, the home server 1 on Mr. B's side displays on the display 5 the composite video after the transition processing, that is, the composite video in which the display position of Mr. A's 3D video, the display position of the foreground 3D video, and the display range of the background 3D video have been shifted from their initial state (S026). Thereby, the display 5 displays a video that reproduces the appearance as seen from the position of Mr. B's face after the lateral movement. As described above, in the composite video after the transition processing, the shift amount for the display range of the background 3D video is larger than the shift amount for the display position of Mr. A's 3D video. For this reason, by moving his or her face left and right, Mr. B can peek at portions of the background 3D video that were not initially displayed on the display 5.
 In addition, the home server 1 on Mr. B's side determines, based on the second depth data acquired in step S003, whether Mr. A's depth distance has changed during the period in which the composite video is displayed on the display 5 on Mr. B's side (S027). When it is determined that Mr. A's depth distance has changed, the home server 1 on Mr. B's side determines the post-change depth distance based on the second depth data after the change (S028). Thereafter, the home server 1 on Mr. B's side adjusts the display size of Mr. A's 3D video in accordance with the determined post-change depth distance (S029). At this time, the home server 1 on Mr. B's side adjusts the display size so that the 3D video of Mr. A after the depth-distance change is displayed at the display size before the depth-distance change, that is, at life size. After the adjustment of the display size is completed, the home server 1 on Mr. B's side combines the size-adjusted 3D video of Mr. A with the 3D videos of the background and the foreground, and displays the composite video on the display 5 (S030). Thereby, even after Mr. A's depth distance has changed, the 3D video of Mr. A displayed on the display 5 continues to be displayed at life size.
 <<映像表示システムの変形例>>
 上述した本システムSの構成では、各ユーザの映像を撮像するカメラ2が一台ずつ設けられていることとした。すなわち、上記の実施形態では、単一のカメラ2にてユーザの映像を撮像し、ディスプレイ5には、単一のカメラ2にて撮像された映像を元にした三次元映像を表示することとした。これに対して、互いに撮像方向から異なる複数のカメラ2にてユーザの映像を撮像すれば、より多くの視点からユーザの映像を取得することが可能となる。この結果、カメラ2の撮像映像を用いたレンダリング処理によって生成されるユーザの三次元映像については、単一のカメラ2のみでは視認され得ない死角領域をより少なくし、三次元映像を見る際の視点(仮想的な視点)の設定位置に対する自由度についても高くなる。
<< Variation of video display system >>
In the configuration of the system S described above, one camera 2 that captures each user's video is provided. That is, in the above embodiment, a single camera 2 captures a user's video, and the display 5 displays a 3D video based on the video captured by the single camera 2. did. On the other hand, if a user's image | video is imaged with the some camera 2 from which an imaging direction mutually differs, it will become possible to acquire a user's image | video from more viewpoints. As a result, for the user's 3D image generated by the rendering process using the captured image of the camera 2, the blind spot area that cannot be visually recognized only by the single camera 2 is reduced, and the 3D image is viewed. The degree of freedom with respect to the setting position of the viewpoint (virtual viewpoint) is also increased.
 以下、複数のカメラ2によってユーザの映像を撮像する構成(以下、変形例)を説明することとする。なお、以下の説明では、先に説明した構成と同様の構成についての説明を省略し、異なる構成のみについて説明することとする。また、以下では、Aさんの映像を上下2台のカメラ2にて撮像するケースを例に挙げて説明することとする。なお、カメラ2の台数、設置箇所及びそれぞれの撮像方向については、以下に説明する内容に限定されず、任意に設定することが可能である。 Hereinafter, a configuration (hereinafter, modified example) in which a plurality of cameras 2 capture a user's video will be described. In the following description, description of the same configuration as that described above will be omitted, and only a different configuration will be described. In the following, a case where the image of Mr. A is captured by the upper and lower two cameras 2 will be described as an example. In addition, about the number of cameras 2, an installation location, and each imaging direction, it is not limited to the content demonstrated below, It is possible to set arbitrarily.
 変形例では、図17に示すように、Aさんの映像を上下2台のカメラ2にて撮像する。図17は、上下2台のカメラ2にてAさんの映像を撮像する様子を模式的に示した図である。また、上下2台のカメラ2は、それぞれ、互いに異なる位置にてAさんの映像を撮像する。具体的に説明すると、上側のカメラ2は、Aさんの身長よりも幾分高い位置に設置されており、下側のカメラ2は、床面よりも若干上方に設置されている。 In the modified example, as shown in FIG. 17, the image of Mr. A is picked up by the upper and lower two cameras 2. FIG. 17 is a diagram schematically illustrating a state in which the image of Mr. A is captured by the two upper and lower cameras 2. Further, the upper and lower two cameras 2 respectively capture the image of Mr. A at different positions. Specifically, the upper camera 2 is installed at a position somewhat higher than Mr. A's height, and the lower camera 2 is installed slightly above the floor surface.
 また、変形例では、ディスプレイ5の映像表示画面(厳密にはタッチパネル5aの前面)を基準面としており、上下2台のカメラ2のそれぞれの撮像方向は、基準面の法線方向に対して鉛直方向に傾いている。撮像方向とは、カメラ2のレンズの光軸方向のことであり、上側のカメラ2の撮像方向は、Aさんに近付くにつれて下降する方向に設定されている。つまり、上側のカメラ2は、Aさんの身体を上方から撮像する。他方、下側のカメラ2の撮像方向は、Aさんに近付くにつれて上昇する方向に設定されている。つまり、下側のカメラ2は、Aさんの身体を下方から撮像する。 In the modification, the video display screen of the display 5 (strictly, the front surface of the touch panel 5a) is used as a reference plane, and the imaging directions of the two upper and lower cameras 2 are perpendicular to the normal direction of the reference plane. Tilt in the direction. The imaging direction is the optical axis direction of the lens of the camera 2, and the imaging direction of the upper camera 2 is set to a direction that descends as approaching Mr. A. That is, the upper camera 2 images Mr. A's body from above. On the other hand, the imaging direction of the lower camera 2 is set to a direction that rises as it approaches Mr. A. That is, the lower camera 2 images Mr. A's body from below.
 また、変形例に係る対面対話において、Aさんは、上記の基準位置から基準距離d1だけ離れた位置に立っている。かかる位置にAさんが立っているとき、上側のカメラ2は、Aさんの頭部から腰部までの映像(以下、上半身映像)を撮像し、下側のカメラ2は、Aさんの足から腹部までの映像(以下、下半身映像)を撮像する。さらに、変形例では、カメラ2毎に赤外線センサ4が設けられている。これにより、上下2台のカメラ2の各々が撮像する映像(実映像)について、深度データ(厳密には第2深度データ)を個別に取得することが可能となる。 Also, in the face-to-face conversation according to the modification, Mr. A stands at a position separated from the reference position by the reference distance d1. When Mr. A stands at such a position, the upper camera 2 captures an image from the head of A to the waist (hereinafter referred to as an upper body image), and the lower camera 2 captures the abdomen from the foot of Mr. A. The previous video (hereinafter, lower body video) is captured. Furthermore, in the modification, an infrared sensor 4 is provided for each camera 2. Thereby, it is possible to individually acquire the depth data (strictly, the second depth data) for the video (actual video) captured by each of the upper and lower cameras 2.
 一方、変形例において、Bさん側のホームサーバ1は、カメラ2別にAさんの映像を取得する。具体的に説明すると、Aさん側のホームサーバ1は、上側のカメラ2が撮像した上半身映像を含む実映像の映像データと、下側のカメラ2が撮像した下半身映像を含む実映像の映像データと、を送信する。Bさん側のホームサーバ1は、これらの映像データを取得し、それぞれの映像データが示す実映像の中からAさんの映像、具体的には上半身映像や下半身映像を抽出する。 On the other hand, in the modified example, Mr. B's home server 1 acquires Mr. A's video for each camera 2. More specifically, the home server 1 on the side of Mr. A has the video data of the real video including the upper body video captured by the upper camera 2 and the video data of the real video including the lower body video captured by the lower camera 2. And send. Mr. B's home server 1 acquires these video data, and extracts the video of Mr. A, specifically the upper body video and the lower body video, from the actual video indicated by each video data.
Also in the modified example, Mr. B's home server 1 receives the depth data for the actual video captured by each camera 2 from Mr. A's home server 1 on a per-camera basis. That is, Mr. B's home server 1 acquires, for each camera, the depth data for the actual video including Mr. A's upper-body video or lower-body video. Furthermore, Mr. B's home server 1 (strictly, the 3D video generation unit 21) performs a step of generating a 3D video piece for each camera based on the actual video and depth data acquired for that camera, that is, the video piece generation step.
Specifically, in the video piece generation step, Mr. B's home server 1 performs a rendering process using Mr. A's upper-body video obtained from the actual video captured by the upper camera 2 together with the depth data for that actual video. This yields a 3D video piece as seen from the imaging direction of the upper camera 2, specifically the 3D video piece of Mr. A's upper body shown in FIG. 18. Similarly, in the video piece generation step, Mr. B's home server 1 performs a rendering process using Mr. A's lower-body video obtained from the actual video captured by the lower camera 2 together with the depth data for that actual video. This yields a 3D video piece as seen from the imaging direction of the lower camera 2, specifically the 3D video piece of Mr. A's lower body shown in FIG. 18. FIG. 18 shows the 3D video pieces generated for each camera and the 3D video of Mr. A generated in the combining step described later.
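The specification does not describe the rendering implementation; as one possible reading, a per-camera 3D video piece can be pictured as a textured point set obtained by back-projecting each color frame with its depth data. The sketch below assumes a per-pixel depth map aligned with the color frame and known pinhole intrinsics (fx, fy, cx, cy); all names are illustrative and not taken from the specification.

```python
import numpy as np

def depth_frame_to_video_piece(color, depth, fx, fy, cx, cy):
    """Back-project a color + depth frame into a textured point set,
    one possible realization of a per-camera "3D video piece"."""
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    valid = depth > 0                                   # pixels with a depth reading
    z = depth[valid]
    x = (us[valid] - cx) * z / fx                       # pinhole back-projection
    y = (vs[valid] - cy) * z / fy
    points = np.stack([x, y, z], axis=1)                # camera-space coordinates
    colors = color[valid]                               # per-point texture sample
    return points, colors
```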
In the modified example, the installation height of the camera 2 that captures the part of the video including Mr. A's eyes (that is, the upper camera 2) differs from the height of Mr. A's eyes. For this reason, the eye-height alignment process described earlier is performed during the video piece generation step when generating the 3D video piece of the part including Mr. A's eyes (specifically, the upper-body 3D video piece). In other words, in the modified example, a rendering process is executed to obtain the upper-body 3D video piece as seen from a virtual viewpoint located at the height of Mr. A's eyes. Hereinafter, the procedure for acquiring Mr. A's 3D video in the modified example will be described with reference to FIG. 19, which shows that procedure.
In the video display flow according to the modified example, in order to generate Mr. A's 3D video, Mr. B's home server 1 first performs the video piece generation step (S041). In this step, Mr. B's home server 1 generates the 3D video pieces of Mr. A's upper body and lower body by executing rendering processes using texture mapping (S042, S043). Specifically, when generating the upper-body 3D video piece, Mr. B's home server 1 first generates the upper-body 3D video piece as seen from the upper camera 2. It then determines the difference between the installation height of the upper camera 2 and the height of Mr. A's eyes, as well as the distance (depth distance) between Mr. A and the upper camera 2. Based on these results, Mr. B's home server 1 obtains the rotation angle α used in the subsequent video processing, and applies to the upper-body 3D video piece generated in the previous step a video process that displaces the viewpoint by an amount corresponding to the rotation angle α. As a result, the upper-body 3D video piece of Mr. A is obtained as seen from a virtual viewpoint at the height of Mr. A's eyes, that is, an upper-body 3D video piece in which Mr. A's gaze faces the front.
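The specification names the inputs to the rotation angle α (the camera-to-eye height difference and the depth distance) but not its formula; a natural assumption is the angle subtended by that height difference at that distance. The sketch below encodes this assumed relation and the corresponding viewpoint displacement as a rotation about a horizontal axis; the function and parameter names are illustrative.

```python
import numpy as np

def eye_height_rotation_angle(camera_height, eye_height, depth_distance):
    """Rotation angle α inferred from the camera-to-eye height difference
    and the camera-to-user depth distance (assumed interpretation)."""
    return np.arctan2(camera_height - eye_height, depth_distance)

def displace_viewpoint(points, alpha):
    """Rotate a camera-space point set about the horizontal axis so it
    appears as seen from the virtual viewpoint at the user's eye height."""
    c, s = np.cos(alpha), np.sin(alpha)
    rot_x = np.array([[1.0, 0.0, 0.0],
                      [0.0,   c,  -s],
                      [0.0,   s,   c]])
    return points @ rot_x.T
```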
Also, during the video piece generation step, when generating the lower-body 3D video piece, Mr. B's home server 1 applies a video rotation process to the 3D video piece obtained from the actual video captured by the lower camera 2. Specifically, the lower camera 2 captures the video of Mr. A's lower body from an imaging direction that differs from the normal direction of the display screen of the display 5, which serves as the reference plane. Mr. B's home server 1 then performs texture mapping using the actual video captured by the lower camera 2 (that is, the video captured in that imaging direction) and the depth data for that actual video, thereby generating the 3D video piece of Mr. A's lower body. The 3D video piece generated at this stage is the one seen from the imaging direction of the lower camera 2.
Mr. B's home server 1 then applies a video rotation process to this 3D video piece seen from the imaging direction of the lower camera 2. The video rotation process converts the 3D video piece seen from the imaging direction of the lower camera 2 into the 3D video piece that would be seen virtually from the normal direction of the display screen of the display 5, that is, the reference plane. Specifically, the degree of tilt of the imaging direction of the lower camera 2 with respect to that normal direction is specified as an angle (tilt angle), and the 3D video piece is rotated by that tilt angle. As a result, the 3D video piece of Mr. A's lower body is obtained as seen from the normal direction of the reference plane. Note that this video rotation process can be realized by known video processing.
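The video rotation process can be illustrated with the same rotation machinery as the previous sketch, this time driven by the known tilt angle of the lower camera 2 relative to the reference-plane normal. This is merely one straightforward realization consistent with the description (a rotation about the horizontal axis), not the patented implementation itself.

```python
import numpy as np

def rotate_to_reference_normal(points, tilt_angle):
    """Re-express a point set captured from a tilted camera as if it were
    viewed from the normal direction of the reference plane (display)."""
    c, s = np.cos(-tilt_angle), np.sin(-tilt_angle)   # undo the camera tilt
    rot_x = np.array([[1.0, 0.0, 0.0],
                      [0.0,   c,  -s],
                      [0.0,   s,   c]])
    return points @ rot_x.T
```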
After acquiring the upper-body and lower-body 3D video pieces, Mr. B's home server 1 performs a combining step in which the 3D video pieces are joined together to generate Mr. A's 3D video (S044). In this combining step, the upper-body and lower-body 3D video pieces are joined so that the common video region contained in both pieces (specifically, the region showing Mr. A's abdomen) overlaps. When joining the video pieces, the portion of the upper-body 3D video piece below the abdomen is discarded, and the portion of the lower-body 3D video piece above the abdomen is discarded.
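One way to picture the combining step is to trim each piece at a seam lying inside the shared abdominal region and concatenate what remains. The sketch below does this for textured point sets; the seam height and the assumption that the vertical coordinate increases downward (image convention) are illustrative choices, not values given in the specification.

```python
import numpy as np

def combine_pieces(upper_points, upper_colors, lower_points, lower_colors, seam_y):
    """Join the upper- and lower-body pieces at a seam inside the shared
    abdominal region, discarding the redundant half of each piece
    (y is assumed to increase downward, as in image coordinates)."""
    keep_upper = upper_points[:, 1] <= seam_y   # upper piece: keep points above the seam
    keep_lower = lower_points[:, 1] > seam_y    # lower piece: keep points below the seam
    points = np.concatenate([upper_points[keep_upper], lower_points[keep_lower]])
    colors = np.concatenate([upper_colors[keep_upper], lower_colors[keep_lower]])
    return points, colors
```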
When the combining step is completed, Mr. A's 3D video is finished (S045). As shown in FIG. 18, this 3D video shows Mr. A as seen from the front (in other words, from the normal direction of the reference plane). Thereafter, Mr. B's home server 1 (strictly, the composite video display unit 22) composites Mr. A's 3D video obtained by the above procedure with the respective 3D videos of the background and the foreground, and displays the composite video on the display 5. At this point, the video near the joined portion of the 3D video pieces in Mr. A's 3D video (specifically, near the abdomen) is displayed without any sense of incongruity.
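The layering of the composite video, with the foreground in front of the user and the user in front of the background, can be realized by simple back-to-front blending once each layer has been projected to a 2D image with a mask. The sketch below assumes such per-layer images and 0/1 masks, which the specification does not detail.

```python
import numpy as np

def composite_layers(background, person, person_mask, foreground, foreground_mask):
    """Back-to-front compositing of the three layers: background first,
    then the user's 3D video, then the foreground (masks are 0/1 maps)."""
    out = background.astype(float)
    pm = person_mask[..., None].astype(float)
    fm = foreground_mask[..., None].astype(float)
    out = person * pm + out * (1.0 - pm)
    out = foreground * fm + out * (1.0 - fm)
    return out.astype(np.uint8)
```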
To put it plainly, suppose that the upper-body 3D video piece obtained directly from the actual video captured by the upper camera 2 and its depth data were simply joined to the lower-body 3D video piece obtained directly from the actual video captured by the lower camera 2 and its depth data. If the 3D video of Mr. A obtained in that case were displayed on the display 5, the vicinity of the portion where the 3D video pieces are joined would appear bent (that is, Mr. A would appear to be leaning slightly forward despite standing upright). In contrast, in this modified example, the eye-height alignment process is performed when generating the upper-body 3D video piece, and when generating the lower-body 3D video piece, the depth data is converted into data for the video as seen from the normal direction of the reference plane and the 3D video piece is generated based on the converted depth data. Consequently, in Mr. A's 3D video obtained by joining the 3D video pieces, the sense of incongruity in which the vicinity of the joined portion appears bent can be suppressed.
In this modified example, the plurality of cameras 2 (specifically, two cameras 2) are arranged one above the other, but the arrangement is not limited to this. For example, two cameras 2 may be arranged side by side. In that case as well, 3D video pieces (specifically, 3D video pieces of the left half and the right half of the body) are generated by the same procedure as above, and the 3D video pieces are joined together to generate Mr. A's 3D video.
<< Other Embodiments >>
In the above embodiments, the video display system and the video display method of the present invention have been described using specific examples. However, the above embodiments are merely examples intended to facilitate understanding of the present invention and do not limit it. The present invention may be modified and improved without departing from its gist, and equivalents thereof are of course included in the present invention.
In the above embodiments, the case in which two users (Mr. A and Mr. B) hold a face-to-face conversation through the present system S has been described as an example. However, the present invention is not limited to this, and the number of people who can hold a face-to-face conversation at the same time may be three or more.
In the above embodiments, the series of steps relating to video display, strictly speaking, the step of generating 3D videos for a user (for example, Mr. A) and for each of the background and the foreground and compositing those 3D videos, is performed by the home server 1 on the second user's side (for example, Mr. B's side). However, the present invention is not limited to this, and the series of steps may instead be performed by the home server 1 on the user's (Mr. A's) side.
In the above embodiments, the video of the space corresponding to the background, captured while no user is present in that space, is used as the background video. However, the present invention is not limited to this. For example, the person video and the background video may each be separated from the actual video captured when the camera 2 images the user and the background at the same time, and the separated background video may be used. In that case, the portion of the background video that overlaps the person video is missing, so it must be complemented. By contrast, if the background video captured while no user is present is used, there is no such missing portion, and the background video can be acquired more easily because no video complementation is required.
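As an illustration of the alternative described above, the person video and background video could be separated from a single captured frame by thresholding the depth data, leaving holes behind the person that would then require complementation. The depth threshold and any hole-filling method are assumptions made purely for illustration, not part of the specification.

```python
import numpy as np

def separate_person_and_background(frame, depth, person_max_depth):
    """Split one captured frame into a person layer and a background layer
    using a depth threshold; the background pixels hidden behind the person
    are left as holes (zeros) that would still need complementation."""
    person_mask = (depth > 0) & (depth <= person_max_depth)
    person = np.where(person_mask[..., None], frame, 0)
    background = np.where(person_mask[..., None], 0, frame)
    return person, background, person_mask
```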
In the above embodiments, in the transition process executed when movement of the second user's face is detected, both the display position of the user's 3D video in the composite video and the range of the background 3D video included in the composite video (the display range) are shifted. However, the present invention is not limited to this; only one of the display position of the user's 3D video and the display range of the background 3D video may be shifted while the other is fixed (not shifted).
In the above embodiments, in the transition process, the shift amount increases in the order of the display position of the foreground 3D video, the display position of Mr. A's 3D video, and the display range of the background 3D video. However, the relative magnitudes of the shift amounts may differ from this; that is, the shift amount may increase in the order of the display range of the background 3D video, the display position of Mr. A's 3D video, and the display position of the foreground video. More specifically, when the composite video shown in FIG. 20(A) is initially displayed on Mr. B's display 5 and Mr. B's face then moves sideways, a second transition process is executed, and as a result the composite video gradually transitions to the state shown in FIG. 20(B). FIG. 20 is an explanatory diagram of the second transition process, with (A) showing the composite video before the second transition process and (B) showing the composite video after it.
Incidentally, the transition process described earlier (that is, the transition process shown in FIG. 11) and the second transition process shown in FIG. 20 differ in the direction of the gaze of Mr. B looking at the display 5, strictly speaking, in the object his gaze is directed at. To put it simply, if Mr. B were actually facing Mr. A and Mr. B's face moved sideways while his gaze was directed at Mr. A, objects farther from Mr. B would appear displaced from their original positions by larger amounts. To reproduce this appearance, in the transition process described earlier, that is, the one shown in FIG. 11, the shift amount increases in the order of the display position of the foreground 3D video, the display position of Mr. A's 3D video, and the display range of the background 3D video. Conversely, if Mr. B's face moves sideways while his gaze is directed at Mr. A's background, objects closer to Mr. B appear displaced from their original positions by larger amounts. To reproduce this appearance, in the second transition process the shift amount increases in the order of the display range of the background 3D video, the display position of Mr. A's 3D video, and the display position of the foreground video.
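The two orderings of the shift amounts can be summarized as motion parallax relative to the layer Mr. B is looking at: with the gaze on Mr. A, farther layers shift more, and with the gaze on the background, nearer layers shift more. The sketch below encodes exactly this rule; the depth values and gain are illustrative and not part of the specification.

```python
def transition_shifts(face_displacement, layer_depths, gaze="person", gain=0.1):
    """On-screen shift per layer in the transition process.

    gaze="person":     layers farther from the viewer shift more (FIG. 11).
    gaze="background": layers nearer to the viewer shift more (second process).
    Depth values and gain are illustrative only."""
    if gaze == "person":
        weights = [d for d in layer_depths]        # farther -> larger shift
    else:
        weights = [1.0 / d for d in layer_depths]  # nearer -> larger shift
    return [gain * face_displacement * w for w in weights]

# Layers ordered (foreground, Mr. A, background), depths in arbitrary units.
print(transition_shifts(1.0, [1.0, 2.0, 4.0], gaze="person"))      # shifts increase
print(transition_shifts(1.0, [1.0, 2.0, 4.0], gaze="background"))  # shifts decrease
```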
Note that the execution mode of the transition process may be switchable between a mode in which the shift amount of the display range of the background 3D video is largest (corresponding to the transition process described earlier) and a mode in which the shift amount of the display position of the foreground 3D video is largest (corresponding to the second transition process). In that case, the transition process is executed appropriately in accordance with the direction of Mr. B's gaze at that time.
1 Home server
2 Camera (imaging device)
3 Microphone
4 Infrared sensor
4a Light emitting unit
4b Light receiving unit
5 Display
5a Touch panel
6 Speaker
11 Data transmission unit
12 Data reception unit
13 Background video storage unit
14 First depth data storage unit
15 Actual video storage unit
16 Person video extraction unit
17 Skeleton model storage unit
18 Second depth data storage unit
19 Foreground video extraction unit
20 Height detection unit
21 3D video generation unit
22 Composite video display unit
23 Determination unit
24 Face movement detection unit
100 Communication unit
GN External communication network
S Present system

Claims (8)

  1.  A video display system comprising:
     a video acquisition unit that acquires a video of a user captured by an imaging device;
     a distance data acquisition unit that, for each of a predetermined number of video pieces into which the video is divided, acquires distance data indicating the distance from the imaging device to an object in that video piece;
     a 3D video generation unit that generates a 3D video of the user by executing a rendering process using the video of the user and the distance data; and
     a height detection unit that detects the height of the user's eyes,
     wherein, when the height at which the imaging device is installed and the eye height detected by the height detection unit differ from each other, the 3D video generation unit executes, based on the difference between the two and the distance between the imaging device and the user, the rendering process for acquiring the 3D video of the user as seen from a virtual viewpoint at the eye height detected by the height detection unit.
  2.  The video display system according to claim 1, wherein:
     the video acquisition unit acquires the video of the user captured by the imaging device and a video of a background captured by the imaging device;
     the distance data acquisition unit acquires the distance data for each of the video of the user and the video of the background;
     the 3D video generation unit generates the 3D video of the user by executing the rendering process using the video of the user and the distance data acquired for the video of the user, and generates a 3D video of the background by executing the rendering process using the video of the background and the distance data acquired for the video of the background; and
     the video display system further comprises a composite video display unit that composites the 3D video of the user with the 3D video of the background and causes a display to display a composite video in which the user is positioned in front of the background.
  3.  The video display system according to claim 2, wherein:
     the video acquisition unit further acquires a video of a foreground captured by the imaging device;
     the distance data acquisition unit further acquires the distance data for the video of the foreground;
     the 3D video generation unit further generates a 3D video of the foreground by executing the rendering process using the video of the foreground and the distance data acquired for the video of the foreground; and
     the composite video display unit composites the 3D video of the user, the 3D video of the background, and the 3D video of the foreground, and causes the display to display the composite video in which the user is positioned in front of the background and the foreground is positioned in front of the user.
  4.  The video display system according to claim 2 or 3, further comprising a determination unit that determines, based on the distance data, whether the distance between the imaging device and the user has changed,
     wherein, when the determination unit determines that the distance between the imaging device and the user has changed while the imaging device is capturing the video of the user, the composite video display unit adjusts the display size of the video of the user in the composite video so that it remains the display size before the distance between the imaging device and the user changed.
  5.  The video display system according to any one of claims 2 to 4, further comprising a face movement detection unit that detects that the face of a second user viewing the composite video displayed on the display has moved in the width direction of the display,
     wherein, when the face movement detection unit detects the movement of the face, the composite video display unit executes a transition process that transitions the composite video displayed on the display from its state before the face movement detection unit detected the movement of the face, and in the transition process, the composite video is transitioned to a state in which one of the display position of the 3D video of the user in the composite video and the range of the 3D video of the background included in the composite video is shifted along the width direction by a shift amount larger than the shift amount of the other.
  6.  The video display system according to any one of claims 2 to 5, wherein:
     the video acquisition unit acquires, for each of a plurality of imaging devices that capture the video of the user from mutually different imaging directions, the video of the user captured by that imaging device;
     the distance data acquisition unit acquires the distance data for the video of the user for each imaging device;
     the 3D video generation unit performs
     a video piece generation step of generating, for each imaging device, a 3D video piece of the user based on the video of the user acquired for that imaging device and the distance data acquired for that imaging device, and
     a combining step of combining the 3D video pieces of the user generated for the respective imaging devices so that common video regions contained in the respective pieces overlap each other, in order to generate the 3D video of the user; and
     when generating, in the video piece generation step, the 3D video piece of a portion including the user's eyes, if the two heights differ, the 3D video generation unit executes, based on the difference between the two and the distance between the imaging device and the user, the rendering process for acquiring that 3D video piece as seen from the virtual viewpoint.
  7.  The video display system according to claim 6, wherein, when the imaging direction differs from the normal direction of a reference plane, the 3D video generation unit, in the video piece generation step, converts the 3D video piece of the user generated based on the video captured in that imaging direction into the 3D video piece as virtually seen from the normal direction.
  8.  A video display method comprising:
     acquiring, by a computer, a video of a user captured by an imaging device;
     acquiring, by the computer, for each of a predetermined number of video pieces into which the video is divided, distance data indicating the distance from the imaging device to an object in that video piece;
     generating, by the computer, a 3D video of the user by executing a rendering process using the video of the user and the distance data; and
     detecting, by the computer, the height of the user's eyes,
     wherein, when the height at which the imaging device is installed and the detected eye height differ from each other, the computer executes, based on the difference between the two and the distance between the imaging device and the user, the rendering process for acquiring the 3D video of the user as seen from a virtual viewpoint at the detected eye height.
PCT/JP2016/060533 2015-03-31 2016-03-30 Image display system and image display method WO2016159166A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2015-071764 2015-03-31
JP2015071764A JP6461679B2 (en) 2015-03-31 2015-03-31 Video display system and video display method

Publications (1)

Publication Number Publication Date
WO2016159166A1 true WO2016159166A1 (en) 2016-10-06

Family

ID=57004772

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2016/060533 WO2016159166A1 (en) 2015-03-31 2016-03-30 Image display system and image display method

Country Status (2)

Country Link
JP (1) JP6461679B2 (en)
WO (1) WO2016159166A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018192455A1 (en) * 2017-04-18 2018-10-25 杭州海康威视数字技术股份有限公司 Method and apparatus for generating subtitles

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10963104B2 (en) 2017-02-06 2021-03-30 Flatfrog Laboratories Ab Optical coupling in touch-sensing systems
JP7020115B2 (en) * 2017-12-28 2022-02-16 凸版印刷株式会社 Selfie devices, methods, and programs in VR space
WO2020153890A1 (en) * 2019-01-25 2020-07-30 Flatfrog Laboratories Ab A videoconferencing terminal and method of operating the same
JP2023512682A (en) 2020-02-10 2023-03-28 フラットフロッグ ラボラトリーズ アーベー Improved touch detector

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090066786A1 (en) * 2004-05-10 2009-03-12 Humaneyes Technologies Ltd. Depth Illusion Digital Imaging
JP2014071871A (en) * 2012-10-02 2014-04-21 Nippon Telegr & Teleph Corp <Ntt> Video communication system and video communication method
WO2014100250A2 (en) * 2012-12-18 2014-06-26 Nissi Vilcovsky Devices, systems and methods of capturing and displaying appearances

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040075735A1 (en) * 2002-10-17 2004-04-22 Koninklijke Philips Electronics N.V. Method and system for producing a pseudo three-dimensional display utilizing a two-dimensional display device
JP4461739B2 (en) * 2003-08-18 2010-05-12 ソニー株式会社 Imaging device
JP2009500878A (en) * 2005-04-11 2009-01-08 ヒューマンアイズ テクノロジーズ リミテッド Depth illusion digital imaging

Also Published As

Publication number Publication date
JP6461679B2 (en) 2019-01-30
JP2016192688A (en) 2016-11-10

Similar Documents

Publication Publication Date Title
JP6496172B2 (en) Video display system and video display method
WO2016159166A1 (en) Image display system and image display method
US9538160B1 (en) Immersive stereoscopic video acquisition, encoding and virtual reality playback methods and apparatus
US20110316853A1 (en) Telepresence systems with viewer perspective adjustment
US20210065432A1 (en) Method and system for generating an image of a subject in a scene
US8847956B2 (en) Method and apparatus for modifying a digital image
CN109477966A (en) The head-mounted display for virtual reality and mixed reality with interior-external position tracking, user's body tracking and environment tracking
JPH08237629A (en) System and method for video conference that provides parallax correction and feeling of presence
US20130113701A1 (en) Image generation device
WO2017094543A1 (en) Information processing device, information processing system, method for controlling information processing device, and method for setting parameter
KR102279143B1 (en) Layered augmented entertainment experiences
JP2014511049A (en) 3D display with motion parallax
US20170237941A1 (en) Realistic viewing and interaction with remote objects or persons during telepresence videoconferencing
JP5833526B2 (en) Video communication system and video communication method
WO2017141584A1 (en) Information processing apparatus, information processing system, information processing method, and program
CN103747236A (en) 3D (three-dimensional) video processing system and method by combining human eye tracking
JP5478357B2 (en) Display device and display method
US10404964B2 (en) Method for processing media content and technical equipment for the same
JP2014182597A (en) Virtual reality presentation system, virtual reality presentation device, and virtual reality presentation method
WO2016159165A1 (en) Image display system and image display method
US20230179756A1 (en) Information processing device, information processing method, and program
WO2017043662A1 (en) Image display system and image display method
JP6595591B2 (en) Method for collecting image data for the purpose of generating immersive video and spatial visualization method based on those image data
US20230122149A1 (en) Asymmetric communication system with viewer position indications
JP6550307B2 (en) Image display system and image display method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16773047

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16773047

Country of ref document: EP

Kind code of ref document: A1