WO2016159165A1

WO2016159165A1 - Image display system and image display method

Info

Publication number: WO2016159165A1
Application number: PCT/JP2016/060532
Authority: WO
Inventors: 康夫高橋; 吏中野; 貴司折目; 雄一郎竹内; 暦本　純一; 宮島　靖
Original assignee: 大和ハウス工業株式会社; ソニー株式会社
Priority date: 2015-03-31
Filing date: 2016-03-30
Publication date: 2016-10-06
Also published as: JP2016192687A

Abstract

With the present invention, even if there is a missing part in an image of a user captured by a camera, the effect of the missing part on a displayed image is lessened. This image display system is provided with a first unit utilized by a first user, and a second unit utilized by a second user. A camera of the first unit captures an image of the first user, and the image is displayed on a display screen of the second unit. Either the first unit or the second unit specifies a missing part which is a part of the body of the first user not captured by the camera since being outside an imaging range, and combines the image captured by the camera with another image to create a composite image. The other image is formed by at least one of a virtual image of the missing part, and a virtual image of an object located ahead of the first user in the composite image. In addition, the composite image is configured to cause the other image to be displayed in an area corresponding to the relative position of the missing part with respect to the body in the displayed image.

Description

Video display system and video display method

The present invention relates to a video display system and a video display method, and more particularly to a video display system and a video display method used for displaying a video of the other user on a display screen provided on one user side.

Video display systems are already known that allow users in spaces apart from each other to converse, by means of communication technology, while watching each other's video. In such a system, video data obtained by capturing video of one user is transmitted, and the video data is received and expanded on the other user's side. As a result, the video of one user is displayed on the display screen of the other user.

In recent years, a technique for reconstructing a video indicated by video data received from one user and displaying the reconstructed video has been developed. For example, Patent Document 1 discloses that a main subject image and a background image are extracted from an original image, and each extracted image is edited and then combined to obtain a new image. Here, the editing of the extracted video is to complement a missing part, specifically, to complement a part of the background video that overlaps the video of the subject to obtain a complete video.

Patent Document 2 is another example related to such reconstruction. Patent Document 2 discloses that a predetermined partial video (for example, a video in which the person who is the subject is properly shown) is cut out from continuously captured video, and that correction or the like is performed on the cut-out partial video. As a result, only video in which the person who is the subject appears can be displayed. In other words, the person who is the subject is always displayed on the display screen.

Patent Document 1: JP 2012-133593 A
Patent Document 2: JP 2012-27339 A

In a dialogue realized by the above-described video display system, one conceivable way to improve realism is to display the video of the conversation partner (specifically, a whole-body video) at life size. However, when video of the conversation partner is captured by an imaging device such as a camera and the partner is standing relatively close to the imaging device, the partner goes partly out of frame and a part of the body is not captured. If video captured in such a situation is displayed at life size, the conversation partner appears on the display screen with a part of the body missing because it was cut off. As a result, the realism of the dialogue is impaired.

The above-described video reconstruction (that is, the techniques described in Patent Documents 1 and 2) is effective only within the imaging range of the imaging device, and is difficult to use to make up for a partial loss of the video caused by a part of the subject being outside the imaging range. A partial loss of the video can also be compensated for by using a plurality of imaging devices, so that video outside the imaging range of one imaging device is covered by another imaging device. However, using a plurality of imaging devices increases the system construction cost.

The present invention has therefore been made in view of the above problems, and its object is to provide a video display system and a video display method capable of realizing a realistic dialogue even when a part of the user's body is outside the imaging range of the imaging device when video of the user is captured.

According to the video display system of the present invention, the above problem is solved by a video display system that has a first unit used by a first user and a second unit used by a second user in a space different from that of the first user, and that is used for displaying video of the first user on a display screen provided on the second user's side. The first unit includes an imaging unit that captures video of a subject within an imaging range. The second unit includes a video display unit that displays video on the display screen. One of the first unit and the second unit includes a specifying unit and a composite video generation unit. When the imaging unit captures video of the first user, who is the subject, in a state where a part of the first user's body is outside the imaging range, the specifying unit specifies that part. The composite video generation unit generates a composite video by combining the video captured by the imaging unit in the state where the part is outside the imaging range with another video. The other video is composed of at least one of a virtual video of the part, generated based on video captured by the imaging unit while the part was within the imaging range, and a virtual video of an object positioned in front of the first user in the composite video. The composite video generation unit generates the composite video so that, when the video display unit displays the composite video, the other video is displayed in an area of the display screen corresponding to the relative position of the part with respect to the body.

In the above configuration, when the imaging unit captures video of the first user while a part of the first user's body is outside the imaging range, that part is specified and a composite video is generated by combining the video captured by the imaging unit with another video. Here, the other video is composed of at least one of a virtual video of the part that was not captured and a virtual video of an object located in front of the first user. When the composite video is displayed on the display screen, the other video is displayed in an area of the display screen corresponding to the relative position of the missing part (the part outside the imaging range) with respect to the body. Thus, according to the video display system of the present invention, the video of the missing part that was not captured is complemented by a virtual video, so the realism of the dialogue can be kept from being impaired by the missing part.

Preferably, in the above video display system, one of the units includes a skeleton information acquisition unit that acquires skeleton information indicating the positional relationship between the head and parts other than the head in the body of the first user, and the specifying unit specifies the part based on the skeleton information and the video captured by the imaging unit in the state where the part is outside the imaging range.
With this configuration, the missing part is specified based on the skeleton information, so it can be specified more accurately.
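As a rough illustration of the skeleton-based identification described above, the following sketch assumes that the skeleton information is held as the vertical offset of each body part below the head, and that the captured video tells us how far below the head the imaging range extends. The part names, the offset values, and the function `find_missing_parts` are hypothetical and are not taken from this specification.

```python
from typing import Dict, List

# Hypothetical skeleton information: vertical offset (in metres) of each body
# part below the head. The values are illustrative assumptions only.
SKELETON_OFFSETS: Dict[str, float] = {
    "head": 0.0,
    "shoulders": 0.25,
    "hips": 0.80,
    "knees": 1.25,
    "feet": 1.65,
}


def find_missing_parts(offsets: Dict[str, float], visible_below_head_m: float) -> List[str]:
    """Return the parts that lie below the lower edge of the imaging range.

    visible_below_head_m is the distance from the head down to the lower edge
    of the imaging range, as estimated from the captured video.
    """
    return [part for part, offset in offsets.items() if offset > visible_below_head_m]


if __name__ == "__main__":
    # The imaging range ends 1.0 m below the head in this example.
    print(find_missing_parts(SKELETON_OFFSETS, 1.0))
```

Because the positions of the other parts are expressed relative to the head, only the head needs to be located in the captured video for the missing parts to be inferred.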

More preferably, in the above video display system, when the imaging unit captures video of the first user performing a moving motion in a state where the legs of the first user are outside the imaging range, the composite video generation unit generates the composite video using, as the other video, a virtual video of the legs during the moving motion.
With this configuration, after the legs of the first user move outside the imaging range of the imaging unit as a result of the moving motion, the video of the missing legs is complemented by a virtual video of the legs during the moving motion (that is, a video that reproduces the movement of the legs during the moving motion). This makes it possible to deal appropriately with a case where a part of the body goes out of frame during the moving motion.

More preferably, in the above video display system, the second unit includes a screen forming device that forms the display screen, and while the display screen is not formed, the screen forming device presents the appearance of a door, a window, or a full-length mirror provided in the room where the second user is present.
In this configuration, the screen forming device presents the appearance of a door, a window, or a full-length mirror provided in the room where the second user is located while the display screen is not formed. The screen forming device is therefore inconspicuous during periods in which no dialogue is performed, and its presence is difficult to notice. When the display screen is formed and the video of the first user is displayed, the second user obtains a visual effect as if talking with the first user through glass. As a result, the realism of the dialogue realized by the video display system of the present invention is further improved.

More preferably, in the above video display system, when displaying the video of the first user, the video display unit adjusts the display size of the video displayed on the display screen so that the height of the first user as displayed on the display screen matches the actual height of the first user.
In this configuration, the video of the first user is displayed at life size. The realism of the dialogue realized by the video display system of the present invention is thereby further improved.
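The life-size adjustment can be thought of as choosing a single scale factor for the video. The sketch below is a minimal illustration under stated assumptions: the user's actual height, the user's height in pixels in the captured video, and the pixel density of the display screen are all taken as known inputs, and the function name `life_size_scale` is hypothetical.

```python
def life_size_scale(actual_height_m: float,
                    person_height_px: float,
                    screen_px_per_m: float) -> float:
    """Return the factor by which the video must be scaled so that the
    displayed height of the user equals the user's actual height.

    All three inputs are assumptions of this sketch: the specification does
    not state how they are obtained or represented.
    """
    if person_height_px <= 0 or screen_px_per_m <= 0:
        raise ValueError("pixel height and screen density must be positive")
    target_height_px = actual_height_m * screen_px_per_m
    return target_height_px / person_height_px


if __name__ == "__main__":
    # A 1.7 m tall user who occupies 850 px in the captured video, shown on a
    # screen with 1000 px per metre.
    print(life_size_scale(1.7, 850, 1000))
```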

Preferably, in the above video display system, when the imaging unit captures video of the first user performing a predetermined motion in a state where the part outside the imaging range is a predetermined part of the first user's body, the composite video generation unit generates the composite video using the other video composed of a virtual video of the predetermined part.
In particular, it is more preferable that the first unit and the second unit each include the imaging unit, the display screen, and the video display unit; that the lens of the imaging unit of the first unit faces the forming surface of the display screen in the screen forming device that forms the display screen; and that, when the imaging unit of the first unit captures video of the first user performing a motion of bringing a hand into contact with the forming surface in a state where the hand is outside the imaging range, the composite video generation unit generates the composite video using the other video composed of a virtual video of the hand.
In this configuration, when the first user places a hand on the forming surface of the display screen, the video of the hand can be complemented and displayed even if the hand is outside the imaging range of the imaging unit. The second user can then perform a motion of aligning his or her own hand with the first user's hand displayed on the display screen (a hand-matching motion). As a result, the interactive staging effect provided by the video display system of the present invention can be enhanced.

According to the video display method of the present invention, the above problem is solved by a video display method that uses a first unit used by a first user and a second unit used by a second user in a space different from that of the first user to display video of the first user on a display screen provided on the second user's side. In this method, an imaging unit included in the first unit captures video of a subject within an imaging range, and a video display unit included in the second unit displays video on the display screen. When the imaging unit captures video of the first user, who is the subject, in a state where a part of the first user's body is outside the imaging range, a specifying unit included in one of the first unit and the second unit specifies that part, and a composite video generation unit included in that unit generates a composite video by combining the video captured by the imaging unit in the state where the part is outside the imaging range with another video. The other video is composed of at least one of a virtual video of the part, generated based on video captured by the imaging unit while the part was within the imaging range, and a virtual video of an object positioned in front of the first user in the composite video. When generating the composite video, the composite video generation unit generates it so that the other video is displayed in an area corresponding to the relative position of the part with respect to the body on the display screen on which the video display unit displays the composite video.
According to this method, the video of the missing part, which was outside the imaging range of the imaging unit and was therefore not captured, is complemented by a virtual video. As a result, the realism of the dialogue can be kept from being impaired by the missing part.

According to the video display system and the video display method of the present invention, the video of the missing part of the first user's body, which is outside the imaging range of the imaging unit and is not captured, is complemented by a virtual video. As a result, even when a part of the first user's body is outside the imaging range while video of the first user is captured by a single imaging unit (for example, a camera), the realism of the dialogue can be kept from being impaired by the missing part.

FIG. 1 is a diagram showing the configuration of a video display system according to an embodiment of the present invention.
FIG. 2 is a diagram showing the arrangement of the system component devices in the room where a user is present.
FIGS. 3A and 3B are diagrams showing an example of the screen forming device according to the present invention.
FIGS. 4A and 4B are diagrams showing how the displayed video changes according to the distance between the imaging unit and the user.
FIG. 5 is an explanatory diagram of a composite video in which the missing part has been complemented.
FIG. 6 is a diagram showing, in terms of functions, the configuration of the home server held by each user.
FIG. 7 is an explanatory diagram of the procedure for acquiring skeleton information.
FIG. 8 is an explanatory diagram of the specification of the missing part and the generation of the complementary video.
FIG. 9 is a diagram showing the flow of the video display process.
FIGS. 10A and 10B are diagrams showing the flow of the process for complementing the video of the missing part.
FIG. 11 is a diagram showing users performing the hand-matching motion.
FIGS. 12A, 12B, and 12C are explanatory diagrams of the procedure for complementing the video of the missing part when the missing part arises from a moving motion.
FIGS. 13A and 13B are diagrams showing the displayed video when the missing part is complemented using a foreground video.
FIG. 14 is a diagram showing the flow of the video display process according to a first modification.
FIG. 15 is a diagram showing the flow of the process for complementing the missing part with a foreground video in the first modification.
FIG. 16 is a diagram showing the displayed video when the hand is missing and the missing part is complemented in the first modification.
FIG. 17 is a diagram showing the displayed video when the missing part is complemented using a frame video.
FIG. 18 is a diagram showing the flow of the video display process according to a first modification.

Hereinafter, an embodiment of the present invention (hereinafter, the present embodiment) will be described with reference to the drawings. The video display system according to the present embodiment (hereinafter, the system S) is used by users in rooms apart from each other to converse while watching each other's appearance (video). More specifically, the visual effect provided by the system S makes each user feel as if he or she were talking with the conversation partner face to face.

In the present embodiment, the system S is used when each user is at his or her own home. That is, the system S is used so that each user can converse with the conversation partner while staying at home. However, the present invention is not limited to this, and the system S may be used when the user is in a place other than home (for example, a meeting place or a commercial facility). The system S may also be used so that users in separate rooms of the same building can converse with each other.

In the following, to make the explanation of the system S easier to follow, a case where two users converse using the system S will be described as an example, with one user referred to as Mr. A and the other as Mr. B. The configuration of the system S will be described from the viewpoint of Mr. B, that is, from the viewpoint of the person viewing Mr. A's video. In other words, Mr. A corresponds to the "first user", and Mr. B corresponds to the "second user". However, the "first user" and the "second user" are relative concepts that are switched according to who is viewing and who is being viewed. From Mr. A's viewpoint, for example, Mr. B corresponds to the "first user" and Mr. A corresponds to the "second user".

<< Basic configuration of this system >>
First, the basic configuration of the system S will be described. The system S is used by two users (that is, Mr. A and Mr. B) to converse while watching each other's video. More specifically, a life-size video of the conversation partner is displayed to each user, and the voice of the conversation partner is played back. To obtain this audiovisual effect, each user has a communication unit 100. That is, the system S is made up of the communication units 100 owned by the respective users. Here, the communication unit 100 owned by Mr. A corresponds to the "first unit", and the communication unit 100 owned by Mr. B corresponds to the "second unit".

Next, the configuration of the communication unit 100 will be described with reference to FIG. FIG. 1 is a diagram showing the configuration of the system S, more specifically, the configuration of each communication unit 100.

Each communication unit 100 includes, as main components, a home server 1, a camera 2 as an imaging unit, a microphone 3 as a sound collecting unit, an infrared sensor 4, a screen terminal 5 as a screen forming device, and a speaker 6. Among these devices, the camera 2, the microphone 3, the infrared sensor 4, the screen terminal 5, and the speaker 6 are arranged in a predetermined room of each user's home (for example, the room used when conversing with the conversation partner).

The home server 1 includes a server computer constituting a so-called home gateway, and includes a CPU, a memory such as a ROM and a RAM, a communication interface, and a hard disk drive. The home server 1 is installed with a program for executing data processing necessary for a dialogue (face-to-face dialogue) between Mr. A and Mr. B through the system S (hereinafter, a dialogue program).

The home server 1 is also communicably connected to communication devices via an external communication network GN such as the Internet. That is, the home server 1 belonging to the communication unit 100 owned by Mr. A communicates with the home server 1 belonging to the communication unit 100 owned by Mr. B via the external communication network GN, and the two servers transmit and receive various data. The data transmitted and received by the home server 1 is the data necessary for a dialogue (face-to-face dialogue) between Mr. A and Mr. B through the system S, for example, video data representing each user's video and audio data representing each user's voice.

The camera 2 is a well-known network camera that captures video of a subject within its imaging range and outputs the video signal to the home server 1 (strictly speaking, the home server 1 belonging to the same communication unit 100 as the camera 2). The number of cameras 2 installed is not particularly limited, but in the present embodiment only one camera 2 is provided in each communication unit 100 in consideration of cost. In the present embodiment, the lens of the camera 2 faces the display screen forming surface provided in the screen terminal 5. The panel of the screen terminal 5 that constitutes the forming surface (strictly speaking, the touch panel 5a) is made of transparent glass. Therefore, as shown in FIG. 2, the camera 2 captures, through the panel, video of a subject located in front of the panel. FIG. 2 is a diagram showing the arrangement of the component devices of the communication unit 100 in the room where a user is present.

The microphone 3 collects sound in the room in which it is installed and outputs the audio signal to the home server 1 (strictly speaking, the home server 1 belonging to the same communication unit 100 as the microphone 3). In the present embodiment, the microphone 3 is installed directly above the screen terminal 5, as shown in FIG. 2.

The infrared sensor 4 is a sensor for measuring the depth of a measurement object by an infrared method. Specifically, the infrared sensor 4 emits infrared rays from the light emitting unit 4a toward the measurement object and measures the depth by receiving the reflected light at the light receiving unit 4b. Here, the depth is the distance from the light receiving unit 4b to the measurement object. In the present embodiment, the light emitting unit 4a and the light receiving unit 4b of the infrared sensor 4 face the display screen forming surface provided in the screen terminal 5. As described above, the panel of the screen terminal 5 that constitutes the forming surface is made of transparent glass. Therefore, as shown in FIG. 2, the infrared sensor 4 measures, through the panel, the depth of a measurement object located in front of the panel.

The speaker 6 is a known speaker that emits the sound (reproduced sound) reproduced when the home server 1 expands the audio data. In the present embodiment, as shown in FIG. 2, a plurality of speakers 6 (four in FIG. 2) are installed at positions on both sides of the screen terminal 5 in its width direction.

The screen terminal 5 forms a video display screen that is reproduced by the home server 1 developing video data. Specifically, the screen terminal 5 includes a panel made of transparent glass, and forms a display screen on the front surface of the panel. That is, the front surface of the panel corresponds to the display screen formation surface. In the present embodiment, the above panel is the touch panel 5a and receives an operation (touch operation) performed by the user.

The above panel is large enough to display a whole-body video of a person. In a face-to-face conversation using the system S, the whole-body video of the conversation partner is displayed at life size on the display screen formed on the front surface of the panel. As a result, Mr. B, who is looking at the display screen, feels as if he were meeting Mr. A, and in particular as if facing Mr. A through glass.

Furthermore, the screen terminal 5 according to the present embodiment normally functions as a full-length mirror arranged in the room and forms a display screen only during a face-to-face conversation. The configuration of the screen terminal 5 will be described in detail below with reference to FIGS. 3A and 3B. FIGS. 3A and 3B are diagrams showing a configuration example of the screen terminal 5; FIG. 3A shows the normal state (non-dialogue state), and FIG. 3B shows the state during a face-to-face conversation.

The touch panel 5a of the screen terminal 5 constitutes a part of the full-length mirror arranged in the room, specifically its mirror surface portion. As shown in FIG. 3A, during a period in which no dialogue is performed, that is, in the normal state, the touch panel 5a does not form a display screen and presents the appearance of a mirror surface. When a face-to-face conversation is started, the touch panel 5a forms a display screen on its front surface. As a result, video of the conversation partner is displayed on the front surface of the touch panel 5a, as shown in FIG. 3B.

The home server 1 switches the display screen on and off according to the measurement result of the infrared sensor 4. More specifically, while the user is standing in front of the screen terminal 5, the home server 1 specifies the position of the user, strictly speaking the distance from the front surface of the touch panel 5a, based on the depth measured by the infrared sensor 4.

When the distance between the user and the touch panel 5a becomes smaller than a predetermined distance, the home server 1 controls the screen terminal 5 to form a display screen on the front surface of the touch panel 5a. The touch panel 5a, which has been functioning as a full-length mirror until then, now functions as a screen for displaying video. Conversely, when the distance between the user and the touch panel 5a becomes equal to or greater than the predetermined distance, the home server 1 controls the screen terminal 5 to erase the display screen that has been formed. The touch panel 5a then functions as a full-length mirror again.
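The switching rule described above amounts to a simple threshold comparison on the measured distance. The following minimal sketch illustrates it; the function names and the 1.0 m threshold in the usage example are hypothetical, since the specification only refers to "a predetermined distance".

```python
from typing import Iterable, List


def display_should_be_on(distance_m: float, threshold_m: float) -> bool:
    """Return True when the user is closer to the touch panel than the
    predetermined distance, i.e. when the display screen should be formed.

    A distance equal to or greater than the threshold means the display
    screen is erased and the panel serves as a mirror again.
    """
    return distance_m < threshold_m


def display_states(distances_m: Iterable[float], threshold_m: float) -> List[bool]:
    """Apply the rule to a sequence of distance measurements."""
    return [display_should_be_on(d, threshold_m) for d in distances_m]


if __name__ == "__main__":
    # Hypothetical measurements (metres) as a user walks up to the panel and
    # then steps back; the threshold value is an assumption.
    print(display_states([2.0, 1.2, 0.8, 1.0, 1.5], threshold_m=1.0))
```

A practical controller would probably add some hysteresis or debouncing so that the screen does not flicker when the user stands right at the threshold, but the specification does not describe such a measure.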

As described above, in the system S, the screen terminal 5 that serves as the screen for displaying video is normally used as a full-length mirror. This makes the presence of the display screen difficult to notice at normal (non-dialogue) times. During a face-to-face conversation, on the other hand, a display screen is formed and the conversation partner's video is displayed. The user looking at the display screen thereby experiences a visual effect as if talking with the conversation partner through glass. As a result, a more realistic dialogue (face-to-face dialogue) is realized.

For the configuration that serves as both a video display screen and a full-length mirror, a known configuration can be used, such as the one described in International Publication No. 2009/122716. The screen terminal 5 is not limited to a configuration that doubles as a full-length mirror. It only needs to be large enough to display the conversation partner's video (whole-body video); for example, a door (glass door) or a window (glass window) installed in the room may double as the screen terminal 5. Furthermore, the screen terminal 5 is not limited to one that doubles as a door, a window, or a full-length mirror, and may be an ordinary device that always forms a display screen while it is running.

<< Measures to be taken by System S for the occurrence of missing parts >>
In a face-to-face conversation using the system S, the camera 2 on Mr. A's side captures Mr. A's video, and the microphone 3 on Mr. A's side picks up Mr. A's voice. Mr. A's home server 1 then transmits the video data and audio data to Mr. B's home server 1. When Mr. B's home server 1 receives the video data and audio data via the network, it expands them. As a result, Mr. A's video is displayed on the display screen formed by Mr. B's screen terminal 5, and Mr. A's voice (strictly speaking, reproduced sound of the sound collected in the room where Mr. A is present) is emitted from Mr. B's speaker 6.

When the video of Mr. A captured by the camera 2 on Mr. A's side is displayed on the display screen on Mr. B's side, the displayed video varies depending on the distance between the camera 2 and Mr. A. This is described below with reference to FIGS. 4A and 4B, which show how the displayed video changes according to the distance between the camera 2 and the user. FIG. 4A shows the positional relationship between the camera 2 and Mr. A, and FIG. 4B shows the video of Mr. A displayed on the display screen on Mr. B's side. When Mr. A stands at the position indicated by symbol a, b, or c in FIG. 4A, the display screen on Mr. B's side displays the video marked with the same symbol in FIG. 4B.

When Mr. A approaches the camera 2 and the distance between them becomes smaller than a predetermined distance (specifically, when Mr. A comes closer to the camera 2 than the position of symbol c in FIG. 4A), a part of Mr. A's body falls outside the imaging range of the camera 2 and is cut off, as shown in FIGS. 4A and 4B. As Mr. A comes closer to the camera 2, the part of Mr. A's body located outside the imaging range, that is, the missing part, grows. In this embodiment, the missing part grows upward from the bottom in the vertical direction as Mr. A approaches the camera 2. Even when Mr. A approaches the camera 2, Mr. A's head and shoulders always remain within the imaging range, and Mr. A's body is never cut off in the width direction.
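The way the missing part grows from the bottom as the user approaches can be illustrated with simple geometry. The sketch below assumes an idealized pinhole camera whose optical axis is horizontal, mounted at a known height above the floor; none of these assumptions, nor the numeric values or the function name `missing_height_m`, come from this specification.

```python
import math


def missing_height_m(camera_height_m: float,
                     half_vertical_fov_deg: float,
                     distance_m: float) -> float:
    """Return the height (measured up from the floor) of the body region that
    falls below the lower edge of the imaging range.

    Assumed model: pinhole camera with a horizontal optical axis. At a
    distance d, the lower edge of the imaging range is at height
    camera_height - d * tan(half_vertical_fov).
    """
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    lower_edge_m = camera_height_m - distance_m * math.tan(
        math.radians(half_vertical_fov_deg))
    # A lower edge at or below the floor means nothing is cut off.
    return max(0.0, lower_edge_m)


if __name__ == "__main__":
    # Hypothetical camera: 1.5 m above the floor, 45 degree half field of view.
    for d in (2.0, 1.0, 0.5):
        print(d, round(missing_height_m(1.5, 45.0, d), 3))
```

Under this model the whole body is captured beyond a certain distance, and the cut-off region rises from the feet as the distance shrinks, which matches the behaviour described for positions closer than symbol c.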

If a part of Mr. A's body is not imaged at the time of imaging, Mr. A's image is displayed on the display screen of Mr. B without the image of the missing part. Such a situation gives a sense of incongruity to Mr. B who is watching Mr. A's video, and significantly impairs the realism of the dialogue (face-to-face dialogue) performed while watching each other's video.

In contrast, in the system S, when a missing part occurs, a video that complements the missing part (complementary video) is used, and a video that combines the actually captured video with the complementary video (composite video) is displayed. The procedure is as follows, with reference to FIG. 5. First, the missing part is specified based on the actually captured video. After the missing part is specified, a virtual video showing the same part as the missing part is acquired. Next, as shown in FIG. 5, size conversion is applied to each of the actually captured video and the virtual video of the missing part. This size conversion is video processing for displaying Mr. A's whole-body video, in the composite video displayed afterwards, at a size that matches Mr. A's actual height (that is, life size).

A composite video is then generated by combining the videos after the size conversion. When the composite video generated by this procedure is displayed on the display screen on Mr. B's side, the missing part is complemented by the complementary video, so Mr. A's whole-body video appears as the displayed video.
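The size conversion and combination steps can be sketched as follows. This is only an illustration under stated assumptions: frames are represented as lists of pixel rows, the resize is a plain nearest-neighbour resize, and the virtual video of the missing part is placed directly below the captured video, which corresponds to the case where the lower body is missing. The names `resize_nearest` and `compose` are hypothetical.

```python
from typing import List

Frame = List[List[int]]  # a frame as rows of pixel values


def resize_nearest(frame: Frame, scale: float) -> Frame:
    """Nearest-neighbour resize of a frame by a uniform scale factor."""
    src_h, src_w = len(frame), len(frame[0])
    dst_h = max(1, round(src_h * scale))
    dst_w = max(1, round(src_w * scale))
    return [
        [frame[min(src_h - 1, int(y / scale))][min(src_w - 1, int(x / scale))]
         for x in range(dst_w)]
        for y in range(dst_h)
    ]


def compose(captured: Frame, virtual_part: Frame, scale: float) -> Frame:
    """Apply the same size conversion to both videos, then place the virtual
    video of the missing part below the actually captured video."""
    top = resize_nearest(captured, scale)
    bottom = resize_nearest(virtual_part, scale)
    if len(top[0]) != len(bottom[0]):
        raise ValueError("captured and virtual frames must have equal width")
    return top + bottom


if __name__ == "__main__":
    captured = [[1, 1], [1, 1]]   # stands in for the captured upper body
    virtual_legs = [[9, 9]]       # stands in for the virtual video of the legs
    for row in compose(captured, virtual_legs, scale=2.0):
        print(row)
```

Using one scale factor for both inputs keeps the complemented region in proportion to the rest of the body, so the virtual part lands in the area corresponding to its relative position with respect to the body.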

By complementing the missing part as described above, the present system S can effectively prevent the sense of presence of the face-to-face dialogue from being lost when a missing part occurs. This effect is particularly valuable in a configuration in which each communication unit 100 has only one camera 2, as in the present embodiment. More specifically, when only one camera 2 is installed, a portion outside the imaging range of the camera 2 (that is, a missing part) cannot be captured by another camera instead. As described above, however, even with a single camera 2, the missing part can be complemented using a virtual video. As a result, a video display system that realizes a face-to-face conversation with a sense of reality can be provided at lower cost.
In the following description, regarding the present system S, the configuration related to the video display processing including the complement of the missing portion and the flow of the video display processing will be described in detail.

<< About home server functions >>
Next, the functions of the home server 1, particularly the functions related to the video display processing, will be described. Note that Mr. A's home server 1 and Mr. B's home server 1 have the same functions, and both execute the same data processing through two-way communication during the face-to-face conversation. For this reason, only the functions of one home server 1 (for example, Mr. A's home server 1) will be described below.

The home server 1 functions as such when the CPU of the apparatus executes a dialogue program; specifically, it executes a series of data processing related to the face-to-face dialogue. Here, the configuration of the home server 1 is described in terms of its functions. As shown in FIG. 6, the home server 1 includes a video acquisition unit 11, a person video extraction unit 12, a video storage unit 13, a skeleton information acquisition unit 14, a skeleton information storage unit 15, a specifying unit 16, a complementary video generation unit 17, a composite video generation unit 18, a video data transmission unit 19, a video data reception unit 20, and a video display unit 21. FIG. 6 is a diagram showing the configuration of the home server 1 in terms of functions.

Each of the data processing units described above is realized by the hardware devices of the home server 1 (specifically, the CPU, memory, communication interface, hard disk drive, and the like) cooperating with the dialogue program as software. Hereinafter, each data processing unit will be described individually.

The video acquisition unit 11 acquires a video signal from the camera 2. Here, the video signal acquired by the video acquisition unit 11 indicates a video actually captured by the camera 2 (hereinafter, an actual video). Therefore, when the user is within the imaging range of the camera 2, the video acquisition unit 11 acquires a video signal of an actual video including the user's video.

The person video extraction unit 12 extracts a person video from the actual video indicated by the video signal acquired by the video acquisition unit 11. Here, the person video is the video of the part of the actual video that is recognized as a person. In the present embodiment, the actual video is not used as it is; instead, the person video and the background video are separated. This is because separating the two makes it possible to use each video individually (for example, for editing and processing), and because freely combining person videos and background videos increases the variations of the final display video (composite video). The method for extracting a person video from the actual video is not particularly limited; one example is a method that identifies the person video based on depth data of the actual video. The depth data of the actual video is obtained by dividing each frame image of the actual video into pixels and specifying, for each pixel, the measurement result of the infrared sensor 4, that is, the depth. According to this depth data, as shown in FIG. 7 described later, the depth values of the pixels belonging to the person video (white pixels in FIG. 7) clearly differ from those of the pixels belonging to the background video (hatched pixels in FIG. 7). This property makes it possible to extract the person video from the actual video.
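The depth-based separation described above can be sketched as follows. This is only a minimal illustration: the function names, the single depth threshold `max_person_depth_m`, and the use of nested lists for a frame are assumptions introduced here, not details given in this description.

```python
from typing import Any, List, Tuple

Mask = List[List[bool]]


def person_mask(depth_m: List[List[float]], max_person_depth_m: float) -> Mask:
    """Mark pixels whose depth is valid and nearer than the threshold.

    Assumption: the person stands clearly in front of the background, so a
    single threshold is enough to separate the two groups of pixels.
    """
    return [[0.0 < d <= max_person_depth_m for d in row] for row in depth_m]


def split_person_and_background(
    frame: List[List[Any]], mask: Mask
) -> Tuple[List[List[Any]], List[List[Any]]]:
    """Split one frame into a person layer and a background layer.

    Pixels that do not belong to a layer are set to None (transparent).
    """
    person = [
        [px if keep else None for px, keep in zip(f_row, m_row)]
        for f_row, m_row in zip(frame, mask)
    ]
    background = [
        [None if keep else px for px, keep in zip(f_row, m_row)]
        for f_row, m_row in zip(frame, mask)
    ]
    return person, background


# Hypothetical 2 x 2 frame: the person is about 1 m away, the wall about 3 m.
depth = [[3.0, 1.2], [1.1, 3.1]]
frame = [["a", "b"], ["c", "d"]]
mask = person_mask(depth, max_person_depth_m=2.0)
person, background = split_person_and_background(frame, mask)
```

Keeping the two layers separate mirrors the reason given above for the separation: each layer can then be edited or recombined independently.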

The video storage unit 13 stores various videos. As shown in FIG. 6, the video stored in the video storage unit 13 is a whole body video, a template video, and a background video. The whole body video is the person video when the person video extracted by the person video extraction unit 12 corresponds to the whole body video. That is, when the person video extracted by the person video extraction unit 12 corresponds to the whole body video, the video storage unit 13 stores the person video as the whole body video. The background video is the actual video when the human video is not included in the real video indicated by the video signal acquired by the video acquisition unit 11. That is, when the person video is not included in the actual video indicated by the video signal acquired by the video acquisition unit 11, the video storage unit 13 stores the actual video as a background video. The template video is a video that is stored in advance as a video that is used for complementing the missing portion, and is a standard video as a video of each part of the human body (hand, foot, waist, etc.), for example. This template video is used when there is no whole body video (when it is not stored in the video storage unit 13).

The skeleton information acquisition unit 14 acquires the skeleton information of the person from the person video extracted from the actual video by the person video extraction unit 12. Here, the skeleton information indicates the positional relationship between the person's head and the parts other than the head (specifically, the shoulders, elbows, wrists, upper body center, waist, knees, and ankles). In the present embodiment, the simple model of the person's skeleton illustrated in FIG. 7 (skeleton model) is acquired as the skeleton information. FIG. 7 is an explanatory diagram of the procedure for acquiring a skeleton model as skeleton information.
Incidentally, the skeleton model is acquired based on the above-described depth data of the actual video. A known method can be used to acquire the skeleton model from the depth data; for example, a method similar to those adopted in the inventions described in Japanese Patent Application Laid-Open Nos. 2014-155893 and 2013-116311 may be used.

The skeleton information storage unit 15 stores skeleton models. The skeleton models stored in the skeleton information storage unit 15 are either skeleton models acquired by the skeleton information acquisition unit 14 or models acquired in advance as skeleton models of a person with a standard physique (hereinafter referred to as sample models). In the present embodiment, the stored skeleton models include a skeleton model of the whole body video (hereinafter referred to as the whole body model) and a skeleton model during a movement operation (hereinafter referred to as the movement model). However, the present invention is not limited to this, and a skeleton model for a predetermined motion or a skeleton model for a predetermined posture may also be included.

The specifying unit 16 determines whether the person video in the actual video has a missing part (that is, a part located outside the imaging range of the camera 2 at the time of shooting) and, when it determines that there is one, specifies the missing part. The missing part is specified based on the skeleton model acquired by the skeleton information acquisition unit 14 from the person video in the actual video (hereinafter, the current skeleton model) and the whole body model stored in the skeleton information storage unit 15. More specifically, referring to FIG. 8, the current skeleton model is compared with the whole body model, and the portion that is absent from the current skeleton model is specified as the missing part. FIG. 8 is an explanatory diagram regarding the identification of the missing part and the generation of the complementary video.
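The comparison between the current skeleton model and the whole body model can be pictured with the short sketch below. It assumes, purely for illustration, that a skeleton model is a dictionary mapping part names to 2D coordinates and that a part outside the imaging range is either absent or set to `None`; the description does not specify the data format.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]
SkeletonModel = Dict[str, Optional[Point]]


def find_missing_parts(
    current_model: SkeletonModel, whole_body_model: SkeletonModel
) -> List[str]:
    """Return the parts present in the whole body model but not in the current one."""
    return [
        part
        for part in whole_body_model
        if current_model.get(part) is None
    ]


# Hypothetical models: the user has stepped close and the legs are cut off.
whole_body = {
    "head": (0.0, 1.6), "shoulder": (0.0, 1.4), "waist": (0.0, 0.9),
    "knee": (0.0, 0.5), "ankle": (0.0, 0.1),
}
current = {"head": (0.0, 1.6), "shoulder": (0.0, 1.4), "waist": (0.0, 0.9), "knee": None}

missing = find_missing_parts(current, whole_body)
has_missing_part = bool(missing)
```

An empty result corresponds to the case in which the person video is a whole body video.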

In the present embodiment, when specifying the missing part, the specifying unit 16 also specifies the state of the missing part; specifically, it specifies whether the missing part is in a stationary state or an operating state. Here, the operating state means a state in which the user is moving in a direction crossing in front of the camera 2, or a state in which the user is approaching or moving away from the camera 2. The state of the missing part is specified based on the actual video of the parts of the user's body other than the missing part, as captured by the camera 2, and on the depth data of that actual video.
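One way to picture the distinction between the stationary state and the two kinds of operating state is the sketch below. The description only states that the actual video and its depth data are used; the use of the head position in two consecutive frames, the tuple layout `(x, depth)`, and the threshold value are all assumptions made for this illustration.

```python
from typing import Tuple

# (horizontal position in metres, depth from the camera in metres)
Position = Tuple[float, float]


def classify_state(prev: Position, curr: Position, threshold_m: float = 0.05) -> str:
    """Classify the user's state from two consecutive head positions.

    Returns "stationary", "lateral" (moving across the camera) or
    "depth" (approaching or moving away from the camera).
    """
    dx = curr[0] - prev[0]   # change seen in the actual video
    dz = curr[1] - prev[1]   # change seen in the depth data
    if abs(dx) < threshold_m and abs(dz) < threshold_m:
        return "stationary"
    return "lateral" if abs(dx) >= abs(dz) else "depth"
```

For example, `classify_state((0.0, 2.0), (0.0, 1.5))` reports a depth movement, which corresponds to the user walking toward the camera.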

The complementary video generation unit 17 generates a virtual video (hereinafter, complementary video) that complements the missing part specified by the specifying unit 16. As shown in FIG. 8, the complementary video is generated by processing the whole body video or the template video stored in the video storage unit 13 based on the missing part specified by the specifying unit 16 (strictly, the portion absent from the current skeleton model). More specifically, when a complementary video is generated from the whole body video, the portion corresponding to the missing part specified by the specifying unit 16 is cut out of the whole body video read from the video storage unit 13, and the cut-out video is edited in accordance with the position and orientation of the missing part. When a complementary video is generated from a template video, the template video corresponding to the missing part is read from the video storage unit 13, and the read template video is edited in accordance with the position and orientation of the missing part.

Note that the complementary video is generated based on the missing part specified by the specifying unit 16, and the missing part is specified by the specifying unit 16 using the whole body model and the current skeleton model. Here, the whole body model is acquired from a video captured by the camera 2 when the portion corresponding to the missing part was within the imaging range, that is, when the whole body was within the imaging range. Therefore, the complementary video can be said to be a video (virtual video) generated based on a video captured by the camera 2 when the missing part was within the imaging range.

In addition, when the state of the missing part specified by the specifying unit 16 is the operating state, the complementary video generation unit 17 generates a complementary video of the missing part that reflects the operating state. More specifically, when the camera 2 captures a user who is moving in a direction crossing in front of the camera 2 (hereinafter referred to as a lateral movement operation) with the user's legs outside the imaging range, the complementary video generation unit 17 generates a complementary video of the legs during the lateral movement operation. Similarly, when the camera 2 captures a user who is moving toward or away from the camera 2 (hereinafter referred to as a depth movement operation) with the user's legs outside the imaging range, the complementary video generation unit 17 generates a complementary video of the legs during the depth movement operation.

The composite video generation unit 18 generates a composite video. The composite video is a video obtained by combining a person video and a background video. That is, the composite video generation unit 18 combines the separated person video and background video, and generates a person video with a background as the composite video. In addition, when generating the composite video, the composite video generation unit 18 adjusts the display size of the person video so that the height of the user displayed on the conversation partner's display screen matches the user's actual height. For this adjustment, it uses the skeleton model acquired by the skeleton information acquisition unit 14 from the person video in the actual video (the current skeleton model), the whole body model stored in the skeleton information storage unit 15, and the depth data of the actual video from which the whole body model was derived. More specifically, the height of the user is calculated from the whole body model and the depth data of the actual video on which the whole body model is based, and the ratio between the current skeleton model and the whole body model is calculated. The video display size is then adjusted based on the calculated height of the user and the calculated ratio between the models.
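The size adjustment can be illustrated with a small worked calculation. The description does not give formulas, so the sketch below makes several assumptions that should be read as hypothetical: a pinhole camera model with a known focal length in pixels, a display with a known pixel density, and the head-to-shoulder length as the quantity compared between the current skeleton model and the whole body model.

```python
def user_height_m(whole_body_pixel_height: float, depth_m: float,
                  focal_length_px: float) -> float:
    """Estimate the real height from the whole body model and its depth data.

    Assumption: pinhole camera, so real size = pixel size * depth / focal length.
    """
    return whole_body_pixel_height * depth_m / focal_length_px


def model_ratio(current_head_to_shoulder_px: float,
                whole_body_head_to_shoulder_px: float) -> float:
    """Ratio between the current skeleton model and the whole body model."""
    return current_head_to_shoulder_px / whole_body_head_to_shoulder_px


def display_scale(height_m: float, ratio: float, whole_body_pixel_height: float,
                  display_px_per_m: float) -> float:
    """Scale factor that makes the displayed whole body life size."""
    target_px = height_m * display_px_per_m                     # life size on screen
    apparent_full_height_px = whole_body_pixel_height * ratio   # size in current frame
    return target_px / apparent_full_height_px


# Hypothetical numbers: whole body seen 280 px tall at 3 m, focal length 500 px.
height = user_height_m(280, 3.0, 500.0)        # about 1.68 m
ratio = model_ratio(60, 30)                    # user now appears twice as large
scale = display_scale(height, ratio, 280, display_px_per_m=1000.0)
```

With these numbers the current frame must be enlarged by a factor of about 3 so that the 1.68 m user occupies about 1680 pixels on the display.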

Furthermore, when the person video in the actual video has a missing part (that is, when the camera 2 captures the video with a part of the user's body outside the imaging range), the composite video generation unit 18 combines the person video with the complementary video (corresponding to the other video) generated by the complementary video generation unit 17 to generate a person video in which the missing part is complemented. As a result, when a composite video including the complemented person video is displayed on the conversation partner's display screen, the complementary video is displayed in the area of the display screen that corresponds to the relative position of the complemented portion of the person video (that is, the portion where the missing part was). In other words, the composite video generation unit 18 generates the composite video so that the complementary video is displayed in the area corresponding to the relative position of the missing part with respect to the user's body on the display screen on which the composite video is displayed.

The video data transmission unit 19 transmits video data indicating the composite video generated by the composite video generation unit 18 to the home server 1 on the conversation partner side. The video data reception unit 20 receives the video data transmitted by the home server 1 on the conversation partner side via the external communication network GN. The video display unit 21 expands the video data received by the video data reception unit 20 and displays the video indicated by the video data (that is, the composite video generated by the home server 1 on the conversation partner side) on the display screen of the screen terminal 5.

<< Flow of video display processing >>
Next, in the face-to-face conversation using the system S, data processing related to video display, that is, video display processing will be described in detail. In the video display processing described below, the video display method of the present invention is applied. That is, each step performed in the video display process corresponds to each process constituting the video display method of the present invention.

In the video display process, video data is first generated and transmitted in the communication unit 100 used by one user (for example, Mr. A), and the video data is then received and expanded in the communication unit 100 used by the other user (for example, Mr. B). The following description focuses on the part of the video display process from the generation to the transmission of the video data.

The video display process proceeds according to the flow shown in FIG. 9. FIG. 9 is a diagram showing the flow of the video display process. More specifically, first, the camera 2 captures video within its imaging range and outputs a video signal indicating that video to the home server 1 (the home server 1 belonging to the same communication unit 100 as the camera 2) (S001). The home server 1 that has received the video signal applies face recognition processing to the video (actual video) indicated by the video signal (S002). The home server 1 thereby determines whether there is a user within the imaging range of the camera 2 (S003). Note that the face recognition processing is video analysis processing for determining whether a person video is included in the actual video; since its specific content is well known, its description is omitted.

When it is determined that the user is within the imaging range of the camera 2, the home server 1 acquires the depth data of the real video from the measurement result of the infrared sensor 4 (S004). Further, the home server 1 extracts a person video from the actual video, and acquires a skeleton model of the user within the imaging range of the camera 2 based on the person video and the depth data acquired in the previous step S004 ( S005). Further, the home server 1 calculates the user's life size (specifically, height, etc.) based on the acquired depth data and skeleton model (S006). Furthermore, the home server 1 adjusts the video display size of the person video extracted from the real video so as to match the life size calculated in the previous step S006 (S007).

The home server 1 then determines whether the person video is a whole body video based on the skeleton model acquired in step S005 (S008). If it determines that the person video is a whole body video, the home server 1 registers (stores) the person video as the user's whole body video (S009). At the same time, the home server 1 registers (stores) the skeleton model acquired in step S005 as the whole body model, and registers (stores) the life size calculated in step S006 (S009). Thereafter, the home server 1 generates a composite video by combining the person video, which is a whole body video, with the background video (S010), and transmits video data indicating the generated composite video to the home server 1 on the conversation partner side (S011).

On the other hand, if the person video is determined in step S008 not to be a whole body video, that is, if it is determined that the person video has a missing part, the home server 1 specifies the missing part and complements the video of the missing part. Here, a missing part occurs when the user approaches the camera 2, and usually the legs correspond to the missing part. However, when the user comes very close to the camera 2, not only the legs but also a part of the upper body (for example, the hand) can become a missing part. Therefore, when specifying the missing part, the home server 1 determines whether the hand is included in the missing part (S012). When it determines that the hand is included, the home server 1 complements the videos of both the hand and the legs (S013, S014). Conversely, when it determines that the hand is not included, the home server 1 complements only the video of the legs (S014).

Next, the procedures for complementing the videos of the hand and the legs will be described with reference to FIGS. 10A and 10B. FIG. 10 is a diagram showing the flow of the processing for complementing the video of the missing part: FIG. 10A shows the procedure for complementing the video of the hand, and FIG. 10B shows the procedure for complementing the video of the legs.

First, a process for complementing the image of the hand when the hand is missing will be described. As shown in FIG. 10A, the present process starts when the home server 1 reads the hand template video stored in the hard disk drive (S021). Thereafter, the home server 1 adjusts the display size of the hand template image according to the life size calculated in step S006 illustrated in FIG. 9 (S022). After adjusting the video display size, the home server 1 combines the template image of the hand whose size has been adjusted with the person video whose size has been adjusted in step S007 illustrated in FIG. 9 (S023). As a result, a person image in which the image of the hand is complemented is generated.

Because the video of the hand is complemented as described above, the user's hand (strictly speaking, the complemented video of the hand) can be displayed on the conversation partner's display screen even when the user comes very close to the camera 2. Thus, for example, when Mr. A, one of the users, approaches the camera 2 and places his hand on the surface of the screen terminal 5 that forms the display screen (specifically, the front surface of the touch panel 5a), the video of Mr. A's hand is displayed on the display screen of the other user, Mr. B, even though the hand is outside the imaging range of the camera 2. As a result, as shown in FIG. 11, Mr. B can place his hand on Mr. A's hand displayed on the display screen, that is, perform a hand-matching operation. Complementing the video of the hand in this way allows the users to perform the hand-matching operation with each other and further enhances the visual effect of the system S. FIG. 11 is a diagram illustrating the users performing the hand-matching operation.

Next, the process for complementing the video of the legs when the legs are missing will be described. As shown in FIG. 10B, this process starts with the home server 1 determining whether the user's position is changing, that is, whether the user is performing a movement operation (S031). This determination is based on the actual video and on the depth data of the actual video acquired in step S004 illustrated in FIG. 9. When the home server 1 determines that the user is performing a movement operation, it reads the movement model, that is, the skeleton model for walking, from the skeleton models stored in the hard disk drive (S032). On the other hand, when it determines that the user is stationary, the home server 1 reads a skeleton model other than the movement model, that is, the skeleton model for an upright posture (S033).

After that, the home server 1 confirms whether or not the user's whole body video acquired in the past is present in the video stored in the hard disk drive (S034). The home server 1 reads the whole body video when there is a whole body video (S035), and reads the template image of the leg when there is no whole body video (S036).

Further, the home server 1 edits (transforms) the read whole body video or template video so that it matches the skeleton model read in step S032 or S033 (S037). The complementary video of the legs is thereby generated. Then, the home server 1 adjusts the display size of the complementary video of the legs according to the life size calculated in step S006 illustrated in FIG. 9 (S038). After that, the home server 1 combines the size-adjusted complementary video of the legs with the person video whose size was adjusted in step S007 illustrated in FIG. 9 (S039). As a result, a person video in which the video of the legs is complemented is generated.
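The branching in steps S031 to S036 can be summarized in a few lines. The sketch below only records which skeleton model and which source video would be chosen; the function name and the string labels are hypothetical, and the actual editing of the video in step S037 is not modeled.

```python
from typing import Dict


def plan_leg_complement(is_moving: bool, has_whole_body_video: bool) -> Dict[str, str]:
    """Choose the skeleton model and the source video for the leg complement."""
    # S031 -> S032 (movement model) or S033 (upright skeleton model)
    skeleton = "movement_model" if is_moving else "upright_model"
    # S034 -> S035 (stored whole body video) or S036 (leg template video)
    source = "whole_body_video" if has_whole_body_video else "leg_template_video"
    return {"skeleton": skeleton, "source": source}


# A user walking toward the camera whose whole body was captured earlier:
plan = plan_leg_complement(is_moving=True, has_whole_body_video=True)
```

The two decisions are independent, so all four combinations can occur, for example a stationary user for whom only the leg template video is available.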

Because the video of the legs is complemented as described above, the user's whole body video including the legs can be displayed on the conversation partner's display screen regardless of the distance between the camera 2 and the user. Further, even if the user moves during shooting, the video of the legs can be complemented appropriately using the skeleton model for walking. For example, when the user approaches the camera 2 during shooting as shown in FIG. 12A, a complementary video of the legs is generated using the change pattern of the skeleton model during the movement operation illustrated in FIG. 12B. At this time, the speed of the change pattern of the skeleton model is adjusted according to the speed at which the user moves, and the complementary video of the legs is generated according to the speed-adjusted change pattern. As a result, as shown in FIG. 12C, the video of the legs can be complemented so as to follow the user's movement operation. FIG. 12 is an explanatory diagram of the procedure for complementing the video of the missing part when the missing part arises from a movement operation: FIG. 12A shows the user approaching the camera 2, FIG. 12B shows the change pattern of the skeleton model during walking, and FIG. 12C shows a person video in which the video of the legs is complemented so as to follow the user's movement operation.
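The speed adjustment of the change pattern can be sketched as a simple time scaling of a walking cycle. The description does not state how the adjustment is computed; the cyclic list of poses, the reference walking speed, and the linear scaling used below are assumptions for illustration only.

```python
def pose_index(elapsed_s: float, user_speed_mps: float, reference_speed_mps: float,
               cycle_duration_s: float, n_poses: int) -> int:
    """Pick which pose of the walking change pattern to use at a given time.

    The pattern is assumed to hold n_poses poses covering one walking cycle
    that lasts cycle_duration_s when the user walks at reference_speed_mps.
    A faster user advances through the cycle proportionally faster.
    """
    rate = user_speed_mps / reference_speed_mps
    phase = (elapsed_s * rate / cycle_duration_s) % 1.0   # position in the cycle, 0..1
    return min(int(phase * n_poses), n_poses - 1)


# At the reference speed, half a second into a one-second cycle of 8 poses:
normal = pose_index(0.5, 1.0, 1.0, 1.0, 8)
# At twice the reference speed, the same pose is reached after a quarter second:
fast = pose_index(0.25, 2.0, 1.0, 1.0, 8)
```

Both calls select the same pose, which is the intended effect: the leg motion keeps pace with how quickly the user actually moves.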

<< First modification of this system >>
In the above-described embodiment, a virtual video of the missing part is used as the complementary video for the missing part. However, a virtual video of something other than the missing part can also be used as the complementary video. For example, a virtual video of an object positioned in front of the user in the display video (in other words, the composite video) displayed on the conversation partner's display screen (hereinafter, foreground video) may be used as the complementary video.

When the foreground video is used as the complementary video, the user viewing the display video on the display screen gets the visual impression that the missing part cannot be seen because it is behind the foreground video (in other words, that the missing part is hidden by the foreground video).

The foreground video may be the foreground video FP1 imitating a character, animal, or person illustrated in FIG. 13A, or the foreground video FP2 imitating a wall panel, figurine, or structure illustrated in FIG. 13B. FIGS. 13A and 13B are diagrams showing display videos when the missing part is complemented using a foreground video.

Hereinafter, as a first modification of the system S, a case where a missing portion is complemented with a foreground image will be described. In the following, the contents relating to the first modification will be described focusing on the contents different from the video display process described above.

As shown in FIG. 14, in the procedure for complementing the missing part with the foreground video, the steps from acquiring the video captured by the camera 2 (actual video) to determining whether the person video in the actual video is the user's whole body video are the same as in the above-described embodiment, that is, the same as in the procedure for complementing with a virtual video of the missing part (S041 to S048). FIG. 14 is a diagram showing the flow of the video display processing according to the first modification.

If it determines that the person video is a whole body video, the home server 1 generates a composite video by combining the whole body video with the background video (S049), and transmits video data indicating the generated composite video to the home server 1 on the conversation partner side (S050).

On the other hand, when the person video is not a whole body video, that is, when it is determined that there is a missing part, the home server 1 identifies the missing part and complements the video of the missing part. Here, as described above, the leg portion usually corresponds to the missing portion, but when the user remarkably approaches the camera 2, not only the leg portion but also the hand portion can become the missing portion. Therefore, when the home server 1 specifies the missing portion, it determines whether the hand portion is not included in the missing portion (S051). When the hand part is not included in the missing part, that is, when the missing part is only the leg part, the home server 1 complements the missing part with the foreground image according to the procedure shown in FIG. 15 (S053). FIG. 15 is a diagram illustrating a flow of processing for complementing a missing portion with a foreground image in the first modification.

Specifically, the home server 1 reads the foreground video stored in the hard disk drive (S061), and adjusts the display size of the foreground video according to the life size calculated in step S046 shown in FIG. 14 (S062). Thereafter, the home server 1 combines the size-adjusted foreground video with the person video whose size was adjusted in step S047 illustrated in FIG. 14 (S063). The home server 1 then further combines the background video with the video obtained by combining the foreground video and the person video (S049).

According to the procedure described above, video data indicating a video in which the missing part is complemented with the foreground video is generated and transmitted to the home server 1 on the conversation partner side (S050). Then, when the video data is expanded and the video is displayed on the conversation partner's display screen, the foreground video is displayed in the area of the display screen corresponding to the relative position of the complemented portion of the display video (that is, the portion where the missing part was).

To describe the flow of the video display processing according to the first modification further: when it is determined in step S051 illustrated in FIG. 14 that the missing part includes the hand, the home server 1 complements the missing hand with a virtual video of the hand (S052). Thus, in the first modification, the missing part is basically complemented with the foreground video, but when the hand is included in the missing part, the hand is complemented with a virtual video of the hand.

More specifically, in the first modification, when Mr. A brings his hand (palm) into contact with the front surface of the touch panel 5a constituting the screen terminal 5 on Mr. A's side and Mr. A's hand is consequently outside the imaging range of the camera 2 on Mr. A's side, the hand portion of the missing part is complemented with the virtual video and the remaining portion is complemented with the foreground video. Thus, in the first modification, even when a part of Mr. A's body including the hand is outside the imaging range of the camera 2 on Mr. A's side, the hand is complemented not with the foreground video but with a virtual video of the hand (the video indicated by symbol VP in the figure), as shown in FIG. 16. FIG. 16 is a diagram showing the display video when the missing part is complemented while the hand is missing in the first modification.

Because the hand is complemented with the virtual video of the hand, the first modification also makes it possible for Mr. B to place his hand over Mr. A's hand displayed on the display screen, that is, to perform the hand-matching operation. In the first modification, among the cases in which the video of the hand is missing, the hand is complemented with the virtual video only when Mr. A is performing a predetermined action, specifically, only when the hand is in contact with the front surface of the screen terminal 5 (strictly speaking, the front surface of the touch panel 5a). This is because, to realize the hand-matching operation described above, the hand needs to be complemented with the virtual video at least while the hand is touching the front surface of the screen terminal 5. However, the present invention is not limited to this; when the missing part is complemented with the foreground video and the video of the hand is missing, the hand may be complemented with the virtual video of the hand regardless of the action or posture of the user. Alternatively, even when the hand is included in the missing part, the entire missing part including the hand may be complemented with the foreground video.
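The choice between the virtual video of the hand and the foreground video in the first modification can be summarized as follows. The part names, the `hand_touching_panel` flag, and the returned labels are hypothetical; how the contact with the touch panel 5a is detected is outside the scope of this sketch.

```python
from typing import Dict, Iterable


def choose_complements(missing_parts: Iterable[str],
                       hand_touching_panel: bool) -> Dict[str, str]:
    """Decide, per missing part, which video is used to complement it.

    The hand is complemented with the virtual hand video only while it touches
    the panel; everything else is hidden behind the foreground video.
    """
    plan = {}
    for part in sorted(missing_parts):
        if part == "hand" and hand_touching_panel:
            plan[part] = "virtual_hand_video"
        else:
            plan[part] = "foreground_video"
    return plan


# Mr. A stands close to the screen with a palm on the touch panel:
plan = choose_complements({"leg", "hand"}, hand_touching_panel=True)
```

The two variations mentioned above would correspond to treating the flag as always true, or to ignoring it and returning the foreground video for every part.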

<< Second modification of this system >>
In the first modification described above, the foreground image used is either the foreground image FP1 imitating a character, animal, or person shown in FIG. 13A, or the foreground image FP2 imitating an object such as a wall panel, figurine, or structure shown in FIG. 13B. However, foreground images other than FP1 and FP2 are also conceivable. For example, as shown in FIGS. 17A and 17B, an image of a frame surrounding the actual image (hereinafter referred to as a frame image) may be used as the foreground image. FIGS. 17A and 17B are diagrams showing display images when a missing part is complemented using a frame image.

The frame image is perceived as the outer frame of the glass (for example, a window frame or a door frame) by a user who feels as if he or she is facing the conversation partner through glass. Therefore, using the frame image as the complementary image produces a visual effect in which the user viewing the display image on the display screen perceives the missing part as invisible because it lies behind the outer frame (in other words, the missing part is hidden by the outer frame).

Note that the frame image may be the frame image RP1 illustrated in FIG. 17A, that is, a frame image surrounding all four sides of the display image like an actual window frame or door frame. Alternatively, the frame image RP2 illustrated in FIG. 17B, that is, a frame image that hides only the missing part, may be used.
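The difference between the two kinds of frame image can be illustrated with a small sketch that computes which pixels of the display image each frame would cover. This is a hypothetical illustration only: the function names, the `thickness` parameter, and the rectangular representation of the missing part are assumptions introduced here and do not appear in the embodiment.

```python
from typing import List, Tuple

Mask = List[List[bool]]            # mask[y][x] is True where the frame is drawn
Rect = Tuple[int, int, int, int]   # left, top, right, bottom (right/bottom exclusive)


def frame_mask_rp1(width: int, height: int, thickness: int) -> Mask:
    """RP1-style frame: a border of the given thickness on all four sides."""
    return [
        [
            x < thickness or x >= width - thickness
            or y < thickness or y >= height - thickness
            for x in range(width)
        ]
        for y in range(height)
    ]


def frame_mask_rp2(width: int, height: int, missing: Rect) -> Mask:
    """RP2-style frame: covers only the rectangle where the missing part lies."""
    left, top, right, bottom = missing
    return [
        [left <= x < right and top <= y < bottom for x in range(width)]
        for y in range(height)
    ]
```

In this sketch, an RP1-style frame hides a missing part only when that part lies within the border thickness, whereas an RP2-style frame is placed wherever the missing part is.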

Hereinafter, a case where the missing part is complemented with a frame image will be described as a second modification of the system S. The description below focuses on the points that differ from the video display processing described above.

As shown in FIG. 18, in the procedure for complementing the missing part with the frame image, the steps from acquiring the video captured by the camera 2 (the actual video) through determining whether the person video in the actual video is a whole-body video of the user are the same as those in the embodiment described above (S071 to S078). FIG. 18 is a diagram showing the flow of the video display processing according to the second modification.

If it is determined that the person video is a whole-body video, the home server 1 generates a composite video by combining the whole-body video and the background video (S079), and transmits video data representing the generated composite video to the home server 1 on the conversation partner's side (S080). On the other hand, when the person video is not a whole-body video, that is, when it is determined that there is a missing part, the home server 1 identifies the missing part and complements the video of the missing part with the frame image. More specifically, the home server 1 reads the frame image stored in the hard disk drive (S081) and adjusts the display size of the frame image according to the life size calculated in step S076 (S082). Thereafter, the home server 1 combines the size-adjusted frame image with the person video whose size was adjusted in step S077 (S083). The home server 1 then further combines the background video with the video obtained by combining the frame image and the person video (S079).

Through the procedure described above, video data representing a video in which the missing part is complemented by the frame image is generated and transmitted to the home server 1 on the conversation partner's side (S080). Then, when the video data is expanded and the video is displayed on the display screen on the conversation partner's side, the frame image is displayed in the area of the display screen that corresponds to the relative position of the complemented portion (that is, the portion where the missing part would be) within the display video.
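The branch in the flow of the second modification can be summarized with a short sketch. It only lists, in order, the step labels executed after the whole-body determination; the function name `steps_after_determination` is an assumption introduced here, and the sketch does not perform any actual image processing.

```python
from typing import List


def steps_after_determination(is_whole_body: bool) -> List[str]:
    """Return the step labels executed once steps S071 to S078 have finished.

    Hypothetical sketch of the control flow only:
      S081 read the stored frame image
      S082 adjust the frame image to the life size calculated in S076
      S083 combine the frame image with the size-adjusted person video
      S079 combine with the background video
      S080 transmit the video data to the conversation partner's side
    """
    steps: List[str] = []
    if not is_whole_body:
        # A missing part exists, so the frame image is prepared and combined first.
        steps += ["S081", "S082", "S083"]
    # Both branches end with background compositing and transmission.
    steps += ["S079", "S080"]
    return steps
```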

<< Other Embodiments >>
In the above embodiment, the video display system and the video display method of the present invention have been described with reference to an example. However, the above embodiment is for facilitating understanding of the present invention, and does not limit the present invention. The present invention can be changed and improved without departing from the gist thereof, and the present invention includes the equivalents thereof.

In the above embodiment, the case where two users interact through the system S has been described as an example. However, the present invention is not limited to this, and the number of people who can interact at the same time may be three or more.

Further, in the above embodiment, the missing video is complemented (that is, the real video and the complementary video are combined) by the home server 1 belonging to the same communication unit 100 as the camera 2 that captured the real video. In plain terms, when the video of a part of Mr. A's body is missing, the video is complemented by the home server 1 on Mr. A's side. However, the present invention is not limited to this, and the complementing may be performed by the home server 1 on Mr. B's side.

Further, in the above embodiment, one of three forms is adopted for complementing the missing part: complementing with a virtual video of the missing part, complementing with a foreground video, or complementing with a frame video. However, the system may support all three forms, and the form actually used may be switched freely at the user's request.
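One way to picture switching among the three forms at the user's request is a small settings object, sketched below. This is an assumption-laden illustration, not part of the embodiment: the names `ComplementMode` and `ComplementSettings`, the default mode, and the string-based request format are all hypothetical.

```python
from enum import Enum


class ComplementMode(Enum):
    """The three complementing forms named in the description."""
    VIRTUAL_PART = "virtual video of the missing part"
    FOREGROUND = "foreground video"
    FRAME = "frame video"


class ComplementSettings:
    """Holds the currently selected form (hypothetical helper)."""

    def __init__(self, mode: ComplementMode = ComplementMode.VIRTUAL_PART) -> None:
        self.mode = mode

    def switch(self, requested: str) -> ComplementMode:
        """Switch to the form named by the user, e.g. "frame" or "foreground".

        Raises KeyError if the request does not name one of the three forms.
        """
        self.mode = ComplementMode[requested.upper()]
        return self.mode
```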

In the above embodiment, the videos on which the complementary video is based (for example, the whole-body video, the template video, the foreground video, and the frame video) are stored in the home server 1, more specifically, in its hard disk drive. However, the present invention is not limited to this. The videos may be stored in an apparatus different from the home server 1, for example, an external server connected to the home server 1 through the external communication network GN, and the data may be downloaded from that apparatus.

1 Home server
2 Camera (imaging unit)
3 Microphone
4 Infrared sensor
4a Light emitting unit
4b Light receiving unit
5 Screen terminal
5a Touch panel
6 Speaker
11 Video acquisition unit
12 Human video extraction unit
13 Video storage unit
14 Skeletal information acquisition unit
15 Skeletal information storage unit
16 Identification unit
17 Complementary video generation unit
18 Composite video generation unit
19 Video data transmission unit
20 Video data reception unit
21 Video display unit
100 Communication unit
FP1, FP2 Foreground video
GN External communication network
RP1, RP2 Frame video
S This system (video display system)

Claims

1. A video display system used for displaying a video of a first user on a display screen provided on the side of a second user, using a first unit used by the first user and a second unit used by the second user, who is in a space different from that of the first user, wherein:
the first unit includes an imaging unit that captures a video of a subject within an imaging range;
the second unit includes a video display unit that displays video on the display screen;
one of the first unit and the second unit includes:
a specifying unit that, when the imaging unit captures the video of the first user in a state where a part of the body of the first user as the subject is outside the imaging range, specifies the part; and
a composite video generation unit that generates a composite video by combining the video captured by the imaging unit in the state where the part is outside the imaging range with another video;
the other video is composed of at least one of a virtual video of the part, generated based on video captured by the imaging unit when the part was within the imaging range, and a virtual video of an object located in front of the first user in the composite video; and
the composite video generation unit generates the composite video such that, when the video display unit displays the composite video, the other video is displayed in an area of the display screen corresponding to the relative position of the part with respect to the body.
2. The video display system according to claim 1, wherein the one of the units includes a skeleton information acquisition unit that acquires skeleton information indicating the positional relationship between the head and portions other than the head in the body of the first user, and
the specifying unit specifies the part based on the skeleton information and the video captured by the imaging unit in the state where the part is outside the imaging range.
3. The video display system according to claim 1, wherein, when the imaging unit captures a video of the first user performing a moving operation in a state where the legs of the first user are outside the imaging range, the composite video generation unit generates the composite video using, as the other video, a virtual video of the legs in the middle of the moving operation.
4. The video display system according to any one of claims 1 to 3, wherein the second unit includes a screen forming device that forms the display screen, and
while the display screen is not formed, the screen forming device presents the appearance of a door, a window, or a full-length mirror provided in the room where the second user is present.
5. The video display system according to any one of claims 1 to 4, wherein, when displaying the video of the first user, the video display unit adjusts the display size of the video displayed on the display screen so that the height of the first user as displayed on the display screen matches the actual height of the first user.
6. The video display system according to claim 1, wherein, when the imaging unit captures a video of the first user performing a predetermined operation in a state where a predetermined part of the body of the first user is outside the imaging range, the composite video generation unit generates the composite video using the other video composed of the virtual video of the predetermined part.
7. The video display system according to claim 6, wherein each of the first unit and the second unit includes the imaging unit, the display screen, and the video display unit,
the lens of the imaging unit of the first unit faces the formation surface, on which the display screen is formed, of the screen forming device that forms the display screen, and
when the first user performs an operation of bringing a hand into contact with the formation surface and the hand is outside the imaging range of the imaging unit of the first unit, the composite video generation unit generates the composite video using the other video composed of the virtual video of the hand.
8. A video display method for displaying a video of a first user on a display screen provided on the side of a second user, using a first unit used by the first user and a second unit used by the second user, who is in a space different from that of the first user, wherein:
an imaging unit included in the first unit captures a video of a subject within an imaging range;
a video display unit included in the second unit displays video on the display screen;
when the imaging unit captures the video of the first user in a state where a part of the body of the first user as the subject is outside the imaging range, a specifying unit included in one of the first unit and the second unit specifies the part;
a composite video generation unit included in the one of the units generates a composite video by combining the video captured by the imaging unit in the state where the part is outside the imaging range with another video;
the other video is composed of at least one of a virtual video of the part, generated based on video captured by the imaging unit when the part was within the imaging range, and a virtual video of an object located in front of the first user in the composite video; and
when the composite video generation unit generates the composite video, the composite video is generated such that the other video is displayed in an area of the display screen, on which the video display unit displays the composite video, corresponding to the relative position of the part with respect to the body.