WO2018016316A1 - Image processing device, image processing method, program, and telepresence system - Google Patents

Image processing device, image processing method, program, and telepresence system

Info

Publication number
WO2018016316A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
user
images
unit
hand
Prior art date
Application number
PCT/JP2017/024571
Other languages
French (fr)
Japanese (ja)
Inventor
穎 陸
祐介 阪井
雅人 赤尾
Original Assignee
ソニー株式会社
Priority date
Filing date
Publication date
Application filed by ソニー株式会社
Publication of WO2018016316A1 publication Critical patent/WO2018016316A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working

Definitions

  • The present disclosure relates to an image processing device, an image processing method, a program, and a telepresence system, and more particularly, to an image processing device, an image processing method, a program, and a telepresence system that enable communication with less of a sense of discomfort.
  • In general, in a telepresence system, the imaging device that captures the user on one's own side and the display device that displays the user on the other side are arranged at different positions, so a sense of incongruity, such as the users' lines of sight not meeting, sometimes occurred.
  • Therefore, as disclosed in Patent Document 1, a camera-integrated display device has been proposed in which the users' lines of sight can be matched by imaging with a camera provided on the back side of the display main body through a hole provided in the polarizing plate of the liquid crystal display unit.
  • This disclosure has been made in view of such a situation, and is intended to enable communication without a sense of incongruity.
  • An image processing apparatus according to one aspect of the present disclosure includes: an image acquisition unit that acquires a plurality of images of a user captured from a plurality of viewpoints; an image generation unit that generates, based on the plurality of images, a display image in which a part of the user is appropriately displayed when that part falls outside the angle of view at which the plurality of images are captured; and an image transmission unit that transmits the display image generated by the image generation unit.
  • An image processing method or program according to one aspect of the present disclosure includes steps of: acquiring a plurality of images of a user captured from a plurality of viewpoints; generating, based on the plurality of images, a display image in which a part of the user is appropriately displayed when that part falls outside the angle of view at which the plurality of images are captured; and transmitting the generated display image.
  • A telepresence system according to one aspect of the present disclosure includes: a display device that displays a display image transmitted from a counterpart; a plurality of imaging devices that are arranged around the display device and that capture a user from a plurality of viewpoints; an image acquisition unit that acquires the images captured by each of the plurality of imaging devices; an image generation unit that generates, based on the plurality of images, a display image in which a part of the user is appropriately displayed when that part falls outside the angle of view at which the plurality of images are captured; and an image transmission unit that transmits the display image generated by the image generation unit.
  • In one aspect of the present disclosure, a plurality of images of a user captured from a plurality of viewpoints are acquired, a display image is generated based on the plurality of images such that a part of the user is appropriately displayed when that part falls outside the angle of view at which the plurality of images are captured, and the generated display image is transmitted.
  • FIG. 1 is a perspective view illustrating a schematic configuration of a telepresence system to which the present technology is applied. FIG. 2 is a block diagram showing a configuration example of the first embodiment of an image processing apparatus. FIG. 3 is a flowchart explaining the depth generation process. FIG. 4 is a diagram showing examples of images according to the distance from the display device to the user's hand. FIG. 5 is a diagram explaining a virtual viewpoint. FIG. 6 is a flowchart explaining the base image generation process. FIG. 7 is a diagram explaining a method of estimating the depth at which the 3D model of the hand is placed. FIG. 8 is a diagram explaining another method of estimating the depth at which the 3D model of the hand is placed. FIG. 9 is a diagram showing an image of a base image in which a part of the user's hand is missing. FIG. 10 is a diagram showing the local coordinate system of the 3D model of the hand. FIG. 11 is a flowchart explaining the image composition process. FIG. 12 is a flowchart explaining the processing executed in the image processing apparatus. FIG. 13 is a block diagram showing a configuration example of the second embodiment of the image processing apparatus. FIG. 14 is a diagram showing an image of aligning two point clouds. FIG. 15 is a block diagram showing a configuration example of the third embodiment of the image processing apparatus. FIG. 16 is a block diagram showing a configuration example of the fourth embodiment of the image processing apparatus. FIG. 17 is a block diagram illustrating a configuration example of an embodiment of a computer to which the present technology is applied.
  • FIG. 1 is a perspective view showing a schematic configuration of a telepresence system to which the present technology is applied.
  • As shown in FIG. 1, the telepresence system 11 includes a display device 12, imaging devices 13a to 13d, and an image processing device 14.
  • The telepresence system 11 can provide, for example, a communication experience in which two users at remote locations appear to be facing each other.
  • Hereinafter, the user in front of the display device 12 shown in FIG. 1 is referred to as the own side, and the user displayed on the display device 12 is referred to as the other side.
  • The telepresence system 11 is provided on both the own side and the other side, and the telepresence systems 11 on the own side and the other side can communicate with each other via a network.
  • The display device 12 is connected to a communication device (not shown) that can communicate with the telepresence system 11 on the other side, displays the image transmitted from the telepresence system 11 on the other side, and thereby shows the user on the other side on its screen.
  • The imaging devices 13a to 13d are arranged around the display device 12.
  • The imaging devices 13a to 13d image the user from the viewpoints of their respective arrangement positions, and supply the images (RGB color images) obtained by the imaging to the image processing device 14.
  • In FIG. 1, the four imaging devices 13a to 13d are arranged in a 2×2 layout (two vertically by two horizontally), but the arrangement positions are not limited to the example shown in FIG. 1. As long as the user can be imaged from a plurality of viewpoints, the number of imaging devices 13 may be three or fewer, or five or more.
  • The image processing device 14 uses the four images supplied from the imaging devices 13a to 13d to perform image processing that generates an image of the user viewed from a virtual viewpoint different from the viewpoints of the imaging devices 13a to 13d, and transmits the result to the telepresence system 11 on the other side. For example, during telepresence, the image processing device 14 can set the virtual viewpoint (viewpoint P, described later with reference to FIG. 5) so that the other user does not feel uncomfortable when looking at this user. The detailed configuration of the image processing device 14 will be described with reference to FIG. 2.
  • Here, for example, when the user on the own side and the user on the other side perform a motion of bringing their hands together via the telepresence system 11, it is assumed that the hand approaching the display device 12 falls outside the angle of view that can be captured by the imaging devices 13a to 13d. Therefore, the image processing device 14 performs image processing so that the palms can be brought together at the center of the display device 12 as shown in FIG. 1, allowing this hand-matching communication to be performed without a sense of incongruity.
  • FIG. 2 is a block diagram illustrating a configuration example of the first embodiment of the image processing apparatus 14.
  • As shown in FIG. 2, the image processing apparatus 14 includes an image acquisition unit 21, a depth generation unit 22, an image generation unit 23, and an image transmission unit 24.
  • The image acquisition unit 21 is connected to the imaging devices 13a to 13d in FIG. 1 by wire or wirelessly, acquires the four images of the user captured from the viewpoints of the imaging devices 13a to 13d, and supplies them to the depth generation unit 22.
  • The depth generation unit 22 uses the four images supplied from the image acquisition unit 21 to generate, for each image, a depth that represents the distance at each coordinate in the plane direction of the image, and supplies the depths to the image generation unit 23.
  • For example, the depth generation unit 22 obtains a stereo depth for each image by a stereo matching method using two images that are vertically or horizontally adjacent, and then combines the vertical-direction and horizontal-direction stereo depths of each image, so that the depths for the four images can finally be generated.
  • The image generation unit 23 includes a base image generation unit 31, a data recording unit 32, and an image composition unit 33.
  • When the users of the telepresence system 11 perform communication in which they bring their hands together and the user's hand falls outside the angle of view that can be captured by the imaging devices 13a to 13d, the image generation unit 23 generates an image in which the hand is appropriately displayed on the display device 12 of the telepresence system 11 on the other side.
  • The base image generation unit 31 either selects any one of the four images acquired by the image acquisition unit 21 to be displayed on the display device 12 of the telepresence system 11 on the other side, or generates, as a base image for display on that display device 12, an image of the user viewed from a virtual viewpoint. That is, the base image generation unit 31 can generate a base image in which the user is viewed from a virtual viewpoint, based on the four images acquired by the image acquisition unit 21 and the depths for those four images generated by the depth generation unit 22. The base image generation processing of the base image generation unit 31 will be described later.
  • In the data recording unit 32, a 3D model of a hand, in which the shape of the user's hand is formed in three dimensions and the texture of the hand is pasted onto it, is recorded in advance.
  • For example, since the users of the telepresence system 11 communicate by bringing their hands together, a 3D model that can show the palm of the hand is created in advance so that an image in which the other user's palm is visible can be displayed on each other's display device 12.
  • The 3D model of the hand may be created so as to include, for example, the part from the hand to the elbow rather than only the hand itself, or a hand 3D model used in existing computer graphics may be used.
  • When the user's hand approaches the display device 12 and falls outside the angle of view that can be captured by the imaging devices 13a to 13d, that is, when the hand is so close that a part of it cannot be captured (for example, state C in FIG. 4 described later), the image composition unit 33 combines the user's hand with the base image.
  • Specifically, so that the user's hand is appropriately displayed on the display device 12 of the telepresence system 11 on the other side, the image composition unit 33 composites an image of the user's palm, based on the 3D model of the hand recorded in the data recording unit 32, onto the base image viewed from the virtual viewpoint. The image composition processing of the image composition unit 33 will be described later.
  • The image transmission unit 24 is connected to a communication device (not shown) that can communicate with the telepresence system 11 on the other side via the network, and transmits the image generated by the image generation unit 23 to the other side as a display image to be displayed by the telepresence system 11 there.
  • With the image processing device 14 configured as described above, an image in which the user's hand, based on the 3D model of the hand, is combined with the base image generated from the images captured by the imaging devices 13a to 13d can be displayed on the display device 12 on the other side. Thereby, the telepresence system 11 enables the users to perform the communication of bringing their hands together without a sense of incongruity.
  • Next, the depth generation process executed in the depth generation unit 22 will be described with reference to the flowchart shown in FIG. 3.
  • The process is started when an image a captured by the imaging device 13a, an image b captured by the imaging device 13b, an image c captured by the imaging device 13c, and an image d captured by the imaging device 13d are supplied from the image acquisition unit 21 to the depth generation unit 22.
  • In step S11, the depth generation unit 22 uses the image a captured by the imaging device 13a and the image b captured by the imaging device 13b to calculate, by the stereo matching method, the first stereo depth a1 of the image a and the first stereo depth b1 of the image b.
  • In step S12, the depth generation unit 22 uses the image c captured by the imaging device 13c and the image d captured by the imaging device 13d to calculate, by the stereo matching method, the first stereo depth c1 of the image c and the first stereo depth d1 of the image d.
  • In step S13, the depth generation unit 22 uses the image a captured by the imaging device 13a and the image c captured by the imaging device 13c to calculate, by the stereo matching method, the second stereo depth a2 of the image a and the second stereo depth c2 of the image c.
  • In step S14, the depth generation unit 22 uses the image b captured by the imaging device 13b and the image d captured by the imaging device 13d to calculate, by the stereo matching method, the second stereo depth b2 of the image b and the second stereo depth d2 of the image d.
  • In step S15, the depth generation unit 22 combines the first stereo depth a1 of the image a calculated in step S11 and the second stereo depth a2 of the image a calculated in step S13, thereby obtaining the depth a for the image a. Similarly, the depth generation unit 22 combines the first stereo depth b1 of the image b calculated in step S11 and the second stereo depth b2 of the image b calculated in step S14, thereby obtaining the depth b for the image b, and likewise calculates the depth c for the image c and the depth d for the image d. The depth generation process then ends.
  • In this way, the depth generation unit 22 can generate a depth for each of the four images obtained by capturing the user from the viewpoints of the imaging devices 13a to 13d.
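  • The pairwise stereo matching and depth fusion described above can be sketched briefly in code. The following Python fragment is a minimal illustration, assuming rectified grayscale image pairs and using OpenCV's StereoSGBM matcher; averaging the horizontal-pair and vertical-pair depths where both are valid is one plausible reading of how the two stereo depths are combined, not necessarily the exact method of the disclosure.

```python
import cv2
import numpy as np

def stereo_depth(left_gray, right_gray, focal_px, baseline_m, num_disp=128):
    # Disparity by semi-global block matching (the input images must be rectified).
    matcher = cv2.StereoSGBM_create(minDisparity=0,
                                    numDisparities=num_disp,
                                    blockSize=7)
    disp = matcher.compute(left_gray, right_gray).astype(np.float32) / 16.0
    depth = np.zeros_like(disp)
    valid = disp > 0
    depth[valid] = focal_px * baseline_m / disp[valid]   # Z = f * B / d
    return depth

def fuse_depths(depth_h, depth_v):
    # Combine the horizontal-pair and vertical-pair depths of one image:
    # average where both are valid, otherwise keep whichever is available.
    both = (depth_h > 0) & (depth_v > 0)
    return np.where(both, 0.5 * (depth_h + depth_v),
                    np.where(depth_h > 0, depth_h, depth_v))

# Example for image a (camera 13a): pair it with image b (horizontal pair)
# and with image c (vertical pair); the camera parameters are illustrative.
# depth_a1 = stereo_depth(img_a, img_b, focal_px=1000.0, baseline_m=0.3)
# depth_a2 = stereo_depth(img_a, img_c, focal_px=1000.0, baseline_m=0.3)
# depth_a  = fuse_depths(depth_a1, depth_a2)
```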
  • FIG. 4 shows an example of four images a to d captured by the imaging devices 13a to 13d in three states according to the distance from the display device 12 to the user's hand.
  • The base image generation unit 31 determines whether or not to generate a base image viewed from the virtual viewpoint according to the distance from the display device 12 to the user's hand.
  • For example, when the user's hand is sufficiently far from the display device 12 (hereinafter referred to as state A as appropriate), the base image generation unit 31 determines not to generate a base image, and selects any one of the four images a to d to be displayed on the display device 12 of the telepresence system 11 on the other side.
  • When the user's hand comes closer to the display device 12 and appears at the periphery of the screen (hereinafter referred to as state B), the base image generation unit 31 determines to generate a base image, and generates a base image in which the user is viewed from a virtual viewpoint, based on the four images acquired by the image acquisition unit 21 and the depths for the four images generated by the depth generation unit 22.
  • Further, when the user's hand comes even closer to the center of the display device 12 than in state B and approaches the edge of the angle of view that can be imaged by the imaging devices 13a to 13d, that is, when the distance from the display device 12 to the user's hand becomes less than a predetermined second distance, a part of the user's hand is no longer shown in the four images a to d, as shown in FIG. 4C. Hereinafter, this state, in which the hand is so close that a part of it cannot be imaged, is referred to as state C.
  • In state C, a base image lacking a part of the user's hand would be generated, which would make the user on the other side feel even more uncomfortable than in state B.
  • Therefore, after the base image generation unit 31 generates the base image viewed from the virtual viewpoint, the user's hand is synthesized onto the generated base image so as to compensate for the missing part of the hand.
  • The base image generation unit 31 determines whether the current state is state A, state B, or state C according to the distance from the display device 12 to the user's hand.
  • When the base image generation unit 31 determines that state A has changed to state B, it performs a process of switching the image to be displayed on the display device 12 of the telepresence system 11 on the other side from any one of the images a to d to the base image in which the user is viewed from the virtual viewpoint. Note that the base image generation unit 31 may generate a base image even in state A.
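  • As a simple illustration of how the three states might be distinguished, the sketch below classifies the current frame by comparing the display-to-hand distance against two thresholds; the concrete threshold values and the way the hand distance is measured are assumptions made for illustration, not values given by the disclosure.

```python
from enum import Enum

class HandState(Enum):
    A = "far"        # hand sufficiently far: pass one captured image through
    B = "periphery"  # hand near the screen edge: render the base image
    C = "too_close"  # part of the hand outside the field of view: composite the hand

def classify_hand_state(hand_distance_m, first_distance_m=0.8, second_distance_m=0.3):
    # The thresholds are illustrative only; the disclosure speaks of a "first"
    # and a "second" predetermined distance without fixing their values.
    if hand_distance_m >= first_distance_m:
        return HandState.A
    if hand_distance_m >= second_distance_m:
        return HandState.B
    return HandState.C
```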
  • For example, the global coordinate system uses the center of the display device 12 as the origin O, the direction orthogonal to the surface of the display device 12 as the Z axis, the horizontal direction along the surface of the display device 12 as the X axis, and the vertical direction along the surface of the display device 12 as the Y axis.
  • For example, assuming that the height of the other user is 150 cm and the height of the display device 12 is L, the center of the viewpoint P of the partner user can be set to the coordinates (0, 150 - L/2, -0.5) in the global coordinate system.
  • In addition, the x axis, y axis, and z axis of the local coordinate system of the viewpoint P are set to be parallel to the X axis, Y axis, and Z axis of the global coordinate system, respectively.
  • By using the viewpoint P of the other user as the virtual viewpoint when generating the base image, the base image generation unit 31 can generate a base image that does not make the other user feel uncomfortable even when the user's hand comes very close to the display device 12.
  • For example, the telepresence system 11 can obtain user information that specifies the user's viewpoint position, such as the distance from the display device 12 to the user and the height of the user's viewpoint, based on the images captured by the imaging devices 13a to 13d, and transmit this information over the network at any time. The base image generation unit 31 can then determine the coordinates of the virtual viewpoint P based on the user information received from the other side.
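  • As a small sketch of this viewpoint setting, and assuming that the 150 cm figure refers to the partner user's viewpoint height in centimeters while the offset of -0.5 along the display normal is in meters (the units are not spelled out above), the coordinates of the viewpoint P could be derived from the partner's user information as follows.

```python
def virtual_viewpoint(partner_viewpoint_height_cm, display_height_cm, standoff_m=0.5):
    # Global coordinate system: origin O at the display centre, X horizontal,
    # Y vertical, Z orthogonal to the display surface; the viewpoint sits in
    # front of the display, hence the negative Z value.
    x = 0.0
    y = partner_viewpoint_height_cm - display_height_cm / 2.0
    z = -standoff_m
    return (x, y, z)

# Example matching the text: height 150, display height L  ->  (0, 150 - L/2, -0.5)
```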
  • FIG. 6 is a flowchart illustrating a base image generation process in which the base image generation unit 31 generates a base image.
  • In step S21, the base image generation unit 31 converts the four images acquired by the image acquisition unit 21 and the depths generated by the depth generation unit 22 for these four images into a point cloud in the global coordinate system shown in FIG. 5. Thereby, a single point cloud that represents three-dimensionally, as a set of points, the surface of the user as seen from the side of the imaging devices 13a to 13d is synthesized.
  • In step S22, based on the point cloud in the global coordinate system synthesized in step S21, the base image generation unit 31 generates, as the base image, an image of the user on the own side viewed from the virtual viewpoint P (the other user's viewpoint) shown in FIG. 5.
  • As described above, the base image generation unit 31 can generate an image viewed from the virtual viewpoint P based on the four images obtained by capturing the user from the viewpoints of the imaging devices 13a to 13d and the depths of those images.
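  • A compact sketch of steps S21 and S22, under the simplifying assumptions that each camera is described by a pinhole model with known intrinsics K and a known pose in the global coordinate system, and that the virtual view is rendered by naive point splatting with a z-buffer (the disclosure does not fix a rendering method):

```python
import numpy as np

def depth_to_points(depth, color, K, cam_to_global):
    # Step S21: back-project every valid depth pixel into the global coordinate system.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    pts_global = (cam_to_global @ pts_cam.T).T[:, :3]
    return pts_global, color[valid]

def render_from_viewpoint(points, colors, global_to_P, K_virt, size):
    # Step S22: project the merged point cloud into the virtual viewpoint P
    # and keep the nearest point per pixel (z-buffering).
    h, w = size
    image = np.zeros((h, w, 3), dtype=np.uint8)
    zbuf = np.full((h, w), np.inf)
    pts_P = (global_to_P @ np.c_[points, np.ones(len(points))].T).T[:, :3]
    in_front = pts_P[:, 2] > 0
    for (X, Y, Z), c in zip(pts_P[in_front], colors[in_front]):
        u = int(round(K_virt[0, 0] * X / Z + K_virt[0, 2]))
        v = int(round(K_virt[1, 1] * Y / Z + K_virt[1, 2]))
        if 0 <= u < w and 0 <= v < h and Z < zbuf[v, u]:
            zbuf[v, u] = Z
            image[v, u] = c
    return image
```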
  • In the case of state A described above with reference to FIG. 4, that is, when the user's hand is sufficiently far from the display device 12, the image composition unit 33 outputs the image selected by the base image generation unit 31 as it is. Further, in the case of state B described above with reference to FIG. 4, that is, when the user's hand is visible at the periphery of the screen, the image composition unit 33 outputs the base image generated by the base image generation unit 31 as it is.
  • In the case of state C, on the other hand, the image composition unit 33 performs image composition processing that synthesizes the user's hand onto the base image, using the base image generated by the base image generation unit 31 and the point cloud in the global coordinate system, together with the 3D model of the hand recorded in the data recording unit 32.
  • When combining the user's hand with the base image, the image composition unit 33 first estimates the depth Z0 at which the 3D model of the hand is to be placed.
  • For example, FIG. 7A shows a state in which the user reaches toward the display device 12 in state B, and FIG. 7B shows a state in which the user reaches toward the display device 12 in state C. Assuming that the relative distance from the user's body to the hand does not change when state B changes to state C, the image composition unit 33 can infer the hand depth Z0 from the body depth Zs at the time of state C by referring to the depth difference L1 between the body and the hand obtained at the time of state B.
  • Specifically, the image composition unit 33 detects the region in which the user's body is shown and the region in which the user's hand is shown from the image acquired by the image acquisition unit 21, and calculates the average depth of each region from the depth of that image generated by the depth generation unit 22. For example, learning performed in advance can be used to detect the regions in which the user's body and hand are shown. Then, the image composition unit 33 obtains the depth difference L1 by calculating the difference between the average depth of the body region and the average depth of the hand region, and records it in the data recording unit 32.
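  • A minimal numeric sketch of this inference, assuming the body and hand regions have already been segmented (for example by a detector trained in advance, as suggested above) and that a per-pixel depth map is available:

```python
import numpy as np

def depth_offset_body_to_hand(depth, body_mask, hand_mask):
    # State B: record how far the hand sits in front of the body on average.
    body_z = np.mean(depth[body_mask & (depth > 0)])
    hand_z = np.mean(depth[hand_mask & (depth > 0)])
    return body_z - hand_z          # the depth difference L1 in the text

def estimate_hand_depth(body_depth_state_c, l1):
    # State C: the hand itself can no longer be imaged, so place the 3D hand
    # model at the body depth Zs minus the previously recorded offset L1.
    return body_depth_state_c - l1  # Z0 in the text
```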
  • Alternatively, the depth Z0 at which the 3D model of the hand is placed may be estimated using a depth camera 15 as shown in FIG. 8.
  • In this case, the telepresence system 11 is configured to include a depth camera 15 on the ceiling of the room where the display device 12 is arranged, and the image composition unit 33 can estimate the depth Z0 at which the 3D model of the hand is placed based on the distance from the display device 12 to the user's hand measured by the depth camera 15.
  • FIG. 9 shows an image of the base image generated by the base image generation unit 31 in state C.
  • In state C, a part of the user's hand is outside the angle of view that can be captured by the imaging devices 13a to 13d, so images in which that part of the hand does not appear are acquired, and as a result a base image lacking a part of the hand is generated.
  • Therefore, the image composition unit 33 estimates the center position (u_h, v_h) of the hand region missing from the base image based on the luminance I(u_i, v_i) of the pixels (u_i, v_i) in the base image, as shown in equation (1).
  • Then, the image composition unit 33 projects the 3D model of the hand onto the base image based on the center position (u_h, v_h) of the hand region in the base image generated by the base image generation unit 31 and the depth Z0 of the 3D model of the hand obtained as described above.
  • For example, the local coordinate system T of the 3D model of the hand can be defined as a right-handed coordinate system whose origin (X_T0, Y_T0, Z_T0) is the center of gravity of the hand, whose z axis points in the direction opposite the palm, and whose y axis points in the direction of the middle finger.
  • That is, the 3D model of the hand is projected onto the base image so that the center of gravity of the 3D model of the hand is projected onto the center position (u_h, v_h) of the hand region in the base image viewed from the viewpoint P shown in FIG. 5.
  • Therefore, the image composition unit 33 calculates the coordinates (X_T0, Y_T0, Z_T0) of the center of gravity of the 3D model of the hand in the coordinate system of the viewpoint P based on equation (2), where f in equation (2) is the focal length used when the base image of the viewpoint P is generated.
  • Here, it is assumed that the x, y, and z axes of the local coordinate system T of the 3D model of the hand are parallel to the x, y, and z axes of the coordinate system of the virtual viewpoint P. Thereby, when each point in the local coordinate system T of the 3D model of the hand is converted into the coordinate system of the virtual viewpoint P, no rotation is necessary and only a translation is required.
  • Accordingly, the image composition unit 33 converts the coordinates (X_Ti, Y_Ti, Z_Ti) of each point i in the local coordinate system T of the 3D model of the hand into the coordinates (X_Pi, Y_Pi, Z_Pi) in the coordinate system of the virtual viewpoint P according to equation (3).
  • Then, the image composition unit 33 obtains, by calculating equation (4), the pixel (u_i, v_i) onto which each point i of the 3D model of the hand is projected from the coordinate system of the virtual viewpoint P onto the base image. At this time, the depth of the pixel (u_i, v_i) in the depth map of the base image is Z_Pi.
  • Note that multiple points of the 3D model of the hand may be projected onto the same pixel of the base image; in this case, the point with the smallest depth among them is selected as the one projected onto the base image.
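  • The projection steps referenced by equations (2) to (4) can be sketched as follows, under the assumption that they take their conventional forms: a pinhole back-projection of (u_h, v_h) at depth Z0 for equation (2), a pure translation for equation (3), and a perspective projection with z-buffering for equation (4). The principal point (cu, cv) and the helper names are assumptions introduced for illustration.

```python
import numpy as np

def backproject_center(u_h, v_h, z0, f, cu, cv):
    # Assumed role of equation (2): place the centre of gravity of the hand
    # model at depth Z0 so that it projects onto (u_h, v_h).
    x_t0 = (u_h - cu) * z0 / f
    y_t0 = (v_h - cv) * z0 / f
    return np.array([x_t0, y_t0, z0])

def hand_points_in_viewpoint_P(points_local, center_P):
    # Assumed role of equation (3): the local axes of the hand model are
    # parallel to the viewpoint-P axes, so only a translation is required.
    return points_local + center_P

def project_hand(points_P, colors, f, cu, cv, base_image, base_depth):
    # Assumed role of equation (4): perspective projection of each point i,
    # keeping the nearest point when several points land on the same pixel.
    h, w = base_depth.shape
    out_img, out_depth = base_image.copy(), base_depth.copy()
    for (X, Y, Z), c in zip(points_P, colors):
        if Z <= 0:
            continue
        u = int(round(f * X / Z + cu))
        v = int(round(f * Y / Z + cv))
        if 0 <= u < w and 0 <= v < h and Z < out_depth[v, u]:
            out_depth[v, u] = Z
            out_img[v, u] = c
    return out_img, out_depth
```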
  • FIG. 11 is a flowchart illustrating the image composition process in which the image composition unit 33 composites the user's hand onto the base image.
  • In step S31, as described above with reference to FIG. 7, for example, the image composition unit 33 estimates the depth Z0 at which the 3D model of the hand is to be placed.
  • In step S32, the image composition unit 33 estimates the center position of the hand region missing from the base image generated by the base image generation unit 31, that is, the center position (u_h, v_h) described above.
  • In step S33, the image composition unit 33 projects the 3D model of the hand onto the base image based on the depth Z0 estimated in step S31 and the center position (u_h, v_h) of the hand region in the base image estimated in step S32. Thereby, the image composition unit 33 can generate an image in which the user's hand is visible by compositing the hand onto the base image from which a part of the user's hand was missing.
  • As described above, the image composition unit 33 can perform image composition so that the user's hand is appropriately displayed, based on the depth at which the 3D model of the hand is placed and the center position of the hand region missing from the base image.
  • In the overall processing executed in the image processing apparatus 14 (the flowchart of FIG. 12), in step S41 the image acquisition unit 21 acquires the four images of the user captured from the viewpoints of the imaging devices 13a to 13d and supplies them to the depth generation unit 22.
  • In step S42, the depth generation unit 22 performs the depth generation processing (the flowchart of FIG. 3 described above) that generates depths for the four images supplied from the image acquisition unit 21 in step S41. Then, the depth generation unit 22 supplies the four images and the depth corresponding to each image to the base image generation unit 31 and the image composition unit 33.
  • In step S43, the base image generation unit 31 obtains the distance from the display device 12 to the user's hand based on the depths supplied in step S42, and determines whether the user's hand is far enough away that it does not look unnatural to the other user.
  • When the base image generation unit 31 determines in step S43 that the user's hand is sufficiently far away (state A in FIG. 4), the process proceeds to step S44.
  • In step S44, the base image generation unit 31 selects any one of the four images captured by the imaging devices 13a to 13d as the image to be displayed on the display device 12 of the telepresence system 11 on the other side.
  • On the other hand, when it is determined in step S43 that the user's hand is not sufficiently far away, the process proceeds to step S45, where the base image generation unit 31 performs the base image generation processing (the flowchart of FIG. 6 described above) using the four images captured by the imaging devices 13a to 13d and the depths generated by the depth generation unit 22 in step S42.
  • In step S46, based on the distance from the display device 12 to the user's hand obtained from the depths generated by the depth generation unit 22 in step S42, the image composition unit 33 determines whether or not the hand is so close that a part of it cannot be captured.
  • If the image composition unit 33 determines in step S46 that the hand is so close that a part of it cannot be captured (state C in FIG. 4), the process proceeds to step S47.
  • In step S47, the image composition unit 33 composites the user's hand onto the base image generated in step S45, based on the 3D model of the hand recorded in the data recording unit 32 (the flowchart of FIG. 11 described above).
  • In step S48, the image composition unit 33 supplies the image in which the user's hand has been composited onto the base image to the image transmission unit 24, and the image transmission unit 24 transmits that image.
  • After the processing of step S44, the process also proceeds to step S48, where the image composition unit 33 supplies the image selected by the base image generation unit 31 to the image transmission unit 24, and the image transmission unit 24 transmits that image. Likewise, if it is determined in step S46 that the hand is not so close that a part of it cannot be captured (state B in FIG. 4), the process proceeds to step S48, and the image composition unit 33 supplies the base image generated by the base image generation unit 31 in step S45 to the image transmission unit 24 as it is, and the image transmission unit 24 transmits the base image.
  • After step S48, the process returns to step S41, and the same processing is repeated for the next images captured by the imaging devices 13a to 13d.
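  • Tying the above steps together, a minimal per-frame loop might look like the sketch below; the callables passed in are placeholders for the units described above, and the loop reuses the HandState helper sketched earlier, so this is an illustrative outline rather than the disclosure's literal control flow.

```python
def process_frame(capture_images, generate_depths, hand_distance,
                  select_image, generate_base_image, composite_hand, transmit):
    # Each argument is a callable standing in for one of the units above.
    images = capture_images()                                    # step S41
    depths = generate_depths(images)                             # step S42 (FIG. 3)
    state = classify_hand_state(hand_distance(images, depths))   # step S43

    if state is HandState.A:                                     # step S44: pass-through
        display_image = select_image(images)
    else:                                                        # step S45: base image
        display_image = generate_base_image(images, depths)
        if state is HandState.C:                                 # steps S46/S47: add hand
            display_image = composite_hand(display_image, images, depths)

    transmit(display_image)                                      # step S48
```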
  • As described above, the image processing apparatus 14 can realize processing that allows the users to bring their hands together without a sense of incongruity, based on the images captured by the imaging apparatuses 13a to 13d arranged around the display apparatus 12. Thereby, the telepresence system 11 can provide friendlier communication.
  • FIG. 13 is a block diagram illustrating a configuration example of the second embodiment of the image processing apparatus 14.
  • The same reference numerals are given to the components that are the same as those of the image processing apparatus 14 in FIG. 2, and their detailed descriptions are omitted.
  • Like the image processing apparatus 14 of FIG. 2, the image processing apparatus 14A includes an image acquisition unit 21, a depth generation unit 22, and an image transmission unit 24, and its image generation unit 23A includes a base image generation unit 31, a data recording unit 32, and an image composition unit 33. However, the image processing apparatus 14A differs in configuration from the image processing apparatus 14 in that the base image generation unit 31 supplies the point cloud in the global coordinate system to the data recording unit 32 so that it is recorded there.
  • That is, when it is determined that the user's hand is in state B, in which the hand appears at the periphery of the screen, the base image generation unit 31 supplies the point cloud synthesized from the depths for the four images generated by the depth generation unit 22 to the data recording unit 32.
  • Then, when it is determined that the hand is in state C, in which it is so close that a part of it cannot be captured, the image composition unit 33 aligns the point cloud of state B recorded in the data recording unit 32 with the point cloud of state C, and composites the user's hand onto the base image.
  • FIG. 14 is a diagram showing an image of aligning two point clouds.
  • For example, the image composition unit 33 aligns these two point clouds using a technique such as ICP (Iterative Closest Point), and obtains the portion missing from the state C point cloud from the state B point cloud.
  • Then, the image composition unit 33 projects the state C point cloud, whose hand portion has been complemented from the state B point cloud, onto the base image viewed from the virtual viewpoint P, thereby compositing the user's hand.
  • For example, the image composition unit 33 can place the state B point cloud and the state C point cloud in the coordinate system of the virtual viewpoint P shown in FIG. 5 and calculate the above-described equation (4), whereby the point cloud of the hand portion can be projected onto the base image.
  • As described above, the image generation unit 23A does not use a 3D model of the hand created in advance, but combines the user's hand with the base image by using the state B point cloud from several frames earlier, so that the users can realize communication in which their hands are brought together.
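  • A small sketch of this point-cloud path, assuming the Open3D library is available for the alignment (the text names ICP only as one example technique); the hand-point selection and the final projection reuse the ideas sketched earlier and are not the disclosure's literal procedure.

```python
import numpy as np
import open3d as o3d

def align_state_b_to_c(points_b, points_c, init=np.eye(4), max_dist=0.05):
    # Rigidly align the recorded state-B point cloud to the current state-C
    # point cloud with point-to-point ICP.
    src = o3d.geometry.PointCloud()
    src.points = o3d.utility.Vector3dVector(points_b)
    tgt = o3d.geometry.PointCloud()
    tgt.points = o3d.utility.Vector3dVector(points_c)
    result = o3d.pipelines.registration.registration_icp(
        src, tgt, max_dist, init,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation

def complement_hand(points_b, hand_mask_b, points_c, transform_b_to_c):
    # Carry the hand points observed in state B over into the state-C cloud,
    # which is then projected onto the base image as before.
    hand_b = points_b[hand_mask_b]
    hand_b_h = np.c_[hand_b, np.ones(len(hand_b))]
    hand_in_c = (transform_b_to_c @ hand_b_h.T).T[:, :3]
    return np.vstack([points_c, hand_in_c])
```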
  • FIG. 15 is a block diagram illustrating a configuration example of the third embodiment of the image processing apparatus 14.
  • The same reference numerals are given to the components that are the same as those of the image processing device 14 in FIG. 2, and their detailed descriptions are omitted.
  • The image processing device 14B includes an image acquisition unit 21, a depth generation unit 22, and an image transmission unit 24, as with the image processing device 14 of FIG. 2.
  • The image generation unit 23B of the image processing apparatus 14B includes a user recognition unit 34 in addition to the base image generation unit 31, the data recording unit 32, and the image composition unit 33.
  • In the data recording unit 32, a 3D model of a hand is recorded for each of a plurality of users, and the features of each user are also recorded. For example, a user ID (identification) for identifying each user is set, and the data recording unit 32 supplies the 3D model of the hand corresponding to the user ID specified by the user recognition unit 34 to the image composition unit 33.
  • The user recognition unit 34 detects a feature of the user from the image obtained from the base image generation unit 31, refers to the user features recorded in the data recording unit 32, and specifies to the data recording unit 32 the user ID corresponding to the detected feature. For example, the user recognition unit 34 detects a face in the image using a face detection method, recognizes the user by a face recognition method that compares the facial features appearing in the image with the facial features of each user recorded in the data recording unit 32, and can then specify to the data recording unit 32 the user ID of the user recognized as the same person from the facial features. For the face detection and face recognition methods of the user recognition unit 34, a learning method such as deep learning can be used, for example.
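  • A hedged sketch of this per-user lookup, assuming face embeddings are produced by some recognizer trained in advance (the text only says a learning method such as deep learning can be used) and modelling the data recording unit as a simple in-memory table keyed by user ID:

```python
import numpy as np

class DataRecordingUnit:
    def __init__(self):
        self.records = {}   # user_id -> (face_embedding, hand_3d_model)

    def register(self, user_id, face_embedding, hand_model):
        self.records[user_id] = (np.asarray(face_embedding), hand_model)

    def hand_model_for(self, user_id):
        return self.records[user_id][1]

def recognize_user(face_embedding, recording_unit, threshold=0.6):
    # Return the ID of the registered user whose stored embedding is closest
    # to the detected face, or None if nothing is similar enough.
    best_id, best_dist = None, threshold
    query = np.asarray(face_embedding)
    for user_id, (stored, _) in recording_unit.records.items():
        dist = float(np.linalg.norm(query - stored))
        if dist < best_dist:
            best_id, best_dist = user_id, dist
    return best_id
```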
  • As described above, the image processing apparatus 14B records the 3D hand models of a plurality of users in the data recording unit 32 in advance, and by recognizing the user with the user recognition unit 34, can composite the hand of the recognized user onto the base image.
  • FIG. 16 is a block diagram showing a configuration example of the fourth embodiment of the image processing apparatus 14.
  • The same reference numerals are given to the components that are the same as those of the image processing device 14 in FIG. 2, and their detailed descriptions are omitted.
  • The image processing device 14C includes the image acquisition unit 21, the depth generation unit 22, the image generation unit 23, and the image transmission unit 24, similarly to the image processing device 14 of FIG. 2. Furthermore, the image processing device 14C is configured to include a hand-matching recognition unit 25.
  • The hand-matching recognition unit 25 uses any one of the depths for the four images generated by the depth generation unit 22, together with the image corresponding to that depth, to recognize the user's intention to perform communication in which the users bring their hands together. The hand-matching recognition unit 25 then transmits the recognition result indicating that the user intends to perform such communication to the telepresence system 11 on the other side via a network (not shown).
  • For example, the hand-matching recognition unit 25 recognizes the region in which the hand appears in the image and extracts the depth of the recognized hand region. Then, by referring to the extracted depth of the hand region, the hand-matching recognition unit 25 can recognize that the user intends to perform hand-matching communication when it determines that the user's hand is at or closer than a predetermined distance.
  • Alternatively, the hand-matching recognition unit 25 may record the extracted depth of the hand region for several preceding frames and determine whether or not the user's hand is approaching, based on the depth of the hand region several frames earlier and the depth of the hand region in the current frame. When it determines that the user's hand is approaching, the hand-matching recognition unit 25 can recognize that the user intends to perform hand-matching communication.
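  • As a simple illustration (a sketch under assumed thresholds, not the disclosure's concrete criterion), the intention could be recognized either from an absolute-distance test or from the trend of the hand-region depth over the last few frames:

```python
from collections import deque

class HandMatchingRecognizer:
    def __init__(self, near_threshold_m=0.4, history=5, approach_margin_m=0.05):
        self.near_threshold_m = near_threshold_m
        self.approach_margin_m = approach_margin_m
        self.depth_history = deque(maxlen=history)  # hand-region depth per frame

    def update(self, hand_region_depth_m):
        # Returns True when an intention to bring hands together is recognized.
        self.depth_history.append(hand_region_depth_m)
        # Criterion 1: the hand is already within a predetermined distance.
        if hand_region_depth_m <= self.near_threshold_m:
            return True
        # Criterion 2: the hand has clearly moved closer over the recorded frames.
        if len(self.depth_history) == self.depth_history.maxlen:
            if self.depth_history[0] - hand_region_depth_m > self.approach_margin_m:
                return True
        return False
```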
  • As described above, the image processing apparatus 14C can recognize whether or not the user intends to perform communication in which the users bring their hands together, and when that intention is recognized, image processing suitable for bringing the users' hands together can be performed more reliably.
  • In addition, feedback indicating that such communication can be performed can be provided based on this recognition result.
  • In the above, communication in which the users bring their hands together via the telepresence system 11 has been described as an example, but the telepresence system 11 can also perform image processing other than compositing the user's hand onto the base image. For example, the telepresence system 11 can perform image processing that composites a 3D model of another part, such as the body or the face, onto the corresponding base image.
  • Note that the processes described with reference to the flowcharts above do not necessarily have to be executed in the chronological order described in the flowcharts, and may be executed in parallel or individually (for example, as parallel processing or object-based processing).
  • The program may be processed by a single CPU, or may be processed in a distributed manner by a plurality of CPUs.
  • The above-described series of processing can be executed by hardware or by software.
  • When the series of processing is executed by software, a program constituting the software is installed from a program recording medium onto a computer incorporated in dedicated hardware, or onto, for example, a general-purpose personal computer capable of executing various functions by installing various programs.
  • FIG. 17 is a block diagram showing an example of the hardware configuration of a computer that executes the above-described series of processing by means of a program.
  • In the computer, a CPU (Central Processing Unit) 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, and an EEPROM (Electrically Erasable and Programmable Read Only Memory) 104 are connected to one another via a bus 105.
  • The CPU 101 loads the program stored in the ROM 102 and the EEPROM 104 into the RAM 103 via the bus 105 and executes it, thereby performing the above-described series of processing.
  • A program executed by the computer (CPU 101) can be written into the ROM 102 in advance, or can be installed into or updated in the EEPROM 104 from the outside via the input/output interface 105.
  • In this specification, the term "system" represents an entire apparatus composed of a plurality of apparatuses.
  • Note that the present technology can also take the following configurations.
  • (1) An image processing apparatus including: an image acquisition unit that acquires a plurality of images of a user captured from a plurality of viewpoints; an image generation unit that generates, based on the plurality of images, a display image in which a part of the user is appropriately displayed when that part is outside the angle of view at which the plurality of images are captured; and an image transmission unit that transmits the display image generated by the image generation unit.
  • (2) The image processing device according to (1), further including a depth generation unit that generates, based on the plurality of images, a depth representing the depth in each image, wherein the image generation unit generates the display image based on the plurality of images and the corresponding depths.
  • (3) The image processing device according to (2), wherein the image generation unit includes a base image generation unit that generates a base image, which is an image of the user viewed from a virtual viewpoint different from the plurality of viewpoints, based on the plurality of images and the corresponding depths.
  • (4) The image processing device according to (3), wherein the virtual viewpoint is set to the viewpoint of the counterpart user to whom the image transmission unit transmits the display image.
  • (5) The image processing device according to (3) or (4), wherein a plurality of imaging devices that image the user are arranged around a display device that displays the display image transmitted from the other party to which the image transmission unit transmits the display image, and the base image generation unit generates the base image based on the plurality of images captured by the plurality of imaging devices.
  • (6) The image processing device according to (5), wherein the image transmission unit transmits, as the display image, any one of the plurality of images of the user captured from the plurality of viewpoints.
  • (7) The image processing device according to (5) or (6), wherein the image generation unit includes an image combining unit that combines a part of the user with the base image when the distance from the display device to that part of the user is less than a second distance at which that part does not appear in the plurality of images, and the image transmission unit transmits, as the display image, the image in which the part of the user has been combined with the base image by the image combining unit.
  • (8) The image processing device according to (7), wherein, when the distance from the display device to the part of the user is less than the first distance and greater than or equal to the second distance, the image transmission unit transmits the base image generated by the base image generation unit as the display image.
  • (9) The image processing device according to (7) or (8), wherein the image generation unit further includes a data recording unit that records a 3D model in which the part of the user is formed three-dimensionally, and the image synthesis unit synthesizes the part of the user with the base image using the 3D model recorded in the data recording unit.
  • (10) The image processing device according to any one of (7) to (9), wherein the data recording unit records the point cloud of the user used when the base image is generated while the distance from the display device to the part of the user is less than the first distance and greater than or equal to the second distance, and the image synthesizing unit synthesizes the part of the user with the base image using the point cloud recorded in the data recording unit when the distance from the display device to the part of the user is less than the second distance at which that part is not shown in the plurality of images.
  • (11) The image processing device according to (9), wherein the image generation unit further includes a user recognition unit that recognizes the user shown in the image, the data recording unit records the 3D model for each of a plurality of users, and the image synthesis unit synthesizes the part of the user with the base image using the 3D model corresponding to the user recognized by the user recognition unit.
  • (12) The image processing device according to any one of (1) to (11), wherein communication is performed in which the user's own hand is brought together with the hand of the other user displayed on the display device that displays the display image transmitted from the other party to which the image transmission unit transmits the display image.
  • A telepresence system including: a display device that displays a display image transmitted from the other party; a plurality of imaging devices that are arranged around the display device and image a user from a plurality of viewpoints; an image acquisition unit that acquires the images captured by each of the plurality of imaging devices; an image generation unit that generates, based on the plurality of images, a display image in which a part of the user is appropriately displayed when that part is outside the angle of view at which the plurality of images are captured; and an image transmission unit that transmits the display image generated by the image generation unit.
  • 11 telepresence system, 12 display device, 13a to 13d imaging device, 14 image processing device, 15 depth camera, 21 image acquisition unit, 22 depth generation unit, 23 image generation unit, 24 image transmission unit, 25 hand-matching recognition unit, 31 base image generation unit, 32 data recording unit, 33 image composition unit, 34 user recognition unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present disclosure relates to an image processing device, an image processing method, a program, and a telepresence system which make it possible to achieve communication that feels less unnatural. A plurality of image capture devices for capturing images of a user from a plurality of viewpoints are disposed around a display device for displaying a display image transmitted from a counterpart, and an image acquisition unit acquires images respectively captured by the plurality of image capture devices. On the basis of the plurality of images, an image generation unit generates a display image such that, when a hand of the user falls outside the angle of view at which the plurality of images are captured, that part is appropriately displayed, and an image transmission unit transmits the display image to the counterpart. This technology is applicable, for example, to a telepresence system.

Description

Image processing apparatus, image processing method, program, and telepresence system
The present disclosure relates to an image processing device, an image processing method, a program, and a telepresence system, and more particularly, to an image processing device, an image processing method, a program, and a telepresence system that enable communication with less of a sense of incongruity.
Conventionally, a plurality of users in remote locations use a telepresence system to communicate as if they were facing each other. In general, however, in a telepresence system the imaging device that captures the user on one's own side and the display device that displays the user on the other side are arranged at different positions, so a sense of incongruity, such as the users' lines of sight not meeting, sometimes occurred.
Therefore, as disclosed in Patent Document 1, a camera-integrated display device has been proposed in which the users' lines of sight can be matched by imaging with a camera provided on the back side of the display main body through a hole provided in the polarizing plate of the liquid crystal display unit.
Patent Document 1: JP-A-6-245209
As described above, a sense of incongruity can conventionally occur in communication via telepresence systems, and improvement has therefore been sought. For example, when users try to bring their hands together using a telepresence system and a user moves a hand toward the other user's hand displayed on the display device, the hand is no longer shown on the screen once it falls outside the angle of view that can be captured by the imaging device. For this reason, it has been difficult for users to perform the communication of bringing their hands together without a sense of incongruity. Note that the camera-integrated display device proposed in Patent Document 1 described above has a structure that requires a certain thickness in order to capture images with a camera provided on the back side of the liquid crystal display unit, and therefore could not easily be installed in, for example, a room.
The present disclosure has been made in view of such a situation, and is intended to enable communication with less of a sense of incongruity.
An image processing apparatus according to one aspect of the present disclosure includes: an image acquisition unit that acquires a plurality of images of a user captured from a plurality of viewpoints; an image generation unit that generates, based on the plurality of images, a display image in which a part of the user is appropriately displayed when that part falls outside the angle of view at which the plurality of images are captured; and an image transmission unit that transmits the display image generated by the image generation unit.
An image processing method or program according to one aspect of the present disclosure includes steps of: acquiring a plurality of images of a user captured from a plurality of viewpoints; generating, based on the plurality of images, a display image in which a part of the user is appropriately displayed when that part falls outside the angle of view at which the plurality of images are captured; and transmitting the generated display image.
A telepresence system according to one aspect of the present disclosure includes: a display device that displays a display image transmitted from a counterpart; a plurality of imaging devices that are arranged around the display device and image a user from a plurality of viewpoints; an image acquisition unit that acquires the images captured by each of the plurality of imaging devices; an image generation unit that generates, based on the plurality of images, a display image in which a part of the user is appropriately displayed when that part falls outside the angle of view at which the plurality of images are captured; and an image transmission unit that transmits the display image generated by the image generation unit.
In one aspect of the present disclosure, a plurality of images of a user captured from a plurality of viewpoints are acquired, a display image is generated based on the plurality of images such that a part of the user is appropriately displayed when that part falls outside the angle of view at which the plurality of images are captured, and the generated display image is transmitted.
According to one aspect of the present disclosure, communication with less of a sense of incongruity can be achieved.
FIG. 1 is a perspective view illustrating a schematic configuration of a telepresence system to which the present technology is applied. FIG. 2 is a block diagram showing a configuration example of the first embodiment of an image processing apparatus. FIG. 3 is a flowchart explaining the depth generation process. FIG. 4 is a diagram showing examples of images according to the distance from the display device to the user's hand. FIG. 5 is a diagram explaining a virtual viewpoint. FIG. 6 is a flowchart explaining the base image generation process. FIG. 7 is a diagram explaining a method of estimating the depth at which the 3D model of the hand is placed. FIG. 8 is a diagram explaining another method of estimating the depth at which the 3D model of the hand is placed. FIG. 9 is a diagram showing an image of a base image in which a part of the user's hand is missing. FIG. 10 is a diagram showing the local coordinate system of the 3D model of the hand. FIG. 11 is a flowchart explaining the image composition process. FIG. 12 is a flowchart explaining the processing executed in the image processing apparatus. FIG. 13 is a block diagram showing a configuration example of the second embodiment of the image processing apparatus. FIG. 14 is a diagram showing an image of aligning two point clouds. FIG. 15 is a block diagram showing a configuration example of the third embodiment of the image processing apparatus. FIG. 16 is a block diagram showing a configuration example of the fourth embodiment of the image processing apparatus. FIG. 17 is a block diagram illustrating a configuration example of an embodiment of a computer to which the present technology is applied.
Hereinafter, specific embodiments to which the present technology is applied will be described in detail with reference to the drawings.
<Description of the telepresence system>
FIG. 1 is a perspective view showing a schematic configuration of a telepresence system to which the present technology is applied.
As shown in FIG. 1, the telepresence system 11 includes a display device 12, imaging devices 13a to 13d, and an image processing device 14.
The telepresence system 11 can provide, for example, a communication experience in which two users at remote locations appear to be facing each other. Hereinafter, the user in front of the display device 12 shown in FIG. 1 is referred to as the own side, and the user displayed on the display device 12 is referred to as the other side. The telepresence system 11 is provided on both the own side and the other side, and the telepresence systems 11 on the own side and the other side can communicate with each other via a network.
The display device 12 is connected to a communication device (not shown) that can communicate with the telepresence system 11 on the other side, displays the image transmitted from the telepresence system 11 on the other side, and thereby shows the user on the other side on its screen.
The imaging devices 13a to 13d are arranged around the display device 12, image the user from the viewpoints of their respective arrangement positions, and supply the images (RGB color images) obtained by the imaging to the image processing device 14. Although FIG. 1 shows an arrangement in which the four imaging devices 13a to 13d form a 2×2 layout (two vertically by two horizontally), the arrangement positions are not limited to the example shown in FIG. 1. As long as the user can be imaged from a plurality of viewpoints, the number of imaging devices 13 may be three or fewer, or five or more.
The image processing device 14 uses the four images supplied from the imaging devices 13a to 13d to perform image processing that generates an image of the user viewed from a virtual viewpoint different from the viewpoints of the imaging devices 13a to 13d, and transmits the result to the telepresence system 11 on the other side. For example, during telepresence, the image processing device 14 can set the virtual viewpoint (viewpoint P, described later with reference to FIG. 5) so that the other user does not feel uncomfortable when looking at this user. The detailed configuration of the image processing device 14 will be described with reference to FIG. 2.
Here, for example, when the user on the own side and the user on the other side perform a motion of bringing their hands together via the telepresence system 11, it is assumed that the hand approaching the display device 12 falls outside the angle of view that can be captured by the imaging devices 13a to 13d. Therefore, the image processing device 14 performs image processing so that the palms can be brought together at the center of the display device 12 as shown in FIG. 1, allowing the communication of bringing hands together to be performed without a sense of incongruity.
In the following, the image processing performed by the image processing device 14 when the user on the own side and the user on the other side perform a motion of bringing their hands together via the telepresence system 11, as described here, will be explained.
<Example configuration of image processing device>
FIG. 2 is a block diagram showing a configuration example of the first embodiment of the image processing device 14.
As shown in FIG. 2, the image processing device 14 includes an image acquisition unit 21, a depth generation unit 22, an image generation unit 23, and an image transmission unit 24.
The image acquisition unit 21 is connected to the imaging devices 13a to 13d in FIG. 1 by wire or wirelessly, acquires the four images of the user captured from the viewpoints of the imaging devices 13a to 13d, and supplies them to the depth generation unit 22.
The depth generation unit 22 uses the four images supplied from the image acquisition unit 21 to generate, for each image, a depth that represents the distance at each coordinate in the image plane, and supplies the depths to the image generation unit 23. For example, the depth generation unit 22 can obtain a stereo depth for each image by a stereo matching method using two images that are adjacent vertically or horizontally, and then combine the vertical and horizontal stereo depths of each image to finally generate the depths for the four images.
The image generation unit 23 includes a base image generation unit 31, a data recording unit 32, and an image composition unit 33. For example, when the users of the telepresence systems 11 communicate by bringing their hands together and the own-side user's hand moves outside the angle of view that can be captured by the imaging devices 13a to 13d, the image generation unit 23 generates an image in which that hand is appropriately displayed on the display device 12 of the partner-side telepresence system 11.
When the user's hand is sufficiently far from the display device 12 (for example, state A in FIG. 4 described later), the base image generation unit 31 selects one of the four images acquired by the image acquisition unit 21 as the image to be displayed on the display device 12 of the partner-side telepresence system 11. On the other hand, when the user's hand is not sufficiently far from the display device 12 (for example, state B or state C in FIG. 4 described later), the base image generation unit 31 generates an image of the user viewed from a virtual viewpoint as the base image on which the display on the partner-side display device 12 is based. For example, the base image generation unit 31 can generate the base image of the user viewed from the virtual viewpoint based on the four images acquired by the image acquisition unit 21 and the depths generated for those four images by the depth generation unit 22. The base image generation processing of the base image generation unit 31 will be described later with reference to FIGS. 4 to 6.
In the data recording unit 32, for example, a 3D model of the hand, in which the shape of the own-side user's hand is formed three-dimensionally and the texture of that hand is applied, is recorded in advance. For example, when the users of the telepresence systems 11 communicate by bringing their hands together, each display device 12 shows an image in which the palm of the other user is visible, so a 3D hand model capable of displaying the palm is created in advance. The 3D hand model may cover only the hand, or may be created so as to include, for example, the part from the hand to the elbow. In addition to a hand model registered in advance for each user, a hand model used in existing computer graphics may be used.
For example, when the user's hand has come so close to the display device 12 that part of it is outside the angle of view that can be captured by the imaging devices 13a to 13d and can no longer be imaged (for example, state C in FIG. 4 described later), the image composition unit 33 composites the user's hand onto the base image. For example, the image composition unit 33 composites an image of the user's palm based on the 3D hand model recorded in the data recording unit 32 onto the base image of the user viewed from the virtual viewpoint, so that the user's hand is appropriately displayed on the display device 12 of the partner-side telepresence system 11. The image composition processing of the image composition unit 33 will be described later with reference to FIGS. 7 to 11.
The image transmission unit 24 is connected to a communication device (not shown) that can communicate with the partner-side telepresence system 11 via the network, and transmits the image generated by the image generation unit 23 as the display image to be displayed by the partner-side telepresence system 11.
The image processing device 14 configured in this way can cause each partner-side display device 12 to display an image in which the user's hand, based on the 3D hand model, is composited onto the base image generated from the images captured by the imaging devices 13a to 13d. As a result, the telepresence system 11 allows the users to put their hands together without a sense of incongruity.
<Description of processing of depth generation unit>
The depth generation processing executed in the depth generation unit 22 will be described with reference to the flowchart shown in FIG. 3.
For example, the processing starts when an image a captured by the imaging device 13a, an image b captured by the imaging device 13b, an image c captured by the imaging device 13c, and an image d captured by the imaging device 13d are supplied from the image acquisition unit 21 to the depth generation unit 22.
In step S11, the depth generation unit 22 uses the image a captured by the imaging device 13a and the image b captured by the imaging device 13b to calculate, by stereo matching, a first stereo depth a1 of the image a and a first stereo depth b1 of the image b.
In step S12, the depth generation unit 22 uses the image c captured by the imaging device 13c and the image d captured by the imaging device 13d to calculate, by stereo matching, a first stereo depth c1 of the image c and a first stereo depth d1 of the image d.
In step S13, the depth generation unit 22 uses the image a captured by the imaging device 13a and the image c captured by the imaging device 13c to calculate, by stereo matching, a second stereo depth a2 of the image a and a second stereo depth c2 of the image c.
In step S14, the depth generation unit 22 uses the image b captured by the imaging device 13b and the image d captured by the imaging device 13d to calculate, by stereo matching, a second stereo depth b2 of the image b and a second stereo depth d2 of the image d.
In step S15, the depth generation unit 22 obtains a depth a for the image a by combining the first stereo depth a1 of the image a calculated in step S11 with the second stereo depth a2 of the image a calculated in step S13. Similarly, the depth generation unit 22 obtains a depth b for the image b by combining the first stereo depth b1 of the image b calculated in step S11 with the second stereo depth b2 of the image b calculated in step S14. In the same manner, the depth generation unit 22 calculates a depth c for the image c and a depth d for the image d, and the depth generation processing ends.
As described above, the depth generation unit 22 can generate a depth for each of the four images of the user captured from the viewpoints of the imaging devices 13a to 13d.
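As a rough sketch of steps S11 to S15, each image gets one stereo depth from a horizontal pair and one from a vertical pair, and the two are merged per pixel. The merging rule below (average where both are valid, otherwise keep the valid value) and the stereo matcher itself are assumptions; the source does not specify them.

```python
import numpy as np

def stereo_depth(ref_img, other_img):
    """Placeholder for a stereo matching routine that returns a depth map for
    ref_img, with 0 where no correspondence was found. Not specified by the source."""
    raise NotImplementedError

def combine(depth_1, depth_2):
    """Merge the two stereo depths of one image: average where both are valid,
    otherwise keep whichever value is available."""
    out = np.where(depth_1 > 0, depth_1, depth_2)
    both = (depth_1 > 0) & (depth_2 > 0)
    out[both] = 0.5 * (depth_1[both] + depth_2[both])
    return out

def generate_depths(img_a, img_b, img_c, img_d):
    """Steps S11 to S15: two stereo depths per image, combined into one depth each."""
    a1, b1 = stereo_depth(img_a, img_b), stereo_depth(img_b, img_a)   # step S11
    c1, d1 = stereo_depth(img_c, img_d), stereo_depth(img_d, img_c)   # step S12
    a2, c2 = stereo_depth(img_a, img_c), stereo_depth(img_c, img_a)   # step S13
    b2, d2 = stereo_depth(img_b, img_d), stereo_depth(img_d, img_b)   # step S14
    return (combine(a1, a2), combine(b1, b2),                         # step S15
            combine(c1, c2), combine(d1, d2))
```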
<Description of processing of base image generation unit>
The processing performed in the base image generation unit 31 will be described with reference to FIGS. 4 to 6.
FIG. 4 shows examples of the four images a to d captured by the imaging devices 13a to 13d in three states that depend on the distance from the display device 12 to the user's hand.
The base image generation unit 31 determines, according to the distance from the display device 12 to the user's hand, whether or not to generate a base image of the user viewed from the virtual viewpoint.
For example, when the user's hand is sufficiently far from the display device 12 and the distance from the display device 12 to the user's hand is greater than or equal to a predetermined first distance, the user's hand appears near the center of the screen in each of the four images a to d, as shown in A of FIG. 4. Accordingly, when the user's hand is sufficiently far from the display device 12 (hereinafter referred to as state A as appropriate), the base image generation unit 31 determines not to generate a base image, and selects one of the four images a to d as the image to be displayed on the display device 12 of the partner-side telepresence system 11.
On the other hand, when the user's hand comes closer to the center of the display device 12 than in state A, so that the distance from the display device 12 to the user's hand is no longer sufficiently large and falls below the first distance, the user's hand appears near the periphery of the screen in the four images a to d, as shown in B of FIG. 4. In this state, in which the user's hand appears at the periphery of the screen (hereinafter referred to as state B as appropriate), the image looks unnatural to the partner-side user, who is trying to bring the hands together at the center of the display device 12. Therefore, in state B, the base image generation unit 31 determines to generate a base image, and generates a base image of the user viewed from the virtual viewpoint based on the four images acquired by the image acquisition unit 21 and the depths generated for those four images by the depth generation unit 22.
Further, when the user's hand comes still closer to the center of the display device 12 than in state B, so close that part of the hand is outside the angle of view that can be captured by the imaging devices 13a to 13d and the distance from the display device 12 to the user's hand is less than a predetermined second distance, part of the user's hand no longer appears on the screen in the four images a to d, as shown in C of FIG. 4. In this state, in which the hand is so close that part of it cannot be imaged (hereinafter referred to as state C as appropriate), a base image with part of the user's hand missing would be generated, which looks even more unnatural to the partner-side user than state B. Therefore, in state C, the base image generation unit 31 generates a base image of the user viewed from the virtual viewpoint, and the user's hand is then composited onto the generated base image so as to fill in the missing part of the hand.
In this way, the base image generation unit 31 determines, according to the distance from the display device 12 to the user's hand, whether the current state is state A, state B, or state C. When the base image generation unit 31 determines that the state has changed from state A to state B, it switches the image displayed on the display device 12 of the partner-side telepresence system 11 from one of the images a to d to the base image of the user viewed from the virtual viewpoint. Note that the base image generation unit 31 may also generate a base image in state A.
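A minimal sketch of this three-way decision, assuming the hand distance is measured from the generated depths and that the two thresholds are given as parameters (the threshold values below are illustrative assumptions, not values from the source):

```python
def classify_state(hand_distance_m, first_distance_m=1.0, second_distance_m=0.3):
    """Classify the hand distance into states A, B, and C as described above."""
    if hand_distance_m >= first_distance_m:
        return "A"  # hand far away: forward one captured image as-is
    if hand_distance_m >= second_distance_m:
        return "B"  # hand near the screen: render a base image from the virtual viewpoint
    return "C"      # hand partly outside the field of view: base image plus 3D hand composition
```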
Here, the virtual viewpoint used when generating the base image will be described with reference to FIG. 5.
In state B and state C described above, the viewpoint from which the own-side user can be shown to the partner-side user without discomfort is, if the display device 12 is regarded as a window, the viewpoint from which the partner-side user looks at the own-side user through that window.
Therefore, as shown in FIG. 5, the global coordinate system is set with the center of the display device 12 as the origin O, the direction orthogonal to the surface of the display device 12 as the Z axis, the horizontal direction along the surface of the display device 12 as the X axis, and the vertical direction along the surface of the display device 12 as the Y axis. Here, for example, suppose that the height of the partner-side user is 150 cm and the height of the display device 12 is L. When the partner-side user looks at the center of the display device 12 from a position 0.5 m away from the display device 12 of the telepresence system 11, the center of the viewpoint P of the partner-side user can be set to the coordinates (0, 150 - L/2, -0.5) in the global coordinate system. The x, y, and z axes of the local coordinate system of the viewpoint P are set parallel to the X, Y, and Z axes of the global coordinate system, respectively.
In this way, by using the partner-side user's viewpoint P as the virtual viewpoint when generating the base image, the base image generation unit 31 can generate a base image that does not look unnatural to the partner-side user even when the own-side user's hand is too close to the display device 12.
Note that the telepresence system 11 can, based on the images captured by the imaging devices 13a to 13d, obtain user information that specifies the user's viewpoint position, such as the distance from the display device 12 to the user and the height of the user's viewpoint, and can transmit it via the network as needed. The base image generation unit 31 can then determine the coordinates of the virtual viewpoint P based on the partner-side user information.
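A small sketch of how the viewpoint P could be derived from the partner-side user information. The fallback values (a 150 cm eye height and a 0.5 m viewing distance) follow the example above, and the assumption that the display bottom edge is at floor level (so that the display center sits L/2 above it) is mine, not stated in the source.

```python
def virtual_viewpoint(partner_eye_height_m=None, partner_distance_m=None, display_height_m=1.0):
    """Return the coordinates of the virtual viewpoint P in the global coordinate
    system of FIG. 5 (origin at the display center, Z orthogonal to the screen).
    Falls back to the example values in the text when no partner user
    information has been received."""
    eye_height = partner_eye_height_m if partner_eye_height_m is not None else 1.50
    distance = partner_distance_m if partner_distance_m is not None else 0.5
    y = eye_height - display_height_m / 2.0   # eye height above the display center (assumption)
    return (0.0, y, -distance)
```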
FIG. 6 is a flowchart explaining the base image generation processing in which the base image generation unit 31 generates the base image.
In step S21, the base image generation unit 31 converts the four images acquired by the image acquisition unit 21 and the depths generated for those images by the depth generation unit 22 into a point cloud in the global coordinate system shown in FIG. 5. As a result, a single point cloud is synthesized that represents, as a set of points, the surface of the user as seen from the imaging devices 13a to 13d.
In step S22, the base image generation unit 31 generates, as the base image, an image of the own-side user viewed from the virtual viewpoint P shown in FIG. 5 (the partner-side user's viewpoint), based on the global-coordinate-system point cloud synthesized in step S21.
As described above, the base image generation unit 31 can generate the base image viewed from the virtual viewpoint P based on the four images of the user captured from the viewpoints of the imaging devices 13a to 13d and the depths corresponding to those images.
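The following sketch outlines steps S21 and S22 under a pinhole camera model. The camera intrinsics, the camera-to-global extrinsics, and the simple point-splatting renderer are assumptions; the source only states that the depths are fused into one global point cloud and re-projected from the viewpoint P.

```python
import numpy as np

def depth_to_global_points(depth, color, K, R, t):
    """Back-project one depth map into the global coordinate system of FIG. 5.
    K: 3x3 intrinsics; R, t: camera-to-global rotation and translation (assumed
    known from calibration). Returns (N, 3) points and (N, 3) colors."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z], axis=1)
    return pts_cam @ R.T + t, color[valid]

def render_base_image(points, colors, viewpoint_p, f, width, height):
    """Step S22: project the fused point cloud onto an image plane at the virtual
    viewpoint P, keeping the nearest point per pixel (z-buffer)."""
    rel = points - np.asarray(viewpoint_p)     # viewpoint axes parallel to the global axes
    z = rel[:, 2]
    front = z > 0
    u = (f * rel[front, 0] / z[front] + width / 2).astype(int)
    v = (f * rel[front, 1] / z[front] + height / 2).astype(int)
    image = np.zeros((height, width, 3), dtype=np.uint8)
    zbuf = np.full((height, width), np.inf)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    for ui, vi, zi, ci in zip(u[inside], v[inside], z[front][inside], colors[front][inside]):
        if zi < zbuf[vi, ui]:
            zbuf[vi, ui] = zi
            image[vi, ui] = ci
    return image, zbuf
```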
<Description of processing of image composition unit>
Next, the processing performed in the image composition unit 33 will be described with reference to FIGS. 7 to 11.
In the case of state A described above with reference to FIG. 4, that is, when the user's hand is sufficiently far from the display device 12, the image composition unit 33 outputs the image selected by the base image generation unit 31 as it is. In the case of state B described above with reference to FIG. 4, that is, when the user's hand is visible although it is at the periphery of the screen, the image composition unit 33 outputs the base image generated by the base image generation unit 31 as it is.
In contrast, in the case of state C described above with reference to FIG. 4, that is, when the hand is so close that part of it cannot be imaged, the image composition unit 33 composites the user's hand onto the base image in which part of the hand is missing, and outputs the result. For example, the image composition unit 33 performs image composition processing that composites the user's hand onto the base image using the base image and the global-coordinate-system point cloud generated by the base image generation unit 31, together with the 3D hand model recorded in the data recording unit 32.
When compositing the user's hand onto the base image, the image composition unit 33 first estimates a depth Z0 at which the 3D hand model is to be placed.
A method of estimating the depth Z0 of the 3D hand model will be described with reference to FIG. 7.
A of FIG. 7 shows the user reaching toward the display device 12 in state B, and B of FIG. 7 shows the user reaching toward the display device 12 in state C. Assuming that the relative distance from the user's body to the hand does not change when state B changes to state C, the image composition unit 33 can estimate the hand depth Z0 from the body depth Zs at the time of state C by referring to the depth difference L1 between the body and the hand at the time of state B.
That is, while in state B, the image composition unit 33 detects the region in which the user's body appears and the region in which the user's hand appears from the image acquired by the image acquisition unit 21, and calculates the average depth of each region from the depth generated for that image by the depth generation unit 22. For example, learning performed in advance can be used to detect the region in which the user's body or hand appears. The image composition unit 33 then obtains the depth difference L1 by calculating the difference between the average depth of the region in which the user's body appears and the average depth of the region in which the user's hand appears, and records it in the data recording unit 32.
Thereafter, when state B changes to state C, the image composition unit 33 detects the region in which the user's body appears from the image acquired by the image acquisition unit 21, and calculates the average depth Zs of that region from the depth generated for the image by the depth generation unit 22. The image composition unit 33 then reads the depth difference L1 from the data recording unit 32 and subtracts it from the calculated average depth Zs, thereby obtaining the depth Z0 (Z0 = Zs - L1) of the user's hand that is now outside the angle of view that can be captured by the imaging devices 13a to 13d.
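A compact sketch of this estimate; the body and hand region masks are stand-ins for detectors that, as the source notes, can be learned in advance.

```python
def record_body_hand_offset(depth, body_mask, hand_mask, store):
    """While in state B: remember the average body-to-hand depth difference L1."""
    store["L1"] = depth[body_mask].mean() - depth[hand_mask].mean()

def estimate_hand_depth(depth, body_mask, store):
    """In state C: infer the depth Z0 of the now-invisible hand from the body depth Zs."""
    zs = depth[body_mask].mean()
    return zs - store["L1"]   # Z0 = Zs - L1
```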
Note that the depth Z0 of the 3D hand model may also be estimated using, for example, a depth camera 15 as shown in FIG. 8.
In the example shown in FIG. 8, the telepresence system 11 includes the depth camera 15 on the ceiling above the room in which the display device 12 is placed. The image composition unit 33 can therefore estimate the depth Z0 at which the 3D hand model is to be placed based on the distance from the display device 12 to the user's hand measured by the depth camera 15.
FIG. 9 shows an example of the base image generated by the base image generation unit 31 in state C.
As described above, in state C, part of the user's hand is outside the angle of view that can be captured by the imaging devices 13a to 13d and images in which part of the hand is not shown are acquired, so a base image in which part of the user's hand is missing is generated.
Therefore, the image composition unit 33 estimates the center position (uh, vh) of the missing hand region in the base image from the luminance I(ui, vi) of the pixels (ui, vi) of the base image, as expressed by the following equation (1).
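Equation (1) appears only as an image in this text. A plausible form, reading it as a luminance-weighted centroid over the pixels i of the hand region (an assumption not confirmed by the source), is:

u_h = ( Σ_i u_i · I(u_i, v_i) ) / ( Σ_i I(u_i, v_i) ),   v_h = ( Σ_i v_i · I(u_i, v_i) ) / ( Σ_i I(u_i, v_i) )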
The image composition unit 33 then projects the 3D hand model onto the base image based on the center position (uh, vh) of the hand region in the base image generated by the base image generation unit 31 and the depth Z0 of the 3D hand model obtained as described above with reference to FIG. 7.
As shown in FIG. 10, the local coordinate system T of the 3D hand model can be defined, for example, as a right-handed coordinate system whose origin (XT0, YT0, ZT0) is the center of gravity of the hand, whose z axis points in the direction opposite to the palm, and whose y axis points in the direction of the middle finger. The entire 3D hand model is then projected onto the base image so that the center of gravity of the model is projected onto the center position (uh, vh) of the hand region in the base image viewed from the viewpoint P shown in FIG. 5.
The image composition unit 33 calculates the coordinates (XT0, YT0, ZT0) of the center of gravity of the 3D hand model in the coordinate system of the viewpoint P based on the following equation (2).
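Equation (2) also appears only as an image. Under a pinhole model with focal length f, with (u_h, v_h) measured from the principal point, and ignoring any offset between the display plane and the viewpoint P (all of which are assumptions), the back-projection of (u_h, v_h) at depth Z_0 would be:

X_T0 = u_h · Z_0 / f,   Y_T0 = v_h · Z_0 / f,   Z_T0 = Z_0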
Note that f in equation (2) is the focal length used when generating the base image of the viewpoint P.
Here, when the user tries to align his or her own hand with the hand of the partner-side user displayed on the display device 12, it is assumed that the x, y, and z axes of the local coordinate system T of the 3D hand model are parallel to the x, y, and z axes of the coordinate system of the virtual viewpoint P. Accordingly, when each point in the local coordinate system T of the 3D hand model is converted into the coordinate system of the virtual viewpoint P, no rotation is required and only a translation needs to be applied.
Then, for each point i of the 3D hand model, the image composition unit 33 converts its coordinates (XTi, YTi, ZTi) in the local coordinate system T of the 3D hand model into coordinates (XPi, YPi, ZPi) in the coordinate system of the virtual viewpoint P according to the following equation (3).
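Equation (3) is likewise given only as an image. Since only a translation by the centroid coordinates is required, it is presumably the rigid shift:

X_Pi = X_Ti + X_T0,   Y_Pi = Y_Ti + Y_T0,   Z_Pi = Z_Ti + Z_T0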
Further, the image composition unit 33 obtains the pixel (ui, vi) at which each point i of the 3D hand model is projected from the coordinate system of the virtual viewpoint P onto the base image, by calculating the following equation (4).
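Equation (4) appears only as an image as well; under the same pinhole assumption as above it would be the perspective projection:

u_i = f · X_Pi / Z_Pi,   v_i = f · Y_Pi / Z_Pi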
Here, according to equation (4), the depth of the pixel (ui, vi) in the depth of the base image becomes ZPi. Note that a plurality of points of the 3D hand model may be projected onto the same pixel of the base image; in that case, the point with the smallest depth among them is selected as the one projected onto the base image.
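Putting equations (2) to (4) together, a sketch of the hand-model projection might look like the following. It reuses the z-buffer produced when the base image was rendered (an assumption) so that model points behind already-drawn geometry are discarded, and it assumes (u_h, v_h) are measured from the image center, matching equation (2).

```python
import numpy as np

def composite_hand(base_image, base_zbuf, hand_points, hand_colors, u_h, v_h, z0, f):
    """Project the 3D hand model into the base image around the estimated hand center.
    hand_points are in the local coordinate system T (origin at the hand centroid)."""
    h, w = base_zbuf.shape
    # Equation (2): centroid of the model in the viewpoint-P coordinate system.
    t0 = np.array([u_h * z0 / f, v_h * z0 / f, z0])
    # Equation (3): translation only, since the axes of T and P are assumed parallel.
    pts_p = hand_points + t0
    # Equation (4): perspective projection; the nearest point wins per pixel.
    out, zbuf = base_image.copy(), base_zbuf.copy()
    for (x, y, z), c in zip(pts_p, hand_colors):
        if z <= 0:
            continue
        u, v = int(f * x / z + w / 2), int(f * y / z + h / 2)
        if 0 <= u < w and 0 <= v < h and z < zbuf[v, u]:
            zbuf[v, u] = z
            out[v, u] = c
    return out
```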
FIG. 11 is a flowchart explaining the image composition processing in which the image composition unit 33 composites the user's hand onto the base image.
In step S31, the image composition unit 33 estimates the depth Z0 at which the 3D hand model is to be placed, for example as described above with reference to FIG. 7.
In step S32, the image composition unit 33 estimates the center position of the missing hand region in the base image generated by the base image generation unit 31, that is, the center position (uh, vh) shown in FIG. 9 described above.
In step S33, the image composition unit 33 projects the 3D hand model onto the base image based on the depth Z0 estimated in step S31 and the center position (uh, vh) of the hand region in the base image estimated in step S32. In this way, the image composition unit 33 can composite the hand onto the base image in which part of the user's hand is missing, and generate an image in which the user's hand is visible.
As described above, the image composition unit 33 can perform image composition so that the user's hand is appropriately displayed, based on the depth at which the 3D hand model is placed and the center position of the missing hand region in the base image.
<Description of processing of image processing device>
The processing executed in the image processing device 14 will be described with reference to the flowchart shown in FIG. 12.
For example, the processing starts when the telepresence system 11 is activated. In step S41, the image acquisition unit 21 acquires the four images of the user captured from the viewpoints of the imaging devices 13a to 13d and supplies them to the depth generation unit 22.
In step S42, the depth generation unit 22 performs the depth generation processing (the flowchart of FIG. 3 described above) that generates the depths for the four images supplied from the image acquisition unit 21 in step S41. The depth generation unit 22 then supplies the four images and the depth corresponding to each image to the base image generation unit 31 and the image composition unit 33.
In step S43, the base image generation unit 31 obtains the distance from the display device 12 to the user's hand based on the depths supplied in step S42, and determines whether the user's hand is far enough away that it does not look unnatural to the partner-side user.
If the base image generation unit 31 determines in step S43 that the user's hand is sufficiently far away (state A in FIG. 4), the processing proceeds to step S44. In step S44, the base image generation unit 31 selects one of the four images captured by the imaging devices 13a to 13d as the image to be displayed on the display device 12 of the partner-side telepresence system 11.
On the other hand, if the base image generation unit 31 determines in step S43 that the user's hand is not sufficiently far away (state B or state C in FIG. 4), the processing proceeds to step S45. In step S45, the base image generation unit 31 performs the base image generation processing (the flowchart of FIG. 6 described above) that generates the base image using the four images captured by the imaging devices 13a to 13d and the depths generated by the depth generation unit 22 in step S42.
After the base image generation processing in step S45, the processing proceeds to step S46. In step S46, the image composition unit 33 determines, based on the distance from the display device 12 to the user's hand obtained from the depths generated by the depth generation unit 22 in step S42, whether the hand is so close that part of it cannot be imaged.
If the image composition unit 33 determines in step S46 that the hand is so close that part of it cannot be imaged (state C in FIG. 4), the processing proceeds to step S47. In step S47, the image composition unit 33 performs the image composition processing (the flowchart of FIG. 11 described above) that composites the user's hand onto the base image generated in step S45 based on the 3D hand model recorded in the data recording unit 32.
After the image composition processing in step S47, the processing proceeds to step S48, where the image composition unit 33 supplies the image in which the user's hand is composited onto the base image to the image transmission unit 24, and the image transmission unit 24 transmits that image. Meanwhile, after the processing of step S44, the processing proceeds to step S48, where the image composition unit 33 supplies the image selected by the base image generation unit 31 to the image transmission unit 24, and the image transmission unit 24 transmits that image. If it is determined in step S46 that the hand is not so close that part of it cannot be imaged (state B in FIG. 4), the processing proceeds to step S48, where the image composition unit 33 supplies the base image generated by the base image generation unit 31 in step S45 to the image transmission unit 24 as it is, and the image transmission unit 24 transmits that base image.
After the processing of step S48, the processing returns to step S41, and the same processing is repeated for the next images captured by the imaging devices 13a to 13d.
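Tying steps S41 to S48 together, one frame of the pipeline might be organized as follows. The helper functions correspond to the sketches above, the names written in capitals are assumed configuration values, and measure_hand_distance, fuse_point_cloud, estimate_missing_hand_center, and detect_body are assumed helpers rather than components named by the source.

```python
def process_frame(images, store, transmit):
    """One iteration of the loop in FIG. 12 (steps S41 to S48), using the sketches above."""
    depths = generate_depths(*images)                               # step S42
    hand_distance = measure_hand_distance(images, depths)           # used in steps S43 / S46
    state = classify_state(hand_distance)
    if state == "A":
        display_image = images[0]                                   # step S44: one captured image as-is
    else:
        points, colors = fuse_point_cloud(images, depths)           # step S45 (S21)
        display_image, zbuf = render_base_image(points, colors, VIEWPOINT_P, F, W, H)  # step S45 (S22)
        if state == "C":                                            # step S47
            u_h, v_h = estimate_missing_hand_center(display_image)  # equation (1)
            z0 = estimate_hand_depth(depths[0], detect_body(images[0]), store)  # FIG. 7
            display_image = composite_hand(display_image, zbuf,
                                           HAND_POINTS, HAND_COLORS, u_h, v_h, z0, F)
    transmit(display_image)                                          # step S48
```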
As described above, based on the images captured by the imaging devices 13a to 13d arranged around the display device 12, the image processing device 14 can realize image processing that allows the users to put their hands together without a sense of incongruity. The telepresence system 11 can thereby provide friendlier communication.
<Second configuration example of image processing apparatus>
FIG. 13 is a block diagram showing a configuration example of the second embodiment of the image processing device 14. In the image processing device 14A shown in FIG. 13, configurations common to the image processing device 14 of FIG. 2 are denoted by the same reference numerals, and their detailed description is omitted.
Like the image processing device 14 of FIG. 2, the image processing device 14A includes an image acquisition unit 21, a depth generation unit 22, and an image transmission unit 24, and its image generation unit 23A includes a base image generation unit 31, a data recording unit 32, and an image composition unit 33.
The image processing device 14A differs from the image processing device 14 of FIG. 2 in that, in the image generation unit 23A, the base image generation unit 31 supplies the global-coordinate-system point cloud to the data recording unit 32 to be recorded.
For example, in the image generation unit 23A, when the base image generation unit 31 determines that the user's hand is in state B, in which it appears at the periphery of the screen, it supplies the point cloud synthesized from the depths generated by the depth generation unit 22 for the four images to the data recording unit 32. Then, when the image composition unit 33 determines that the hand is in state C, in which it is so close that part of it cannot be imaged, it aligns the state B point cloud recorded in the data recording unit 32 with the state C point cloud and composites the user's hand onto the base image.
FIG. 14 shows an example of aligning two point clouds.
For example, in the state B point cloud shown at the upper left of FIG. 14, the user's hand is present, whereas in the state C point cloud shown at the lower left of FIG. 14, the part corresponding to the user's hand is missing.
Therefore, the image composition unit 33 aligns these two point clouds using, for example, a technique such as ICP (Iterative Closest Point), and obtains the part missing from the state C point cloud from the state B point cloud. In this case, the point clouds are aligned on the assumption that, when state B changes to state C, the hand hardly moves relative to the user's body and only the distance to the display device 12 changes.
Thereafter, as described above, the image composition unit 33 projects the state C point cloud, whose hand part has been completed from the state B point cloud, onto the base image of the user viewed from the virtual viewpoint P, thereby compositing the user's hand. For example, the image composition unit 33 can place the state B point cloud and the state C point cloud in the coordinate system of the virtual viewpoint P as shown in FIG. 5 and project the hand part of the point cloud onto the base image by calculating equation (4) described above.
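A minimal sketch of this completion step, assuming a basic point-to-point ICP (nearest neighbours via a k-d tree, rigid transform by SVD); libraries such as Open3D provide equivalent registration routines, and the hand segmentation of the state B cloud is assumed to be available from the earlier region detection.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_align(source, target, iterations=20):
    """Rigidly align `source` (state B cloud) to `target` (state C cloud)."""
    R, t = np.eye(3), np.zeros(3)
    src = source.copy()
    tree = cKDTree(target)
    for _ in range(iterations):
        _, idx = tree.query(src)
        matched = target[idx]
        mu_s, mu_t = src.mean(0), matched.mean(0)
        U, _, Vt = np.linalg.svd((src - mu_s).T @ (matched - mu_t))
        R_step = Vt.T @ U.T
        if np.linalg.det(R_step) < 0:      # keep a proper rotation
            Vt[-1] *= -1
            R_step = Vt.T @ U.T
        t_step = mu_t - R_step @ mu_s
        src = src @ R_step.T + t_step
        R, t = R_step @ R, R_step @ t + t_step
    return R, t

def complete_hand(cloud_c, cloud_b, hand_mask_b):
    """Fill the missing hand region of the state C cloud with the aligned
    hand points of the state B cloud."""
    R, t = icp_align(cloud_b, cloud_c)
    hand_points = cloud_b[hand_mask_b] @ R.T + t
    return np.vstack([cloud_c, hand_points])
```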
In this way, the image generation unit 23A does not use a pre-created 3D hand model but instead uses the state B point cloud from several frames earlier to composite the user's hand onto the base image, thereby realizing communication in which the users put their hands together.
<Third configuration example of image processing apparatus>
FIG. 15 is a block diagram showing a configuration example of the third embodiment of the image processing device 14. In the image processing device 14B shown in FIG. 15, configurations common to the image processing device 14 of FIG. 2 are denoted by the same reference numerals, and their detailed description is omitted.
That is, like the image processing device 14 of FIG. 2, the image processing device 14B includes an image acquisition unit 21, a depth generation unit 22, and an image transmission unit 24. The image generation unit 23B of the image processing device 14B includes a user recognition unit 34 in addition to the base image generation unit 31, the data recording unit 32, and the image composition unit 33.
The data recording unit 32 records a 3D hand model for each of a plurality of users, together with the characteristics of each user. A user ID (identification) for identifying each user is also set, and the data recording unit 32 supplies the 3D hand model corresponding to the user ID specified by the user recognition unit 34 to the image composition unit 33.
The user recognition unit 34 detects the user's characteristics based on the image obtained from the base image generation unit 31, refers to the user characteristics recorded in the data recording unit 32, and specifies the user ID corresponding to the detected characteristics to the data recording unit 32. For example, the user recognition unit 34 detects a face using a face detection method as the user characteristic, and recognizes the user by a face recognition method using the facial features shown in the image and the facial features of each user recorded in the data recording unit 32. The user recognition unit 34 can then specify, to the data recording unit 32, the user ID of the user recognized as the same person from the facial features. For the face detection and face recognition methods of the user recognition unit 34, a learning technique such as deep learning can be used, for example.
In this way, by recording the 3D hand models of a plurality of users in the data recording unit 32 in advance and recognizing the user with the user recognition unit 34, the image processing device 14B can composite the hand of each of the plurality of users onto the base image.
<Fourth configuration example of image processing apparatus>
FIG. 16 is a block diagram showing a configuration example of the fourth embodiment of the image processing device 14. In the image processing device 14C shown in FIG. 16, configurations common to the image processing device 14 of FIG. 2 are denoted by the same reference numerals, and their detailed description is omitted.
That is, like the image processing device 14 of FIG. 2, the image processing device 14C includes an image acquisition unit 21, a depth generation unit 22, an image generation unit 23, and an image transmission unit 24. The image processing device 14C further includes a hand-matching recognition unit 25.
Using any one of the depths generated by the depth generation unit 22 for the four images and the image corresponding to that depth, the hand-matching recognition unit 25 recognizes the user's intention to perform the communication of putting the hands together. The hand-matching recognition unit 25 then transmits the recognition result to the partner-side telepresence system 11 via a network (not shown).
For example, the hand-matching recognition unit 25 recognizes the region in which a hand appears in the image and extracts the depth of the recognized hand region. Then, referring to the extracted depth of the hand region, the hand-matching recognition unit 25 can recognize that the user intends to perform the hand-matching communication when it determines that the user's hand is within a predetermined distance.
Alternatively, for example, the hand-matching recognition unit 25 records the depth of the extracted hand region from several frames earlier, and determines whether the user's hand is approaching based on the depth of the hand region several frames earlier and the depth of the hand region in the current frame. When it determines that the user's hand is approaching, the hand-matching recognition unit 25 can recognize that the user intends to perform the hand-matching communication.
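A small sketch combining the two criteria above (absolute distance and approach over a few frames); the threshold value, the history length, and the hand-region detection feeding this class are assumptions.

```python
from collections import deque

class HandMatchingRecognizer:
    """Recognize the intention to put hands together from the hand-region depth."""
    def __init__(self, near_threshold_m=0.4, history=5):
        self.near_threshold_m = near_threshold_m
        self.recent = deque(maxlen=history)

    def update(self, hand_depth_m):
        """hand_depth_m: average depth of the detected hand region in the current frame."""
        approaching = bool(self.recent) and hand_depth_m < min(self.recent)
        self.recent.append(hand_depth_m)
        # Intention is recognized when the hand is close enough or steadily approaching.
        return hand_depth_m <= self.near_threshold_m or approaching
```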
In this way, the image processing device 14C can recognize whether or not the user intends to perform the communication of putting the hands together, and when it recognizes that intention, it can more reliably perform image processing that brings the users' hands together.
Furthermore, the image processing device 14 can give feedback to this communication, for example by outputting a specific sound or producing a specific visual effect at the moment the users' hands meet through the telepresence system 11.
In the present embodiment, communication in which the users put their hands together through the telepresence system 11 has been described as an example, but the telepresence system 11 can also perform image processing other than compositing the user's hand onto the base image. For example, when the user's body, face, or the like moves outside the angle of view that can be captured by the imaging devices 13a to 13d, the telepresence system 11 can perform image processing that composites a corresponding 3D model of the body, face, or the like onto the base image.
Note that the processes described with reference to the above flowcharts do not necessarily have to be performed in time series in the order described in the flowcharts, and they include processes executed in parallel or individually (for example, parallel processing or object-based processing). The program may be processed by a single CPU, or may be processed in a distributed manner by a plurality of CPUs.
The series of processes described above (the image processing method) can be executed by hardware or by software. When the series of processes is executed by software, the program constituting the software is installed, from a program recording medium on which the program is recorded, into a computer incorporated in dedicated hardware, or into, for example, a general-purpose personal computer capable of executing various functions by installing various programs.
FIG. 17 is a block diagram showing a configuration example of the hardware of a computer that executes the above-described series of processes by a program.
In the computer, a CPU (Central Processing Unit) 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, and an EEPROM (Electronically Erasable and Programmable Read Only Memory) 104 are connected to one another by a bus 105. An input/output interface 106 is further connected to the bus 105, and the input/output interface 106 is connected to the outside (for example, the imaging devices 13a to 13d in FIG. 1 or a communication device, not shown).
In the computer configured as described above, the CPU 101 loads, for example, the programs stored in the ROM 102 and the EEPROM 104 into the RAM 103 via the bus 105 and executes them, whereby the above-described series of processes is performed. The programs executed by the computer (CPU 101) can be written into the ROM 102 in advance, or can be installed into or updated in the EEPROM 104 from the outside via the input/output interface 106.
In this specification, a system refers to an entire apparatus composed of a plurality of devices.
 Note that the present technology can also take the following configurations.
(1)
An image processing device including:
an image acquisition unit that acquires a plurality of images of a user captured from a plurality of viewpoints;
an image generation unit that, on the basis of the plurality of images, generates a display image in which a part of the user is displayed appropriately when that part goes outside the angle of view at which the plurality of images are captured; and
an image transmission unit that transmits the display image generated by the image generation unit.
(2)
The image processing device according to (1), further including:
a depth generation unit that generates, on the basis of the plurality of images, a depth representing the depth in each of the images,
in which the image generation unit generates the display image on the basis of the plurality of images and the depths respectively corresponding to them.
(3)
The image processing device according to (2), in which the image generation unit has a base image generation unit that generates, on the basis of the plurality of images and the depths respectively corresponding to them, a base image that is an image of the user viewed from a virtual viewpoint different from the plurality of viewpoints.
(4)
The image processing device according to (3), in which the virtual viewpoint is set to the viewpoint of the counterpart user to whom the image transmission unit transmits the display image.
(5)
The image processing device according to (3) or (4), in which a plurality of imaging devices that capture the user are arranged around a display device that displays a display image transmitted from the counterpart to whom the image transmission unit transmits the display image, and
the base image generation unit generates the base image when the distance from the display device to the part of the user is less than a first distance at which the part of the user appears at approximately the center of each of the plurality of images.
(6)
The image processing device according to (5), in which, when the distance from the display device to the part of the user is equal to or greater than the first distance, the image transmission unit transmits, as the display image, any one of the plurality of images of the user captured from the plurality of viewpoints.
(7)
The image processing device according to (5) or (6), in which the image generation unit further has an image composition unit that, when the distance from the display device to the part of the user is less than a second distance at which the part of the user no longer appears in the plurality of images, composites onto the base image the part of the user that has gone outside the angle of view at which the plurality of images are captured, and
the image transmission unit transmits, as the display image, the image in which the part of the user has been composited onto the base image by the image composition unit.
(8)
The image processing device according to (7), in which, when the distance from the display device to the part of the user is less than the first distance and equal to or greater than the second distance, the image transmission unit transmits, as the display image, the base image generated by the base image generation unit.
(9)
The image processing device according to (7) or (8), in which the image generation unit further has a data recording unit that records a 3D model in which the part of the user is formed three-dimensionally, and
the image composition unit composites the part of the user onto the base image using the 3D model recorded in the data recording unit.
(10)
The image processing device according to any one of (7) to (9), in which the data recording unit records the point cloud of the user that the base image generation unit used to generate the base image when the distance from the display device to the part of the user is less than the first distance and equal to or greater than the second distance, and
the image composition unit composites the part of the user onto the base image using the point cloud recorded in the data recording unit when the distance from the display device to the part of the user becomes less than the second distance at which the part of the user no longer appears in the plurality of images.
(11)
The image processing device according to (9), in which the image generation unit further has a user recognition unit that recognizes the user shown in the images,
the data recording unit records the 3D model for each of a plurality of users, and
the image composition unit composites the part of the user onto the base image using the 3D model corresponding to the user recognized by the user recognition unit.
(12)
The image processing device according to any one of (1) to (11), further including a communication recognition unit that recognizes that communication is performed in which the hand of the user on the own side is brought together with the hand of the counterpart user displayed on a display device that displays a display image transmitted from the counterpart to whom the image transmission unit transmits the display image.
(13)
An image processing method including the steps of:
acquiring a plurality of images of a user captured from a plurality of viewpoints;
generating, on the basis of the plurality of images, a display image in which a part of the user is displayed appropriately when that part goes outside the angle of view at which the plurality of images are captured; and
transmitting the generated display image.
(14)
A program for causing a computer to execute processing including the steps of:
acquiring a plurality of images of a user captured from a plurality of viewpoints;
generating, on the basis of the plurality of images, a display image in which a part of the user is displayed appropriately when that part goes outside the angle of view at which the plurality of images are captured; and
transmitting the generated display image.
(15)
A telepresence system including:
a display device that displays a display image transmitted from a counterpart;
a plurality of imaging devices that are arranged around the display device and capture a user from a plurality of viewpoints;
an image acquisition unit that acquires the images captured by each of the plurality of imaging devices;
an image generation unit that, on the basis of the plurality of images, generates a display image in which a part of the user is displayed appropriately when that part goes outside the angle of view at which the plurality of images are captured; and
an image transmission unit that transmits the display image generated by the image generation unit.
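 Configurations (5) to (10) above amount to a three-way switch on the distance from the display device to the user's hand. The following Python sketch is a minimal illustration of that decision logic only; the function names, the dictionary used for recording, and the numeric thresholds are all assumptions introduced for explanation and are not defined by the disclosure.

```python
# Illustrative sketch of the distance-based switching in configurations (5) to (10).
# Every name and threshold value here is an assumption made for explanation only.

FIRST_DISTANCE = 0.50   # assumed value: hand appears near the center of every captured image
SECOND_DISTANCE = 0.10  # assumed value: hand is about to leave every camera's angle of view


def select_display_image(captured_images, depths, hand_distance,
                         generate_base_image, composite_hand, recorded_data):
    """Choose the display image to transmit, following configurations (5) to (10).

    captured_images -- images of the user taken from the multiple viewpoints
    depths          -- depth maps corresponding to the captured images
    hand_distance   -- distance from the display device to the user's hand
    """
    if hand_distance >= FIRST_DISTANCE:
        # (6) The hand is comfortably inside every camera's view:
        # transmit one of the captured images as the display image.
        return captured_images[0]

    # (5) The hand is close to the display device: generate a base image that
    # views the user from a virtual viewpoint (per configuration (4), set to
    # the counterpart user's viewpoint).
    base_image = generate_base_image(captured_images, depths)

    if hand_distance >= SECOND_DISTANCE:
        # (8) The hand is still visible: transmit the base image itself; per (10),
        # the data used to build it (standing in for the point cloud) is recorded.
        recorded_data["point_cloud"] = (captured_images, depths)
        return base_image

    # (7)/(10) The hand has gone outside the cameras' angle of view: composite the
    # missing part onto the base image from the recorded data.
    return composite_hand(base_image, recorded_data.get("point_cloud"))
```

In terms of the reference signs listed below, the generate_base_image and composite_hand callables correspond roughly to the base image generation unit 31 and the image composition unit 33.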
 Note that the present embodiment is not limited to the embodiment described above, and various modifications can be made without departing from the gist of the present disclosure.
 11 telepresence system, 12 display device, 13a to 13d imaging device, 14 image processing device, 15 depth camera, 21 image acquisition unit, 22 depth generation unit, 23 image generation unit, 24 image transmission unit, 25 hand-matching recognition unit, 31 base image generation unit, 32 data recording unit, 33 image composition unit, 34 user recognition unit

Claims (15)

  1.  An image processing device comprising:
     an image acquisition unit that acquires a plurality of images of a user captured from a plurality of viewpoints;
     an image generation unit that, on the basis of the plurality of images, generates a display image in which a part of the user is displayed appropriately when that part goes outside the angle of view at which the plurality of images are captured; and
     an image transmission unit that transmits the display image generated by the image generation unit.
  2.  The image processing device according to claim 1, further comprising:
     a depth generation unit that generates, on the basis of the plurality of images, a depth representing the depth in each of the images,
     wherein the image generation unit generates the display image on the basis of the plurality of images and the depths respectively corresponding to them.
  3.  The image processing device according to claim 2, wherein the image generation unit has a base image generation unit that generates, on the basis of the plurality of images and the depths respectively corresponding to them, a base image that is an image of the user viewed from a virtual viewpoint different from the plurality of viewpoints.
  4.  The image processing device according to claim 3, wherein the virtual viewpoint is set to the viewpoint of the counterpart user to whom the image transmission unit transmits the display image.
  5.  The image processing device according to claim 3, wherein a plurality of imaging devices that capture the user are arranged around a display device that displays a display image transmitted from the counterpart to whom the image transmission unit transmits the display image, and
     the base image generation unit generates the base image when the distance from the display device to the part of the user is less than a first distance at which the part of the user appears at approximately the center of each of the plurality of images.
  6.  The image processing device according to claim 5, wherein, when the distance from the display device to the part of the user is equal to or greater than the first distance, the image transmission unit transmits, as the display image, any one of the plurality of images of the user captured from the plurality of viewpoints.
  7.  The image processing device according to claim 5, wherein the image generation unit further has an image composition unit that, when the distance from the display device to the part of the user is less than a second distance at which the part of the user no longer appears in the plurality of images, composites onto the base image the part of the user that has gone outside the angle of view at which the plurality of images are captured, and
     the image transmission unit transmits, as the display image, the image in which the part of the user has been composited onto the base image by the image composition unit.
  8.  The image processing device according to claim 7, wherein, when the distance from the display device to the part of the user is less than the first distance and equal to or greater than the second distance, the image transmission unit transmits, as the display image, the base image generated by the base image generation unit.
  9.  The image processing device according to claim 7, wherein the image generation unit further has a data recording unit that records a 3D model in which the part of the user is formed three-dimensionally, and
     the image composition unit composites the part of the user onto the base image using the 3D model recorded in the data recording unit.
  10.  The image processing device according to claim 7, wherein the data recording unit records the point cloud of the user that the base image generation unit used to generate the base image when the distance from the display device to the part of the user is less than the first distance and equal to or greater than the second distance, and
     the image composition unit composites the part of the user onto the base image using the point cloud recorded in the data recording unit when the distance from the display device to the part of the user becomes less than the second distance at which the part of the user no longer appears in the plurality of images.
  11.  The image processing device according to claim 9, wherein the image generation unit further has a user recognition unit that recognizes the user shown in the images,
     the data recording unit records the 3D model for each of a plurality of users, and
     the image composition unit composites the part of the user onto the base image using the 3D model corresponding to the user recognized by the user recognition unit.
  12.  The image processing device according to claim 1, further comprising a communication recognition unit that recognizes that communication is performed in which the hand of the user on the own side is brought together with the hand of the counterpart user displayed on a display device that displays a display image transmitted from the counterpart to whom the image transmission unit transmits the display image.
  13.  An image processing method comprising the steps of:
     acquiring a plurality of images of a user captured from a plurality of viewpoints;
     generating, on the basis of the plurality of images, a display image in which a part of the user is displayed appropriately when that part goes outside the angle of view at which the plurality of images are captured; and
     transmitting the generated display image.
  14.  A program for causing a computer to execute processing comprising the steps of:
     acquiring a plurality of images of a user captured from a plurality of viewpoints;
     generating, on the basis of the plurality of images, a display image in which a part of the user is displayed appropriately when that part goes outside the angle of view at which the plurality of images are captured; and
     transmitting the generated display image.
  15.  A telepresence system comprising:
     a display device that displays a display image transmitted from a counterpart;
     a plurality of imaging devices that are arranged around the display device and capture a user from a plurality of viewpoints;
     an image acquisition unit that acquires the images captured by each of the plurality of imaging devices;
     an image generation unit that, on the basis of the plurality of images, generates a display image in which a part of the user is displayed appropriately when that part goes outside the angle of view at which the plurality of images are captured; and
     an image transmission unit that transmits the display image generated by the image generation unit.
PCT/JP2017/024571 2016-07-19 2017-07-05 Image processing device, image processing method, program, and telepresence system WO2018016316A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2016141553 2016-07-19
JP2016-141553 2016-07-19

Publications (1)

Publication Number Publication Date
WO2018016316A1 true WO2018016316A1 (en) 2018-01-25

Family

ID=60992195

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2017/024571 WO2018016316A1 (en) 2016-07-19 2017-07-05 Image processing device, image processing method, program, and telepresence system

Country Status (1)

Country Link
WO (1) WO2018016316A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08123938A (en) * 1994-10-20 1996-05-17 Ohbayashi Corp Monitor device for remote control
US20100149310A1 (en) * 2008-12-17 2010-06-17 Microsoft Corporation Visual feedback for natural head positioning
JP2011138314A (en) * 2009-12-28 2011-07-14 Sharp Corp Image processor
JP2013025528A (en) * 2011-07-20 2013-02-04 Nissan Motor Co Ltd Image generation device for vehicles and image generation method for vehicles
JP2014056466A (en) * 2012-09-13 2014-03-27 Canon Inc Image processing device and method


Similar Documents

Publication Publication Date Title
US11887234B2 (en) Avatar display device, avatar generating device, and program
JP4553362B2 (en) System, image processing apparatus, and information processing method
WO2017141511A1 (en) Information processing apparatus, information processing system, information processing method, and program
JP2011090400A (en) Image display device, method, and program
JP5237234B2 (en) Video communication system and video communication method
JP4144492B2 (en) Image display device
JP5833526B2 (en) Video communication system and video communication method
EP2661077A1 (en) System and method for eye alignment in video
JP2008140271A (en) Interactive device and method thereof
JP5478357B2 (en) Display device and display method
CN108702482A (en) Information processing equipment, information processing system, information processing method and program
JP5731462B2 (en) Video communication system and video communication method
KR20150009789A (en) Digilog space generator for tele-collaboration in an augmented reality environment and digilog space generation method using the same
JP2011097447A (en) Communication system
TW201021546A (en) Interactive 3D image display method and related 3D display apparatus
JP2011113206A (en) System and method for video image communication
JP5712737B2 (en) Display control apparatus, display control method, and program
JP5759439B2 (en) Video communication system and video communication method
WO2018016316A1 (en) Image processing device, image processing method, program, and telepresence system
JP5833525B2 (en) Video communication system and video communication method
JP5898036B2 (en) Video communication system and video communication method
WO2018173205A1 (en) Information processing system, method for controlling same, and program
JP2015184986A (en) Compound sense of reality sharing device
JP5647813B2 (en) Video presentation system, program, and recording medium
JP5485102B2 (en) COMMUNICATION DEVICE, COMMUNICATION METHOD, AND PROGRAM

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17830836

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17830836

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP