WO2018016316A1 - Image processing device, image processing method, program, and telepresence system - Google Patents

Image processing device, image processing method, program, and telepresence system

Info

Publication number
WO2018016316A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
user
images
unit
hand
Prior art date
Application number
PCT/JP2017/024571
Other languages
French (fr)
Japanese (ja)
Inventor
穎 陸
祐介 阪井
雅人 赤尾
Original Assignee
ソニー株式会社
Priority date
Filing date
Publication date
Application filed by ソニー株式会社
Publication of WO2018016316A1 publication Critical patent/WO2018016316A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working

Definitions

  • The present disclosure relates to an image processing device, an image processing method, a program, and a telepresence system, and more particularly, to an image processing device, an image processing method, a program, and a telepresence system that enable communication with less of a sense of discomfort.
  • In general, in a telepresence system, the imaging device that captures the user on one's own side and the display device that displays the user on the other side are arranged at different positions, so a sense of incongruity, such as the users' lines of sight not meeting, sometimes occurred.
  • Therefore, as disclosed in Patent Document 1, a camera-integrated display device has been proposed in which the users' lines of sight can be matched by imaging with a camera provided on the back side of the display main body through a hole provided in the polarizing plate of the liquid crystal display unit.
  • This disclosure has been made in view of such a situation, and is intended to enable communication without a sense of incongruity.
  • An image processing apparatus according to one aspect of the present disclosure includes: an image acquisition unit that acquires a plurality of images of a user captured from a plurality of viewpoints; an image generation unit that generates, based on the plurality of images, a display image in which a part of the user is appropriately displayed when that part falls outside the angle of view at which the plurality of images are captured; and an image transmission unit that transmits the display image generated by the image generation unit.
  • An image processing method or program according to one aspect of the present disclosure includes steps of: acquiring a plurality of images of a user captured from a plurality of viewpoints; generating, based on the plurality of images, a display image in which a part of the user is appropriately displayed when that part falls outside the angle of view at which the plurality of images are captured; and transmitting the generated display image.
  • A telepresence system according to one aspect of the present disclosure includes: a display device that displays a display image transmitted from a counterpart; a plurality of imaging devices that are arranged around the display device and that capture a user from a plurality of viewpoints; an image acquisition unit that acquires the images captured by each of the plurality of imaging devices; an image generation unit that generates, based on the plurality of images, a display image in which a part of the user is appropriately displayed when that part falls outside the angle of view at which the plurality of images are captured; and an image transmission unit that transmits the display image generated by the image generation unit.
  • In one aspect of the present disclosure, a plurality of images of a user captured from a plurality of viewpoints are acquired, a display image is generated based on the plurality of images such that a part of the user is appropriately displayed when that part falls outside the angle of view at which the plurality of images are captured, and the generated display image is transmitted.
  • FIG. 1 is a perspective view illustrating a schematic configuration of a telepresence system to which the present technology is applied. FIG. 2 is a block diagram showing a configuration example of the first embodiment of an image processing apparatus. FIG. 3 is a flowchart explaining the depth generation process. FIG. 4 is a diagram showing examples of images according to the distance from the display device to the user's hand. FIG. 5 is a diagram explaining a virtual viewpoint. FIG. 6 is a flowchart explaining the base image generation process. FIG. 7 is a diagram explaining a method of estimating the depth at which the 3D model of the hand is placed. FIG. 8 is a diagram explaining another method of estimating the depth at which the 3D model of the hand is placed. FIG. 9 is a diagram showing an image of a base image in which a part of the user's hand is missing. FIG. 10 is a diagram showing the local coordinate system of the 3D model of the hand. FIG. 11 is a flowchart explaining the image composition process. FIG. 12 is a flowchart explaining the processing executed in the image processing apparatus. FIG. 13 is a block diagram showing a configuration example of the second embodiment of the image processing apparatus. FIG. 14 is a diagram showing an image of aligning two point clouds. FIG. 15 is a block diagram showing a configuration example of the third embodiment of the image processing apparatus. FIG. 16 is a block diagram showing a configuration example of the fourth embodiment of the image processing apparatus. FIG. 17 is a block diagram illustrating a configuration example of an embodiment of a computer to which the present technology is applied.
  • FIG. 1 is a perspective view showing a schematic configuration of a telepresence system to which the present technology is applied.
  • As shown in FIG. 1, the telepresence system 11 includes a display device 12, imaging devices 13a to 13d, and an image processing device 14.
  • The telepresence system 11 can provide, for example, a communication experience in which two users at remote locations appear to be facing each other.
  • Hereinafter, the user in front of the display device 12 shown in FIG. 1 is referred to as the own side, and the user displayed on the display device 12 is referred to as the other side.
  • The telepresence system 11 is provided on both the own side and the other side, and the telepresence systems 11 on the own side and the other side can communicate with each other via a network.
  • The display device 12 is connected to a communication device (not shown) that can communicate with the telepresence system 11 on the other side, displays the image transmitted from the telepresence system 11 on the other side, and thereby shows the user on the other side on its screen.
  • The imaging devices 13a to 13d are arranged around the display device 12.
  • The imaging devices 13a to 13d image the user from the viewpoints of their respective arrangement positions, and supply the images (RGB color images) obtained by the imaging to the image processing device 14.
  • In FIG. 1, the four imaging devices 13a to 13d are arranged in a 2×2 layout (two vertically by two horizontally), but the arrangement positions are not limited to the example shown in FIG. 1. As long as the user can be imaged from a plurality of viewpoints, the number of imaging devices 13 may be three or fewer, or five or more.
  • The image processing device 14 uses the four images supplied from the imaging devices 13a to 13d to perform image processing that generates an image of the user viewed from a virtual viewpoint different from the viewpoints of the imaging devices 13a to 13d, and transmits the result to the telepresence system 11 on the other side. For example, during telepresence, the image processing device 14 can set the virtual viewpoint (viewpoint P, described later with reference to FIG. 5) so that the other user does not feel uncomfortable when looking at this user. The detailed configuration of the image processing device 14 will be described with reference to FIG. 2.
  • Here, for example, when the user on the own side and the user on the other side perform a motion of bringing their hands together via the telepresence system 11, it is assumed that the hand approaching the display device 12 falls outside the angle of view that can be captured by the imaging devices 13a to 13d. Therefore, the image processing device 14 performs image processing so that the palms can be brought together at the center of the display device 12 as shown in FIG. 1, allowing this hand-matching communication to be performed without a sense of incongruity.
  • FIG. 2 is a block diagram illustrating a configuration example of the first embodiment of the image processing apparatus 14.
  • As shown in FIG. 2, the image processing apparatus 14 includes an image acquisition unit 21, a depth generation unit 22, an image generation unit 23, and an image transmission unit 24.
  • The image acquisition unit 21 is connected to the imaging devices 13a to 13d in FIG. 1 by wire or wirelessly, acquires the four images of the user captured from the viewpoints of the imaging devices 13a to 13d, and supplies them to the depth generation unit 22.
  • The depth generation unit 22 uses the four images supplied from the image acquisition unit 21 to generate, for each image, a depth that represents the distance at each coordinate in the plane direction of the image, and supplies the depths to the image generation unit 23.
  • For example, the depth generation unit 22 obtains a stereo depth for each image by a stereo matching method using two images that are vertically or horizontally adjacent, and then combines the vertical-direction and horizontal-direction stereo depths of each image, so that the depths for the four images can finally be generated.
  • The image generation unit 23 includes a base image generation unit 31, a data recording unit 32, and an image composition unit 33.
  • When the users of the telepresence system 11 perform communication in which they bring their hands together and the user's hand falls outside the angle of view that can be captured by the imaging devices 13a to 13d, the image generation unit 23 generates an image in which the hand is appropriately displayed on the display device 12 of the telepresence system 11 on the other side.
  • The base image generation unit 31 either selects any one of the four images acquired by the image acquisition unit 21 to be displayed on the display device 12 of the telepresence system 11 on the other side, or generates, as a base image for display on that display device 12, an image of the user viewed from a virtual viewpoint. That is, the base image generation unit 31 can generate a base image in which the user is viewed from a virtual viewpoint, based on the four images acquired by the image acquisition unit 21 and the depths for those four images generated by the depth generation unit 22. The base image generation processing of the base image generation unit 31 will be described later.
  • In the data recording unit 32, a 3D model of a hand, in which the shape of the user's hand is formed in three dimensions and the texture of the hand is pasted onto it, is recorded in advance.
  • For example, since the users of the telepresence system 11 communicate by bringing their hands together, a 3D model that can show the palm of the hand is created in advance so that an image in which the other user's palm is visible can be displayed on each other's display device 12.
  • The 3D model of the hand may be created so as to include, for example, the part from the hand to the elbow rather than only the hand itself, or a hand 3D model used in existing computer graphics may be used.
  • When the user's hand approaches the display device 12 and falls outside the angle of view that can be captured by the imaging devices 13a to 13d, that is, when the hand is so close that a part of it cannot be captured (for example, state C in FIG. 4 described later), the image composition unit 33 combines the user's hand with the base image.
  • Specifically, so that the user's hand is appropriately displayed on the display device 12 of the telepresence system 11 on the other side, the image composition unit 33 composites an image of the user's palm, based on the 3D model of the hand recorded in the data recording unit 32, onto the base image viewed from the virtual viewpoint. The image composition processing of the image composition unit 33 will be described later.
  • The image transmission unit 24 is connected to a communication device (not shown) that can communicate with the telepresence system 11 on the other side via the network, and transmits the image generated by the image generation unit 23 to the other side as a display image to be displayed by the telepresence system 11 there.
  • With the image processing device 14 configured as described above, an image in which the user's hand, based on the 3D model of the hand, is combined with the base image generated from the images captured by the imaging devices 13a to 13d can be displayed on the display device 12 on the other side. Thereby, the telepresence system 11 enables the users to perform the communication of bringing their hands together without a sense of incongruity.
  • Next, the depth generation process executed in the depth generation unit 22 will be described with reference to the flowchart shown in FIG. 3.
  • The process is started when an image a captured by the imaging device 13a, an image b captured by the imaging device 13b, an image c captured by the imaging device 13c, and an image d captured by the imaging device 13d are supplied from the image acquisition unit 21 to the depth generation unit 22.
  • In step S11, the depth generation unit 22 uses the image a captured by the imaging device 13a and the image b captured by the imaging device 13b to calculate, by the stereo matching method, the first stereo depth a1 of the image a and the first stereo depth b1 of the image b.
  • In step S12, the depth generation unit 22 uses the image c captured by the imaging device 13c and the image d captured by the imaging device 13d to calculate, by the stereo matching method, the first stereo depth c1 of the image c and the first stereo depth d1 of the image d.
  • In step S13, the depth generation unit 22 uses the image a captured by the imaging device 13a and the image c captured by the imaging device 13c to calculate, by the stereo matching method, the second stereo depth a2 of the image a and the second stereo depth c2 of the image c.
  • In step S14, the depth generation unit 22 uses the image b captured by the imaging device 13b and the image d captured by the imaging device 13d to calculate, by the stereo matching method, the second stereo depth b2 of the image b and the second stereo depth d2 of the image d.
  • In step S15, the depth generation unit 22 combines the first stereo depth a1 of the image a calculated in step S11 and the second stereo depth a2 of the image a calculated in step S13, thereby obtaining the depth a for the image a. Similarly, the depth generation unit 22 combines the first stereo depth b1 of the image b calculated in step S11 and the second stereo depth b2 of the image b calculated in step S14, thereby obtaining the depth b for the image b, and likewise calculates the depth c for the image c and the depth d for the image d. The depth generation process then ends.
  • In this way, the depth generation unit 22 can generate a depth for each of the four images obtained by capturing the user from the viewpoints of the imaging devices 13a to 13d.
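  • The pairwise stereo matching and depth fusion described above can be sketched briefly in code. The following Python fragment is a minimal illustration, assuming rectified grayscale image pairs and using OpenCV's StereoSGBM matcher; averaging the horizontal-pair and vertical-pair depths where both are valid is one plausible reading of how the two stereo depths are combined, not necessarily the exact method of the disclosure.

```python
import cv2
import numpy as np

def stereo_depth(left_gray, right_gray, focal_px, baseline_m, num_disp=128):
    # Disparity by semi-global block matching (the input images must be rectified).
    matcher = cv2.StereoSGBM_create(minDisparity=0,
                                    numDisparities=num_disp,
                                    blockSize=7)
    disp = matcher.compute(left_gray, right_gray).astype(np.float32) / 16.0
    depth = np.zeros_like(disp)
    valid = disp > 0
    depth[valid] = focal_px * baseline_m / disp[valid]   # Z = f * B / d
    return depth

def fuse_depths(depth_h, depth_v):
    # Combine the horizontal-pair and vertical-pair depths of one image:
    # average where both are valid, otherwise keep whichever is available.
    both = (depth_h > 0) & (depth_v > 0)
    return np.where(both, 0.5 * (depth_h + depth_v),
                    np.where(depth_h > 0, depth_h, depth_v))

# Example for image a (camera 13a): pair it with image b (horizontal pair)
# and with image c (vertical pair); the camera parameters are illustrative.
# depth_a1 = stereo_depth(img_a, img_b, focal_px=1000.0, baseline_m=0.3)
# depth_a2 = stereo_depth(img_a, img_c, focal_px=1000.0, baseline_m=0.3)
# depth_a  = fuse_depths(depth_a1, depth_a2)
```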
  • FIG. 4 shows an example of four images a to d captured by the imaging devices 13a to 13d in three states according to the distance from the display device 12 to the user's hand.
  • The base image generation unit 31 determines whether or not to generate a base image viewed from the virtual viewpoint according to the distance from the display device 12 to the user's hand.
  • For example, when the user's hand is sufficiently far from the display device 12 (hereinafter referred to as state A as appropriate), the base image generation unit 31 determines not to generate a base image, and selects any one of the four images a to d to be displayed on the display device 12 of the telepresence system 11 on the other side.
  • When the user's hand comes closer to the display device 12 and appears at the periphery of the screen (hereinafter referred to as state B), the base image generation unit 31 determines to generate a base image, and generates a base image in which the user is viewed from a virtual viewpoint, based on the four images acquired by the image acquisition unit 21 and the depths for the four images generated by the depth generation unit 22.
  • Further, when the user's hand comes even closer to the center of the display device 12 than in state B and approaches the edge of the angle of view that can be imaged by the imaging devices 13a to 13d, that is, when the distance from the display device 12 to the user's hand becomes less than a predetermined second distance, a part of the user's hand is no longer shown in the four images a to d, as shown in FIG. 4C. Hereinafter, this state, in which the hand is so close that a part of it cannot be imaged, is referred to as state C.
  • In state C, a base image lacking a part of the user's hand would be generated, which would make the user on the other side feel even more uncomfortable than in state B.
  • Therefore, after the base image generation unit 31 generates the base image viewed from the virtual viewpoint, the user's hand is synthesized onto the generated base image so as to compensate for the missing part of the hand.
  • The base image generation unit 31 determines whether the current state is state A, state B, or state C according to the distance from the display device 12 to the user's hand.
  • When the base image generation unit 31 determines that state A has changed to state B, it performs a process of switching the image to be displayed on the display device 12 of the telepresence system 11 on the other side from any one of the images a to d to the base image in which the user is viewed from the virtual viewpoint. Note that the base image generation unit 31 may generate a base image even in state A.
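  • As a simple illustration of how the three states might be distinguished, the sketch below classifies the current frame by comparing the display-to-hand distance against two thresholds; the concrete threshold values and the way the hand distance is measured are assumptions made for illustration, not values given by the disclosure.

```python
from enum import Enum

class HandState(Enum):
    A = "far"        # hand sufficiently far: pass one captured image through
    B = "periphery"  # hand near the screen edge: render the base image
    C = "too_close"  # part of the hand outside the field of view: composite the hand

def classify_hand_state(hand_distance_m, first_distance_m=0.8, second_distance_m=0.3):
    # The thresholds are illustrative only; the disclosure speaks of a "first"
    # and a "second" predetermined distance without fixing their values.
    if hand_distance_m >= first_distance_m:
        return HandState.A
    if hand_distance_m >= second_distance_m:
        return HandState.B
    return HandState.C
```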
  • For example, the global coordinate system uses the center of the display device 12 as the origin O, the direction orthogonal to the surface of the display device 12 as the Z axis, the horizontal direction along the surface of the display device 12 as the X axis, and the vertical direction along the surface of the display device 12 as the Y axis.
  • For example, assuming that the height of the other user is 150 cm and the height of the display device 12 is L, the center of the viewpoint P of the partner user can be set to the coordinates (0, 150 - L/2, -0.5) in the global coordinate system.
  • In addition, the x axis, y axis, and z axis of the local coordinate system of the viewpoint P are set to be parallel to the X axis, Y axis, and Z axis of the global coordinate system, respectively.
  • By using the viewpoint P of the other user as the virtual viewpoint when generating the base image, the base image generation unit 31 can generate a base image that does not make the other user feel uncomfortable even when the user's hand comes very close to the display device 12.
  • For example, the telepresence system 11 can obtain user information that specifies the user's viewpoint position, such as the distance from the display device 12 to the user and the height of the user's viewpoint, based on the images captured by the imaging devices 13a to 13d, and transmit this information over the network at any time. The base image generation unit 31 can then determine the coordinates of the virtual viewpoint P based on the user information received from the other side.
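  • As a small sketch of this viewpoint setting, and assuming that the 150 cm figure refers to the partner user's viewpoint height in centimeters while the offset of -0.5 along the display normal is in meters (the units are not spelled out above), the coordinates of the viewpoint P could be derived from the partner's user information as follows.

```python
def virtual_viewpoint(partner_viewpoint_height_cm, display_height_cm, standoff_m=0.5):
    # Global coordinate system: origin O at the display centre, X horizontal,
    # Y vertical, Z orthogonal to the display surface; the viewpoint sits in
    # front of the display, hence the negative Z value.
    x = 0.0
    y = partner_viewpoint_height_cm - display_height_cm / 2.0
    z = -standoff_m
    return (x, y, z)

# Example matching the text: height 150, display height L  ->  (0, 150 - L/2, -0.5)
```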
  • FIG. 6 is a flowchart illustrating a base image generation process in which the base image generation unit 31 generates a base image.
  • In step S21, the base image generation unit 31 converts the four images acquired by the image acquisition unit 21 and the depths generated by the depth generation unit 22 for these four images into a point cloud in the global coordinate system shown in FIG. 5. Thereby, a single point cloud that represents three-dimensionally, as a set of points, the surface of the user as seen from the side of the imaging devices 13a to 13d is synthesized.
  • In step S22, based on the point cloud in the global coordinate system synthesized in step S21, the base image generation unit 31 generates, as the base image, an image of the user on the own side viewed from the virtual viewpoint P (the other user's viewpoint) shown in FIG. 5.
  • As described above, the base image generation unit 31 can generate an image viewed from the virtual viewpoint P based on the four images obtained by capturing the user from the viewpoints of the imaging devices 13a to 13d and the depths of those images.
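  • A compact sketch of steps S21 and S22, under the simplifying assumptions that each camera is described by a pinhole model with known intrinsics K and a known pose in the global coordinate system, and that the virtual view is rendered by naive point splatting with a z-buffer (the disclosure does not fix a rendering method):

```python
import numpy as np

def depth_to_points(depth, color, K, cam_to_global):
    # Step S21: back-project every valid depth pixel into the global coordinate system.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    pts_global = (cam_to_global @ pts_cam.T).T[:, :3]
    return pts_global, color[valid]

def render_from_viewpoint(points, colors, global_to_P, K_virt, size):
    # Step S22: project the merged point cloud into the virtual viewpoint P
    # and keep the nearest point per pixel (z-buffering).
    h, w = size
    image = np.zeros((h, w, 3), dtype=np.uint8)
    zbuf = np.full((h, w), np.inf)
    pts_P = (global_to_P @ np.c_[points, np.ones(len(points))].T).T[:, :3]
    in_front = pts_P[:, 2] > 0
    for (X, Y, Z), c in zip(pts_P[in_front], colors[in_front]):
        u = int(round(K_virt[0, 0] * X / Z + K_virt[0, 2]))
        v = int(round(K_virt[1, 1] * Y / Z + K_virt[1, 2]))
        if 0 <= u < w and 0 <= v < h and Z < zbuf[v, u]:
            zbuf[v, u] = Z
            image[v, u] = c
    return image
```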
  • In the case of state A described above with reference to FIG. 4, that is, when the user's hand is sufficiently far from the display device 12, the image composition unit 33 outputs the image selected by the base image generation unit 31 as it is. Further, in the case of state B described above with reference to FIG. 4, that is, when the user's hand is visible at the periphery of the screen, the image composition unit 33 outputs the base image generated by the base image generation unit 31 as it is.
  • In the case of state C, on the other hand, the image composition unit 33 performs image composition processing that synthesizes the user's hand onto the base image, using the base image generated by the base image generation unit 31 and the point cloud in the global coordinate system, together with the 3D model of the hand recorded in the data recording unit 32.
  • When combining the user's hand with the base image, the image composition unit 33 first estimates the depth Z0 at which the 3D model of the hand is to be placed.
  • For example, FIG. 7A shows a state in which the user reaches toward the display device 12 in state B, and FIG. 7B shows a state in which the user reaches toward the display device 12 in state C. Assuming that the relative distance from the user's body to the hand does not change when state B changes to state C, the image composition unit 33 can infer the hand depth Z0 from the body depth Zs at the time of state C by referring to the depth difference L1 between the body and the hand obtained at the time of state B.
  • Specifically, the image composition unit 33 detects the region in which the user's body is shown and the region in which the user's hand is shown from the image acquired by the image acquisition unit 21, and calculates the average depth of each region from the depth of that image generated by the depth generation unit 22. For example, learning performed in advance can be used to detect the regions in which the user's body and hand are shown. Then, the image composition unit 33 obtains the depth difference L1 by calculating the difference between the average depth of the body region and the average depth of the hand region, and records it in the data recording unit 32.
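  • A minimal numeric sketch of this inference, assuming the body and hand regions have already been segmented (for example by a detector trained in advance, as suggested above) and that a per-pixel depth map is available:

```python
import numpy as np

def depth_offset_body_to_hand(depth, body_mask, hand_mask):
    # State B: record how far the hand sits in front of the body on average.
    body_z = np.mean(depth[body_mask & (depth > 0)])
    hand_z = np.mean(depth[hand_mask & (depth > 0)])
    return body_z - hand_z          # the depth difference L1 in the text

def estimate_hand_depth(body_depth_state_c, l1):
    # State C: the hand itself can no longer be imaged, so place the 3D hand
    # model at the body depth Zs minus the previously recorded offset L1.
    return body_depth_state_c - l1  # Z0 in the text
```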
  • Alternatively, the depth Z0 at which the 3D model of the hand is placed may be estimated using a depth camera 15 as shown in FIG. 8.
  • In this case, the telepresence system 11 is configured to include a depth camera 15 on the ceiling of the room where the display device 12 is arranged, and the image composition unit 33 can estimate the depth Z0 at which the 3D model of the hand is placed based on the distance from the display device 12 to the user's hand measured by the depth camera 15.
  • FIG. 9 shows an image of the base image generated by the base image generation unit 31 in state C.
  • In state C, a part of the user's hand is outside the angle of view that can be captured by the imaging devices 13a to 13d, so images in which that part of the hand does not appear are acquired, and as a result a base image lacking a part of the hand is generated.
  • Therefore, the image composition unit 33 estimates the center position (u_h, v_h) of the hand region missing from the base image based on the luminance I(u_i, v_i) of the pixels (u_i, v_i) in the base image, as shown in equation (1).
  • Then, the image composition unit 33 projects the 3D model of the hand onto the base image based on the center position (u_h, v_h) of the hand region in the base image generated by the base image generation unit 31 and the depth Z0 of the 3D model of the hand obtained as described above.
  • For example, the local coordinate system T of the 3D model of the hand can be defined as a right-handed coordinate system whose origin (X_T0, Y_T0, Z_T0) is the center of gravity of the hand, whose z axis points in the direction opposite the palm, and whose y axis points in the direction of the middle finger.
  • That is, the 3D model of the hand is projected onto the base image so that the center of gravity of the 3D model of the hand is projected onto the center position (u_h, v_h) of the hand region in the base image viewed from the viewpoint P shown in FIG. 5.
  • Therefore, the image composition unit 33 calculates the coordinates (X_T0, Y_T0, Z_T0) of the center of gravity of the 3D model of the hand in the coordinate system of the viewpoint P based on equation (2), where f in equation (2) is the focal length used when the base image of the viewpoint P is generated.
  • Here, it is assumed that the x, y, and z axes of the local coordinate system T of the 3D model of the hand are parallel to the x, y, and z axes of the coordinate system of the virtual viewpoint P. Thereby, when each point in the local coordinate system T of the 3D model of the hand is converted into the coordinate system of the virtual viewpoint P, no rotation is necessary and only a translation is required.
  • Accordingly, the image composition unit 33 converts the coordinates (X_Ti, Y_Ti, Z_Ti) of each point i in the local coordinate system T of the 3D model of the hand into the coordinates (X_Pi, Y_Pi, Z_Pi) in the coordinate system of the virtual viewpoint P according to equation (3).
  • Then, the image composition unit 33 obtains, by calculating equation (4), the pixel (u_i, v_i) onto which each point i of the 3D model of the hand is projected from the coordinate system of the virtual viewpoint P onto the base image. At this time, the depth of the pixel (u_i, v_i) in the depth map of the base image is Z_Pi.
  • Note that multiple points of the 3D model of the hand may be projected onto the same pixel of the base image; in this case, the point with the smallest depth among them is selected as the one projected onto the base image.
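  • The projection steps referenced by equations (2) to (4) can be sketched as follows, under the assumption that they take their conventional forms: a pinhole back-projection of (u_h, v_h) at depth Z0 for equation (2), a pure translation for equation (3), and a perspective projection with z-buffering for equation (4). The principal point (cu, cv) and the helper names are assumptions introduced for illustration.

```python
import numpy as np

def backproject_center(u_h, v_h, z0, f, cu, cv):
    # Assumed role of equation (2): place the centre of gravity of the hand
    # model at depth Z0 so that it projects onto (u_h, v_h).
    x_t0 = (u_h - cu) * z0 / f
    y_t0 = (v_h - cv) * z0 / f
    return np.array([x_t0, y_t0, z0])

def hand_points_in_viewpoint_P(points_local, center_P):
    # Assumed role of equation (3): the local axes of the hand model are
    # parallel to the viewpoint-P axes, so only a translation is required.
    return points_local + center_P

def project_hand(points_P, colors, f, cu, cv, base_image, base_depth):
    # Assumed role of equation (4): perspective projection of each point i,
    # keeping the nearest point when several points land on the same pixel.
    h, w = base_depth.shape
    out_img, out_depth = base_image.copy(), base_depth.copy()
    for (X, Y, Z), c in zip(points_P, colors):
        if Z <= 0:
            continue
        u = int(round(f * X / Z + cu))
        v = int(round(f * Y / Z + cv))
        if 0 <= u < w and 0 <= v < h and Z < out_depth[v, u]:
            out_depth[v, u] = Z
            out_img[v, u] = c
    return out_img, out_depth
```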
  • FIG. 11 is a flowchart illustrating the image composition process in which the image composition unit 33 composites the user's hand onto the base image.
  • In step S31, as described above with reference to FIG. 7, for example, the image composition unit 33 estimates the depth Z0 at which the 3D model of the hand is to be placed.
  • In step S32, the image composition unit 33 estimates the center position of the hand region missing from the base image generated by the base image generation unit 31, that is, the center position (u_h, v_h) described above.
  • In step S33, the image composition unit 33 projects the 3D model of the hand onto the base image based on the depth Z0 estimated in step S31 and the center position (u_h, v_h) of the hand region in the base image estimated in step S32. Thereby, the image composition unit 33 can generate an image in which the user's hand is visible by compositing the hand onto the base image from which a part of the user's hand was missing.
  • As described above, the image composition unit 33 can perform image composition so that the user's hand is appropriately displayed, based on the depth at which the 3D model of the hand is placed and the center position of the hand region missing from the base image.
  • In the overall processing executed in the image processing apparatus 14 (the flowchart of FIG. 12), in step S41 the image acquisition unit 21 acquires the four images of the user captured from the viewpoints of the imaging devices 13a to 13d and supplies them to the depth generation unit 22.
  • In step S42, the depth generation unit 22 performs the depth generation processing (the flowchart of FIG. 3 described above) that generates depths for the four images supplied from the image acquisition unit 21 in step S41. Then, the depth generation unit 22 supplies the four images and the depth corresponding to each image to the base image generation unit 31 and the image composition unit 33.
  • In step S43, the base image generation unit 31 obtains the distance from the display device 12 to the user's hand based on the depths supplied in step S42, and determines whether the user's hand is far enough away that it does not look unnatural to the other user.
  • When the base image generation unit 31 determines in step S43 that the user's hand is sufficiently far away (state A in FIG. 4), the process proceeds to step S44.
  • In step S44, the base image generation unit 31 selects any one of the four images captured by the imaging devices 13a to 13d as the image to be displayed on the display device 12 of the telepresence system 11 on the other side.
  • On the other hand, when it is determined in step S43 that the user's hand is not sufficiently far away, the process proceeds to step S45, where the base image generation unit 31 performs the base image generation processing (the flowchart of FIG. 6 described above) using the four images captured by the imaging devices 13a to 13d and the depths generated by the depth generation unit 22 in step S42.
  • In step S46, based on the distance from the display device 12 to the user's hand obtained from the depths generated by the depth generation unit 22 in step S42, the image composition unit 33 determines whether or not the hand is so close that a part of it cannot be captured.
  • If the image composition unit 33 determines in step S46 that the hand is so close that a part of it cannot be captured (state C in FIG. 4), the process proceeds to step S47.
  • In step S47, the image composition unit 33 composites the user's hand onto the base image generated in step S45, based on the 3D model of the hand recorded in the data recording unit 32 (the flowchart of FIG. 11 described above).
  • In step S48, the image composition unit 33 supplies the image in which the user's hand has been composited onto the base image to the image transmission unit 24, and the image transmission unit 24 transmits that image.
  • After the processing of step S44, the process also proceeds to step S48, where the image composition unit 33 supplies the image selected by the base image generation unit 31 to the image transmission unit 24, and the image transmission unit 24 transmits that image. Likewise, if it is determined in step S46 that the hand is not so close that a part of it cannot be captured (state B in FIG. 4), the process proceeds to step S48, and the image composition unit 33 supplies the base image generated by the base image generation unit 31 in step S45 to the image transmission unit 24 as it is, and the image transmission unit 24 transmits the base image.
  • After step S48, the process returns to step S41, and the same processing is repeated for the next images captured by the imaging devices 13a to 13d.
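  • Tying the above steps together, a minimal per-frame loop might look like the sketch below; the callables passed in are placeholders for the units described above, and the loop reuses the HandState helper sketched earlier, so this is an illustrative outline rather than the disclosure's literal control flow.

```python
def process_frame(capture_images, generate_depths, hand_distance,
                  select_image, generate_base_image, composite_hand, transmit):
    # Each argument is a callable standing in for one of the units above.
    images = capture_images()                                    # step S41
    depths = generate_depths(images)                             # step S42 (FIG. 3)
    state = classify_hand_state(hand_distance(images, depths))   # step S43

    if state is HandState.A:                                     # step S44: pass-through
        display_image = select_image(images)
    else:                                                        # step S45: base image
        display_image = generate_base_image(images, depths)
        if state is HandState.C:                                 # steps S46/S47: add hand
            display_image = composite_hand(display_image, images, depths)

    transmit(display_image)                                      # step S48
```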
  • As described above, the image processing apparatus 14 can realize processing that allows the users to bring their hands together without a sense of incongruity, based on the images captured by the imaging apparatuses 13a to 13d arranged around the display apparatus 12. Thereby, the telepresence system 11 can provide friendlier communication.
  • FIG. 13 is a block diagram illustrating a configuration example of the second embodiment of the image processing apparatus 14.
  • The same reference numerals are given to the components that are the same as those of the image processing apparatus 14 in FIG. 2, and their detailed descriptions are omitted.
  • Like the image processing apparatus 14 of FIG. 2, the image processing apparatus 14A includes an image acquisition unit 21, a depth generation unit 22, and an image transmission unit 24, and its image generation unit 23A includes a base image generation unit 31, a data recording unit 32, and an image composition unit 33. However, the image processing apparatus 14A differs in configuration from the image processing apparatus 14 in that the base image generation unit 31 supplies the point cloud in the global coordinate system to the data recording unit 32 so that it is recorded there.
  • That is, when it is determined that the user's hand is in state B, in which the hand appears at the periphery of the screen, the base image generation unit 31 supplies the point cloud synthesized from the depths for the four images generated by the depth generation unit 22 to the data recording unit 32.
  • Then, when it is determined that the hand is in state C, in which it is so close that a part of it cannot be captured, the image composition unit 33 aligns the point cloud of state B recorded in the data recording unit 32 with the point cloud of state C, and composites the user's hand onto the base image.
  • FIG. 14 is a diagram showing an image of aligning two point clouds.
  • For example, the image composition unit 33 aligns these two point clouds using a technique such as ICP (Iterative Closest Point), and obtains the portion missing from the state C point cloud from the state B point cloud.
  • Then, the image composition unit 33 projects the state C point cloud, whose hand portion has been complemented from the state B point cloud, onto the base image viewed from the virtual viewpoint P, thereby compositing the user's hand.
  • For example, the image composition unit 33 can place the state B point cloud and the state C point cloud in the coordinate system of the virtual viewpoint P shown in FIG. 5 and calculate the above-described equation (4), whereby the point cloud of the hand portion can be projected onto the base image.
  • As described above, the image generation unit 23A does not use a 3D model of the hand created in advance, but combines the user's hand with the base image by using the state B point cloud from several frames earlier, so that the users can realize communication in which their hands are brought together.
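  • A small sketch of this point-cloud path, assuming the Open3D library is available for the alignment (the text names ICP only as one example technique); the hand-point selection and the final projection reuse the ideas sketched earlier and are not the disclosure's literal procedure.

```python
import numpy as np
import open3d as o3d

def align_state_b_to_c(points_b, points_c, init=np.eye(4), max_dist=0.05):
    # Rigidly align the recorded state-B point cloud to the current state-C
    # point cloud with point-to-point ICP.
    src = o3d.geometry.PointCloud()
    src.points = o3d.utility.Vector3dVector(points_b)
    tgt = o3d.geometry.PointCloud()
    tgt.points = o3d.utility.Vector3dVector(points_c)
    result = o3d.pipelines.registration.registration_icp(
        src, tgt, max_dist, init,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation

def complement_hand(points_b, hand_mask_b, points_c, transform_b_to_c):
    # Carry the hand points observed in state B over into the state-C cloud,
    # which is then projected onto the base image as before.
    hand_b = points_b[hand_mask_b]
    hand_b_h = np.c_[hand_b, np.ones(len(hand_b))]
    hand_in_c = (transform_b_to_c @ hand_b_h.T).T[:, :3]
    return np.vstack([points_c, hand_in_c])
```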
  • FIG. 15 is a block diagram illustrating a configuration example of the third embodiment of the image processing apparatus 14.
  • The same reference numerals are given to the components that are the same as those of the image processing device 14 in FIG. 2, and their detailed descriptions are omitted.
  • The image processing device 14B includes an image acquisition unit 21, a depth generation unit 22, and an image transmission unit 24, as with the image processing device 14 of FIG. 2.
  • The image generation unit 23B of the image processing apparatus 14B includes a user recognition unit 34 in addition to the base image generation unit 31, the data recording unit 32, and the image composition unit 33.
  • In the data recording unit 32, a 3D model of a hand is recorded for each of a plurality of users, and the features of each user are also recorded. For example, a user ID (identification) for identifying each user is set, and the data recording unit 32 supplies the 3D model of the hand corresponding to the user ID specified by the user recognition unit 34 to the image composition unit 33.
  • The user recognition unit 34 detects a feature of the user from the image obtained from the base image generation unit 31, refers to the user features recorded in the data recording unit 32, and specifies to the data recording unit 32 the user ID corresponding to the detected feature. For example, the user recognition unit 34 detects a face in the image using a face detection method, recognizes the user by a face recognition method that compares the facial features appearing in the image with the facial features of each user recorded in the data recording unit 32, and can then specify to the data recording unit 32 the user ID of the user recognized as the same person from the facial features. For the face detection and face recognition methods of the user recognition unit 34, a learning method such as deep learning can be used, for example.
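  • A hedged sketch of this per-user lookup, assuming face embeddings are produced by some recognizer trained in advance (the text only says a learning method such as deep learning can be used) and modelling the data recording unit as a simple in-memory table keyed by user ID:

```python
import numpy as np

class DataRecordingUnit:
    def __init__(self):
        self.records = {}   # user_id -> (face_embedding, hand_3d_model)

    def register(self, user_id, face_embedding, hand_model):
        self.records[user_id] = (np.asarray(face_embedding), hand_model)

    def hand_model_for(self, user_id):
        return self.records[user_id][1]

def recognize_user(face_embedding, recording_unit, threshold=0.6):
    # Return the ID of the registered user whose stored embedding is closest
    # to the detected face, or None if nothing is similar enough.
    best_id, best_dist = None, threshold
    query = np.asarray(face_embedding)
    for user_id, (stored, _) in recording_unit.records.items():
        dist = float(np.linalg.norm(query - stored))
        if dist < best_dist:
            best_id, best_dist = user_id, dist
    return best_id
```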
  • As described above, the image processing apparatus 14B records the 3D hand models of a plurality of users in the data recording unit 32 in advance, and by recognizing the user with the user recognition unit 34, can composite the hand of the recognized user onto the base image.
  • FIG. 16 is a block diagram showing a configuration example of the fourth embodiment of the image processing apparatus 14.
  • The same reference numerals are given to the components that are the same as those of the image processing device 14 in FIG. 2, and their detailed descriptions are omitted.
  • The image processing device 14C includes the image acquisition unit 21, the depth generation unit 22, the image generation unit 23, and the image transmission unit 24, similarly to the image processing device 14 of FIG. 2. Furthermore, the image processing device 14C is configured to include a hand-matching recognition unit 25.
  • The hand-matching recognition unit 25 uses any one of the depths for the four images generated by the depth generation unit 22, together with the image corresponding to that depth, to recognize the user's intention to perform communication in which the users bring their hands together. The hand-matching recognition unit 25 then transmits the recognition result indicating that the user intends to perform such communication to the telepresence system 11 on the other side via a network (not shown).
  • For example, the hand-matching recognition unit 25 recognizes the region in which the hand appears in the image and extracts the depth of the recognized hand region. Then, by referring to the extracted depth of the hand region, the hand-matching recognition unit 25 can recognize that the user intends to perform hand-matching communication when it determines that the user's hand is at or closer than a predetermined distance.
  • Alternatively, the hand-matching recognition unit 25 may record the extracted depth of the hand region for several preceding frames and determine whether or not the user's hand is approaching, based on the depth of the hand region several frames earlier and the depth of the hand region in the current frame. When it determines that the user's hand is approaching, the hand-matching recognition unit 25 can recognize that the user intends to perform hand-matching communication.
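  • As a simple illustration (a sketch under assumed thresholds, not the disclosure's concrete criterion), the intention could be recognized either from an absolute-distance test or from the trend of the hand-region depth over the last few frames:

```python
from collections import deque

class HandMatchingRecognizer:
    def __init__(self, near_threshold_m=0.4, history=5, approach_margin_m=0.05):
        self.near_threshold_m = near_threshold_m
        self.approach_margin_m = approach_margin_m
        self.depth_history = deque(maxlen=history)  # hand-region depth per frame

    def update(self, hand_region_depth_m):
        # Returns True when an intention to bring hands together is recognized.
        self.depth_history.append(hand_region_depth_m)
        # Criterion 1: the hand is already within a predetermined distance.
        if hand_region_depth_m <= self.near_threshold_m:
            return True
        # Criterion 2: the hand has clearly moved closer over the recorded frames.
        if len(self.depth_history) == self.depth_history.maxlen:
            if self.depth_history[0] - hand_region_depth_m > self.approach_margin_m:
                return True
        return False
```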
  • As described above, the image processing apparatus 14C can recognize whether or not the user intends to perform communication in which the users bring their hands together, and when that intention is recognized, image processing suitable for bringing the users' hands together can be performed more reliably.
  • In addition, feedback indicating that such communication can be performed can be provided based on this recognition result.
  • In the above, communication in which the users bring their hands together via the telepresence system 11 has been described as an example, but the telepresence system 11 can also perform image processing other than compositing the user's hand onto the base image. For example, the telepresence system 11 can perform image processing that composites a 3D model of another part, such as the body or the face, onto the corresponding base image.
  • Note that the processes described with reference to the flowcharts above do not necessarily have to be executed in the chronological order described in the flowcharts, and may be executed in parallel or individually (for example, as parallel processing or object-based processing).
  • The program may be processed by a single CPU, or may be processed in a distributed manner by a plurality of CPUs.
  • The above-described series of processing can be executed by hardware or by software.
  • When the series of processing is executed by software, a program constituting the software is installed from a program recording medium onto a computer incorporated in dedicated hardware, or onto, for example, a general-purpose personal computer capable of executing various functions by installing various programs.
  • FIG. 17 is a block diagram showing an example of the hardware configuration of a computer that executes the above-described series of processing by means of a program.
  • In the computer, a CPU (Central Processing Unit) 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, and an EEPROM (Electrically Erasable and Programmable Read Only Memory) 104 are connected to one another via a bus 105.
  • The CPU 101 loads the program stored in the ROM 102 and the EEPROM 104 into the RAM 103 via the bus 105 and executes it, thereby performing the above-described series of processing.
  • A program executed by the computer (CPU 101) can be written into the ROM 102 in advance, or can be installed into or updated in the EEPROM 104 from the outside via the input/output interface 105.
  • In this specification, the term "system" represents an entire apparatus composed of a plurality of apparatuses.
  • Note that the present technology can also take the following configurations.
  • (1) An image processing apparatus including: an image acquisition unit that acquires a plurality of images of a user captured from a plurality of viewpoints; an image generation unit that generates, based on the plurality of images, a display image in which a part of the user is appropriately displayed when that part is outside the angle of view at which the plurality of images are captured; and an image transmission unit that transmits the display image generated by the image generation unit.
  • (2) The image processing device according to (1), further including a depth generation unit that generates, based on the plurality of images, a depth representing the depth in each image, wherein the image generation unit generates the display image based on the plurality of images and the corresponding depths.
  • (3) The image processing device according to (2), wherein the image generation unit includes a base image generation unit that generates a base image, which is an image of the user viewed from a virtual viewpoint different from the plurality of viewpoints, based on the plurality of images and the corresponding depths.
  • (4) The image processing device according to (3), wherein the virtual viewpoint is set to the viewpoint of the counterpart user to whom the image transmission unit transmits the display image.
  • (5) The image processing device according to (3) or (4), wherein a plurality of imaging devices that image the user are arranged around a display device that displays the display image transmitted from the other party to which the image transmission unit transmits the display image, and the base image generation unit generates the base image based on the plurality of images captured by the plurality of imaging devices.
  • (6) The image processing device according to (5), wherein the image transmission unit transmits, as the display image, any one of the plurality of images of the user captured from the plurality of viewpoints.
  • (7) The image processing device according to (5) or (6), wherein the image generation unit includes an image combining unit that combines a part of the user with the base image when the distance from the display device to that part of the user is less than a second distance at which that part does not appear in the plurality of images, and the image transmission unit transmits, as the display image, the image in which the part of the user has been combined with the base image by the image combining unit.
  • (8) The image processing device according to (7), wherein, when the distance from the display device to the part of the user is less than the first distance and greater than or equal to the second distance, the image transmission unit transmits the base image generated by the base image generation unit as the display image.
  • (9) The image processing device according to (7) or (8), wherein the image generation unit further includes a data recording unit that records a 3D model in which the part of the user is formed three-dimensionally, and the image synthesis unit synthesizes the part of the user with the base image using the 3D model recorded in the data recording unit.
  • (10) The image processing device according to any one of (7) to (9), wherein the data recording unit records the point cloud of the user used when the base image is generated while the distance from the display device to the part of the user is less than the first distance and greater than or equal to the second distance, and the image synthesizing unit synthesizes the part of the user with the base image using the point cloud recorded in the data recording unit when the distance from the display device to the part of the user is less than the second distance at which that part is not shown in the plurality of images.
  • (11) The image processing device according to (9), wherein the image generation unit further includes a user recognition unit that recognizes the user shown in the image, the data recording unit records the 3D model for each of a plurality of users, and the image synthesis unit synthesizes the part of the user with the base image using the 3D model corresponding to the user recognized by the user recognition unit.
  • (12) The image processing device according to any one of (1) to (11), wherein communication is performed in which the user's own hand is brought together with the hand of the other user displayed on the display device that displays the display image transmitted from the other party to which the image transmission unit transmits the display image.
  • A telepresence system including: a display device that displays a display image transmitted from the other party; a plurality of imaging devices that are arranged around the display device and image a user from a plurality of viewpoints; an image acquisition unit that acquires the images captured by each of the plurality of imaging devices; an image generation unit that generates, based on the plurality of images, a display image in which a part of the user is appropriately displayed when that part is outside the angle of view at which the plurality of images are captured; and an image transmission unit that transmits the display image generated by the image generation unit.
  • 11 telepresence system, 12 display device, 13a to 13d imaging device, 14 image processing device, 15 depth camera, 21 image acquisition unit, 22 depth generation unit, 23 image generation unit, 24 image transmission unit, 25 hand-matching recognition unit, 31 base image generation unit, 32 data recording unit, 33 image composition unit, 34 user recognition unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present disclosure relates to an image processing device, an image processing method, a program, and a telepresence system which make it possible to achieve communication that feels less unnatural. A plurality of image capture devices for capturing images of a user from a plurality of viewpoints are disposed around a display device for displaying a display image transmitted from a counterpart, and an image acquisition unit acquires images respectively captured by the plurality of image capture devices. On the basis of the plurality of images, an image generation unit generates a display image such that, when a hand of the user falls outside the angle of view at which the plurality of images are captured, that part is appropriately displayed, and an image transmission unit transmits the display image to the counterpart. This technology is applicable, for example, to a telepresence system.

Description

Image processing apparatus, image processing method, program, and telepresence system
The present disclosure relates to an image processing device, an image processing method, a program, and a telepresence system, and more particularly, to an image processing device, an image processing method, a program, and a telepresence system that enable communication with less of a sense of incongruity.
Conventionally, a plurality of users in remote locations use a telepresence system to communicate as if they were facing each other. In general, however, in a telepresence system the imaging device that captures the user on one's own side and the display device that displays the user on the other side are arranged at different positions, so a sense of incongruity, such as the users' lines of sight not meeting, sometimes occurred.
Therefore, as disclosed in Patent Document 1, a camera-integrated display device has been proposed in which the users' lines of sight can be matched by imaging with a camera provided on the back side of the display main body through a hole provided in the polarizing plate of the liquid crystal display unit.
Patent Document 1: JP-A-6-245209
As described above, a sense of incongruity can conventionally occur in communication via telepresence systems, and improvement has therefore been sought. For example, when users try to bring their hands together using a telepresence system and a user moves a hand toward the other user's hand displayed on the display device, the hand is no longer shown on the screen once it falls outside the angle of view that can be captured by the imaging device. For this reason, it has been difficult for users to perform the communication of bringing their hands together without a sense of incongruity. Note that the camera-integrated display device proposed in Patent Document 1 described above has a structure that requires a certain thickness in order to capture images with a camera provided on the back side of the liquid crystal display unit, and therefore could not easily be installed in, for example, a room.
The present disclosure has been made in view of such a situation, and is intended to enable communication with less of a sense of incongruity.
An image processing apparatus according to one aspect of the present disclosure includes: an image acquisition unit that acquires a plurality of images of a user captured from a plurality of viewpoints; an image generation unit that generates, based on the plurality of images, a display image in which a part of the user is appropriately displayed when that part falls outside the angle of view at which the plurality of images are captured; and an image transmission unit that transmits the display image generated by the image generation unit.
An image processing method or program according to one aspect of the present disclosure includes steps of: acquiring a plurality of images of a user captured from a plurality of viewpoints; generating, based on the plurality of images, a display image in which a part of the user is appropriately displayed when that part falls outside the angle of view at which the plurality of images are captured; and transmitting the generated display image.
A telepresence system according to one aspect of the present disclosure includes: a display device that displays a display image transmitted from a counterpart; a plurality of imaging devices that are arranged around the display device and image a user from a plurality of viewpoints; an image acquisition unit that acquires the images captured by each of the plurality of imaging devices; an image generation unit that generates, based on the plurality of images, a display image in which a part of the user is appropriately displayed when that part falls outside the angle of view at which the plurality of images are captured; and an image transmission unit that transmits the display image generated by the image generation unit.
In one aspect of the present disclosure, a plurality of images of a user captured from a plurality of viewpoints are acquired, a display image is generated based on the plurality of images such that a part of the user is appropriately displayed when that part falls outside the angle of view at which the plurality of images are captured, and the generated display image is transmitted.
According to one aspect of the present disclosure, communication with less of a sense of incongruity can be achieved.
FIG. 1 is a perspective view illustrating a schematic configuration of a telepresence system to which the present technology is applied. FIG. 2 is a block diagram showing a configuration example of the first embodiment of an image processing apparatus. FIG. 3 is a flowchart explaining the depth generation process. FIG. 4 is a diagram showing examples of images according to the distance from the display device to the user's hand. FIG. 5 is a diagram explaining a virtual viewpoint. FIG. 6 is a flowchart explaining the base image generation process. FIG. 7 is a diagram explaining a method of estimating the depth at which the 3D model of the hand is placed. FIG. 8 is a diagram explaining another method of estimating the depth at which the 3D model of the hand is placed. FIG. 9 is a diagram showing an image of a base image in which a part of the user's hand is missing. FIG. 10 is a diagram showing the local coordinate system of the 3D model of the hand. FIG. 11 is a flowchart explaining the image composition process. FIG. 12 is a flowchart explaining the processing executed in the image processing apparatus. FIG. 13 is a block diagram showing a configuration example of the second embodiment of the image processing apparatus. FIG. 14 is a diagram showing an image of aligning two point clouds. FIG. 15 is a block diagram showing a configuration example of the third embodiment of the image processing apparatus. FIG. 16 is a block diagram showing a configuration example of the fourth embodiment of the image processing apparatus. FIG. 17 is a block diagram illustrating a configuration example of an embodiment of a computer to which the present technology is applied.
Hereinafter, specific embodiments to which the present technology is applied will be described in detail with reference to the drawings.
<Description of the telepresence system>
FIG. 1 is a perspective view showing a schematic configuration of a telepresence system to which the present technology is applied.
As shown in FIG. 1, the telepresence system 11 includes a display device 12, imaging devices 13a to 13d, and an image processing device 14.
The telepresence system 11 can provide, for example, a communication experience in which two users at remote locations appear to be facing each other. Hereinafter, the user in front of the display device 12 shown in FIG. 1 is referred to as the own side, and the user displayed on the display device 12 is referred to as the other side. The telepresence system 11 is provided on both the own side and the other side, and the telepresence systems 11 on the own side and the other side can communicate with each other via a network.
The display device 12 is connected to a communication device (not shown) that can communicate with the telepresence system 11 on the other side, displays the image transmitted from the telepresence system 11 on the other side, and thereby shows the user on the other side on its screen.
The imaging devices 13a to 13d are arranged around the display device 12, image the user from the viewpoints of their respective arrangement positions, and supply the images (RGB color images) obtained by the imaging to the image processing device 14. Although FIG. 1 shows an arrangement in which the four imaging devices 13a to 13d form a 2×2 layout (two vertically by two horizontally), the arrangement positions are not limited to the example shown in FIG. 1. As long as the user can be imaged from a plurality of viewpoints, the number of imaging devices 13 may be three or fewer, or five or more.
The image processing device 14 uses the four images supplied from the imaging devices 13a to 13d to perform image processing that generates an image of the user viewed from a virtual viewpoint different from the viewpoints of the imaging devices 13a to 13d, and transmits the result to the telepresence system 11 on the other side. For example, during telepresence, the image processing device 14 can set the virtual viewpoint (viewpoint P, described later with reference to FIG. 5) so that the other user does not feel uncomfortable when looking at this user. The detailed configuration of the image processing device 14 will be described with reference to FIG. 2.
Here, for example, when the user on the own side and the user on the other side perform a motion of bringing their hands together via the telepresence system 11, it is assumed that the hand approaching the display device 12 falls outside the angle of view that can be captured by the imaging devices 13a to 13d. Therefore, the image processing device 14 performs image processing so that the palms can be brought together at the center of the display device 12 as shown in FIG. 1, allowing the communication of bringing hands together to be performed without a sense of incongruity.
In the following, the image processing performed by the image processing device 14 when the user on the own side and the user on the other side perform a motion of bringing their hands together via the telepresence system 11, as described here, will be explained.
<Example configuration of image processing device>
FIG. 2 is a block diagram showing a configuration example of the first embodiment of the image processing device 14.
As shown in FIG. 2, the image processing device 14 includes an image acquisition unit 21, a depth generation unit 22, an image generation unit 23, and an image transmission unit 24.
The image acquisition unit 21 is connected to the imaging devices 13a to 13d in FIG. 1 by wire or wirelessly, acquires the four images of the user captured from the viewpoints of the imaging devices 13a to 13d, and supplies them to the depth generation unit 22.
The depth generation unit 22 uses the four images supplied from the image acquisition unit 21 to generate, for each image, a depth that represents the distance at each coordinate in the image plane, and supplies the depths to the image generation unit 23. For example, the depth generation unit 22 can obtain a stereo depth for each image by a stereo matching method using two images that are adjacent vertically or horizontally, and then combine the vertical and horizontal stereo depths of each image to finally generate the depths for the four images.
The image generation unit 23 includes a base image generation unit 31, a data recording unit 32, and an image composition unit 33. For example, when the users of the telepresence systems 11 communicate by bringing their hands together and the own-side user's hand moves outside the angle of view that can be captured by the imaging devices 13a to 13d, the image generation unit 23 generates an image in which that hand is appropriately displayed on the display device 12 of the partner-side telepresence system 11.
When the user's hand is sufficiently far from the display device 12 (for example, state A in FIG. 4 described later), the base image generation unit 31 selects one of the four images acquired by the image acquisition unit 21 as the image to be displayed on the display device 12 of the partner-side telepresence system 11. On the other hand, when the user's hand is not sufficiently far from the display device 12 (for example, state B or state C in FIG. 4 described later), the base image generation unit 31 generates an image of the user viewed from a virtual viewpoint as the base image on which the display on the partner-side display device 12 is based. For example, the base image generation unit 31 can generate the base image of the user viewed from the virtual viewpoint based on the four images acquired by the image acquisition unit 21 and the depths generated for those four images by the depth generation unit 22. The base image generation processing of the base image generation unit 31 will be described later with reference to FIGS. 4 to 6.
In the data recording unit 32, for example, a 3D model of the hand, in which the shape of the own-side user's hand is formed three-dimensionally and the texture of that hand is applied, is recorded in advance. For example, when the users of the telepresence systems 11 communicate by bringing their hands together, each display device 12 shows an image in which the palm of the other user is visible, so a 3D hand model capable of displaying the palm is created in advance. The 3D hand model may cover only the hand, or may be created so as to include, for example, the part from the hand to the elbow. In addition to a hand model registered in advance for each user, a hand model used in existing computer graphics may be used.
For example, when the user's hand has come so close to the display device 12 that part of it is outside the angle of view that can be captured by the imaging devices 13a to 13d and can no longer be imaged (for example, state C in FIG. 4 described later), the image composition unit 33 composites the user's hand onto the base image. For example, the image composition unit 33 composites an image of the user's palm based on the 3D hand model recorded in the data recording unit 32 onto the base image of the user viewed from the virtual viewpoint, so that the user's hand is appropriately displayed on the display device 12 of the partner-side telepresence system 11. The image composition processing of the image composition unit 33 will be described later with reference to FIGS. 7 to 11.
The image transmission unit 24 is connected to a communication device (not shown) that can communicate with the partner-side telepresence system 11 via the network, and transmits the image generated by the image generation unit 23 as the display image to be displayed by the partner-side telepresence system 11.
The image processing device 14 configured in this way can cause each partner-side display device 12 to display an image in which the user's hand, based on the 3D hand model, is composited onto the base image generated from the images captured by the imaging devices 13a to 13d. As a result, the telepresence system 11 allows the users to put their hands together without a sense of incongruity.
<Description of processing of depth generation unit>
The depth generation processing executed in the depth generation unit 22 will be described with reference to the flowchart shown in FIG. 3.
For example, the processing starts when an image a captured by the imaging device 13a, an image b captured by the imaging device 13b, an image c captured by the imaging device 13c, and an image d captured by the imaging device 13d are supplied from the image acquisition unit 21 to the depth generation unit 22.
In step S11, the depth generation unit 22 uses the image a captured by the imaging device 13a and the image b captured by the imaging device 13b to calculate, by stereo matching, a first stereo depth a1 of the image a and a first stereo depth b1 of the image b.
In step S12, the depth generation unit 22 uses the image c captured by the imaging device 13c and the image d captured by the imaging device 13d to calculate, by stereo matching, a first stereo depth c1 of the image c and a first stereo depth d1 of the image d.
In step S13, the depth generation unit 22 uses the image a captured by the imaging device 13a and the image c captured by the imaging device 13c to calculate, by stereo matching, a second stereo depth a2 of the image a and a second stereo depth c2 of the image c.
In step S14, the depth generation unit 22 uses the image b captured by the imaging device 13b and the image d captured by the imaging device 13d to calculate, by stereo matching, a second stereo depth b2 of the image b and a second stereo depth d2 of the image d.
In step S15, the depth generation unit 22 obtains a depth a for the image a by combining the first stereo depth a1 of the image a calculated in step S11 with the second stereo depth a2 of the image a calculated in step S13. Similarly, the depth generation unit 22 obtains a depth b for the image b by combining the first stereo depth b1 of the image b calculated in step S11 with the second stereo depth b2 of the image b calculated in step S14. In the same manner, the depth generation unit 22 calculates a depth c for the image c and a depth d for the image d, and the depth generation processing ends.
As described above, the depth generation unit 22 can generate a depth for each of the four images of the user captured from the viewpoints of the imaging devices 13a to 13d.
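As a rough sketch of steps S11 to S15, each image gets one stereo depth from a horizontal pair and one from a vertical pair, and the two are merged per pixel. The merging rule below (average where both are valid, otherwise keep the valid value) and the stereo matcher itself are assumptions; the source does not specify them.

```python
import numpy as np

def stereo_depth(ref_img, other_img):
    """Placeholder for a stereo matching routine that returns a depth map for
    ref_img, with 0 where no correspondence was found. Not specified by the source."""
    raise NotImplementedError

def combine(depth_1, depth_2):
    """Merge the two stereo depths of one image: average where both are valid,
    otherwise keep whichever value is available."""
    out = np.where(depth_1 > 0, depth_1, depth_2)
    both = (depth_1 > 0) & (depth_2 > 0)
    out[both] = 0.5 * (depth_1[both] + depth_2[both])
    return out

def generate_depths(img_a, img_b, img_c, img_d):
    """Steps S11 to S15: two stereo depths per image, combined into one depth each."""
    a1, b1 = stereo_depth(img_a, img_b), stereo_depth(img_b, img_a)   # step S11
    c1, d1 = stereo_depth(img_c, img_d), stereo_depth(img_d, img_c)   # step S12
    a2, c2 = stereo_depth(img_a, img_c), stereo_depth(img_c, img_a)   # step S13
    b2, d2 = stereo_depth(img_b, img_d), stereo_depth(img_d, img_b)   # step S14
    return (combine(a1, a2), combine(b1, b2),                         # step S15
            combine(c1, c2), combine(d1, d2))
```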
<Description of processing of base image generation unit>
The processing performed in the base image generation unit 31 will be described with reference to FIGS. 4 to 6.
FIG. 4 shows examples of the four images a to d captured by the imaging devices 13a to 13d in three states that depend on the distance from the display device 12 to the user's hand.
The base image generation unit 31 determines, according to the distance from the display device 12 to the user's hand, whether or not to generate a base image of the user viewed from the virtual viewpoint.
For example, when the user's hand is sufficiently far from the display device 12 and the distance from the display device 12 to the user's hand is greater than or equal to a predetermined first distance, the user's hand appears near the center of the screen in each of the four images a to d, as shown in A of FIG. 4. Accordingly, when the user's hand is sufficiently far from the display device 12 (hereinafter referred to as state A as appropriate), the base image generation unit 31 determines not to generate a base image, and selects one of the four images a to d as the image to be displayed on the display device 12 of the partner-side telepresence system 11.
On the other hand, when the user's hand comes closer to the center of the display device 12 than in state A, so that the distance from the display device 12 to the user's hand is no longer sufficiently large and falls below the first distance, the user's hand appears near the periphery of the screen in the four images a to d, as shown in B of FIG. 4. In this state, in which the user's hand appears at the periphery of the screen (hereinafter referred to as state B as appropriate), the image looks unnatural to the partner-side user, who is trying to bring the hands together at the center of the display device 12. Therefore, in state B, the base image generation unit 31 determines to generate a base image, and generates a base image of the user viewed from the virtual viewpoint based on the four images acquired by the image acquisition unit 21 and the depths generated for those four images by the depth generation unit 22.
Further, when the user's hand comes still closer to the center of the display device 12 than in state B, so close that part of the hand is outside the angle of view that can be captured by the imaging devices 13a to 13d and the distance from the display device 12 to the user's hand is less than a predetermined second distance, part of the user's hand no longer appears on the screen in the four images a to d, as shown in C of FIG. 4. In this state, in which the hand is so close that part of it cannot be imaged (hereinafter referred to as state C as appropriate), a base image with part of the user's hand missing would be generated, which looks even more unnatural to the partner-side user than state B. Therefore, in state C, the base image generation unit 31 generates a base image of the user viewed from the virtual viewpoint, and the user's hand is then composited onto the generated base image so as to fill in the missing part of the hand.
In this way, the base image generation unit 31 determines, according to the distance from the display device 12 to the user's hand, whether the current state is state A, state B, or state C. When the base image generation unit 31 determines that the state has changed from state A to state B, it switches the image displayed on the display device 12 of the partner-side telepresence system 11 from one of the images a to d to the base image of the user viewed from the virtual viewpoint. Note that the base image generation unit 31 may also generate a base image in state A.
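A minimal sketch of this three-way decision, assuming the hand distance is measured from the generated depths and that the two thresholds are given as parameters (the threshold values below are illustrative assumptions, not values from the source):

```python
def classify_state(hand_distance_m, first_distance_m=1.0, second_distance_m=0.3):
    """Classify the hand distance into states A, B, and C as described above."""
    if hand_distance_m >= first_distance_m:
        return "A"  # hand far away: forward one captured image as-is
    if hand_distance_m >= second_distance_m:
        return "B"  # hand near the screen: render a base image from the virtual viewpoint
    return "C"      # hand partly outside the field of view: base image plus 3D hand composition
```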
Here, the virtual viewpoint used when generating the base image will be described with reference to FIG. 5.
In state B and state C described above, the viewpoint from which the own-side user can be shown to the partner-side user without discomfort is, if the display device 12 is regarded as a window, the viewpoint from which the partner-side user looks at the own-side user through that window.
Therefore, as shown in FIG. 5, the global coordinate system is set with the center of the display device 12 as the origin O, the direction orthogonal to the surface of the display device 12 as the Z axis, the horizontal direction along the surface of the display device 12 as the X axis, and the vertical direction along the surface of the display device 12 as the Y axis. Here, for example, suppose that the height of the partner-side user is 150 cm and the height of the display device 12 is L. When the partner-side user looks at the center of the display device 12 from a position 0.5 m away from the display device 12 of the telepresence system 11, the center of the viewpoint P of the partner-side user can be set to the coordinates (0, 150 - L/2, -0.5) in the global coordinate system. The x, y, and z axes of the local coordinate system of the viewpoint P are set parallel to the X, Y, and Z axes of the global coordinate system, respectively.
In this way, by using the partner-side user's viewpoint P as the virtual viewpoint when generating the base image, the base image generation unit 31 can generate a base image that does not look unnatural to the partner-side user even when the own-side user's hand is too close to the display device 12.
Note that the telepresence system 11 can, based on the images captured by the imaging devices 13a to 13d, obtain user information that specifies the user's viewpoint position, such as the distance from the display device 12 to the user and the height of the user's viewpoint, and can transmit it via the network as needed. The base image generation unit 31 can then determine the coordinates of the virtual viewpoint P based on the partner-side user information.
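A small sketch of how the viewpoint P could be derived from the partner-side user information. The fallback values (a 150 cm eye height and a 0.5 m viewing distance) follow the example above, and the assumption that the display bottom edge is at floor level (so that the display center sits L/2 above it) is mine, not stated in the source.

```python
def virtual_viewpoint(partner_eye_height_m=None, partner_distance_m=None, display_height_m=1.0):
    """Return the coordinates of the virtual viewpoint P in the global coordinate
    system of FIG. 5 (origin at the display center, Z orthogonal to the screen).
    Falls back to the example values in the text when no partner user
    information has been received."""
    eye_height = partner_eye_height_m if partner_eye_height_m is not None else 1.50
    distance = partner_distance_m if partner_distance_m is not None else 0.5
    y = eye_height - display_height_m / 2.0   # eye height above the display center (assumption)
    return (0.0, y, -distance)
```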
FIG. 6 is a flowchart explaining the base image generation processing in which the base image generation unit 31 generates the base image.
In step S21, the base image generation unit 31 converts the four images acquired by the image acquisition unit 21 and the depths generated for those images by the depth generation unit 22 into a point cloud in the global coordinate system shown in FIG. 5. As a result, a single point cloud is synthesized that represents, as a set of points, the surface of the user as seen from the imaging devices 13a to 13d.
In step S22, the base image generation unit 31 generates, as the base image, an image of the own-side user viewed from the virtual viewpoint P shown in FIG. 5 (the partner-side user's viewpoint), based on the global-coordinate-system point cloud synthesized in step S21.
As described above, the base image generation unit 31 can generate the base image viewed from the virtual viewpoint P based on the four images of the user captured from the viewpoints of the imaging devices 13a to 13d and the depths corresponding to those images.
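The following sketch outlines steps S21 and S22 under a pinhole camera model. The camera intrinsics, the camera-to-global extrinsics, and the simple point-splatting renderer are assumptions; the source only states that the depths are fused into one global point cloud and re-projected from the viewpoint P.

```python
import numpy as np

def depth_to_global_points(depth, color, K, R, t):
    """Back-project one depth map into the global coordinate system of FIG. 5.
    K: 3x3 intrinsics; R, t: camera-to-global rotation and translation (assumed
    known from calibration). Returns (N, 3) points and (N, 3) colors."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z], axis=1)
    return pts_cam @ R.T + t, color[valid]

def render_base_image(points, colors, viewpoint_p, f, width, height):
    """Step S22: project the fused point cloud onto an image plane at the virtual
    viewpoint P, keeping the nearest point per pixel (z-buffer)."""
    rel = points - np.asarray(viewpoint_p)     # viewpoint axes parallel to the global axes
    z = rel[:, 2]
    front = z > 0
    u = (f * rel[front, 0] / z[front] + width / 2).astype(int)
    v = (f * rel[front, 1] / z[front] + height / 2).astype(int)
    image = np.zeros((height, width, 3), dtype=np.uint8)
    zbuf = np.full((height, width), np.inf)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    for ui, vi, zi, ci in zip(u[inside], v[inside], z[front][inside], colors[front][inside]):
        if zi < zbuf[vi, ui]:
            zbuf[vi, ui] = zi
            image[vi, ui] = ci
    return image, zbuf
```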
<Description of processing of image composition unit>
Next, the processing performed in the image composition unit 33 will be described with reference to FIGS. 7 to 11.
In the case of state A described above with reference to FIG. 4, that is, when the user's hand is sufficiently far from the display device 12, the image composition unit 33 outputs the image selected by the base image generation unit 31 as it is. In the case of state B described above with reference to FIG. 4, that is, when the user's hand is visible although it is at the periphery of the screen, the image composition unit 33 outputs the base image generated by the base image generation unit 31 as it is.
In contrast, in the case of state C described above with reference to FIG. 4, that is, when the hand is so close that part of it cannot be imaged, the image composition unit 33 composites the user's hand onto the base image in which part of the hand is missing, and outputs the result. For example, the image composition unit 33 performs image composition processing that composites the user's hand onto the base image using the base image and the global-coordinate-system point cloud generated by the base image generation unit 31, together with the 3D hand model recorded in the data recording unit 32.
When compositing the user's hand onto the base image, the image composition unit 33 first estimates a depth Z0 at which the 3D hand model is to be placed.
A method of estimating the depth Z0 of the 3D hand model will be described with reference to FIG. 7.
A of FIG. 7 shows the user reaching toward the display device 12 in state B, and B of FIG. 7 shows the user reaching toward the display device 12 in state C. Assuming that the relative distance from the user's body to the hand does not change when state B changes to state C, the image composition unit 33 can estimate the hand depth Z0 from the body depth Zs at the time of state C by referring to the depth difference L1 between the body and the hand at the time of state B.
That is, while in state B, the image composition unit 33 detects the region in which the user's body appears and the region in which the user's hand appears from the image acquired by the image acquisition unit 21, and calculates the average depth of each region from the depth generated for that image by the depth generation unit 22. For example, learning performed in advance can be used to detect the region in which the user's body or hand appears. The image composition unit 33 then obtains the depth difference L1 by calculating the difference between the average depth of the region in which the user's body appears and the average depth of the region in which the user's hand appears, and records it in the data recording unit 32.
Thereafter, when state B changes to state C, the image composition unit 33 detects the region in which the user's body appears from the image acquired by the image acquisition unit 21, and calculates the average depth Zs of that region from the depth generated for the image by the depth generation unit 22. The image composition unit 33 then reads the depth difference L1 from the data recording unit 32 and subtracts it from the calculated average depth Zs, thereby obtaining the depth Z0 (Z0 = Zs - L1) of the user's hand that is now outside the angle of view that can be captured by the imaging devices 13a to 13d.
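A compact sketch of this estimate; the body and hand region masks are stand-ins for detectors that, as the source notes, can be learned in advance.

```python
def record_body_hand_offset(depth, body_mask, hand_mask, store):
    """While in state B: remember the average body-to-hand depth difference L1."""
    store["L1"] = depth[body_mask].mean() - depth[hand_mask].mean()

def estimate_hand_depth(depth, body_mask, store):
    """In state C: infer the depth Z0 of the now-invisible hand from the body depth Zs."""
    zs = depth[body_mask].mean()
    return zs - store["L1"]   # Z0 = Zs - L1
```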
Note that the depth Z0 of the 3D hand model may also be estimated using, for example, a depth camera 15 as shown in FIG. 8.
In the example shown in FIG. 8, the telepresence system 11 includes the depth camera 15 on the ceiling above the room in which the display device 12 is placed. The image composition unit 33 can therefore estimate the depth Z0 at which the 3D hand model is to be placed based on the distance from the display device 12 to the user's hand measured by the depth camera 15.
FIG. 9 shows an example of the base image generated by the base image generation unit 31 in state C.
As described above, in state C, part of the user's hand is outside the angle of view that can be captured by the imaging devices 13a to 13d and images in which part of the hand is not shown are acquired, so a base image in which part of the user's hand is missing is generated.
Therefore, the image composition unit 33 estimates the center position (uh, vh) of the missing hand region in the base image from the luminance I(ui, vi) of the pixels (ui, vi) of the base image, as expressed by the following equation (1).
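Equation (1) appears only as an image in this text. A plausible form, reading it as a luminance-weighted centroid over the pixels i of the hand region (an assumption not confirmed by the source), is:

u_h = ( Σ_i u_i · I(u_i, v_i) ) / ( Σ_i I(u_i, v_i) ),   v_h = ( Σ_i v_i · I(u_i, v_i) ) / ( Σ_i I(u_i, v_i) )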
The image composition unit 33 then projects the 3D hand model onto the base image based on the center position (uh, vh) of the hand region in the base image generated by the base image generation unit 31 and the depth Z0 of the 3D hand model obtained as described above with reference to FIG. 7.
As shown in FIG. 10, the local coordinate system T of the 3D hand model can be defined, for example, as a right-handed coordinate system whose origin (XT0, YT0, ZT0) is the center of gravity of the hand, whose z axis points in the direction opposite to the palm, and whose y axis points in the direction of the middle finger. The entire 3D hand model is then projected onto the base image so that the center of gravity of the model is projected onto the center position (uh, vh) of the hand region in the base image viewed from the viewpoint P shown in FIG. 5.
The image composition unit 33 calculates the coordinates (XT0, YT0, ZT0) of the center of gravity of the 3D hand model in the coordinate system of the viewpoint P based on the following equation (2).
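Equation (2) also appears only as an image. Under a pinhole model with focal length f, with (u_h, v_h) measured from the principal point, and ignoring any offset between the display plane and the viewpoint P (all of which are assumptions), the back-projection of (u_h, v_h) at depth Z_0 would be:

X_T0 = u_h · Z_0 / f,   Y_T0 = v_h · Z_0 / f,   Z_T0 = Z_0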
Note that f in equation (2) is the focal length used when generating the base image of the viewpoint P.
Here, when the user tries to align his or her own hand with the hand of the partner-side user displayed on the display device 12, it is assumed that the x, y, and z axes of the local coordinate system T of the 3D hand model are parallel to the x, y, and z axes of the coordinate system of the virtual viewpoint P. Accordingly, when each point in the local coordinate system T of the 3D hand model is converted into the coordinate system of the virtual viewpoint P, no rotation is required and only a translation needs to be applied.
Then, for each point i of the 3D hand model, the image composition unit 33 converts its coordinates (XTi, YTi, ZTi) in the local coordinate system T of the 3D hand model into coordinates (XPi, YPi, ZPi) in the coordinate system of the virtual viewpoint P according to the following equation (3).
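Equation (3) is likewise given only as an image. Since only a translation by the centroid coordinates is required, it is presumably the rigid shift:

X_Pi = X_Ti + X_T0,   Y_Pi = Y_Ti + Y_T0,   Z_Pi = Z_Ti + Z_T0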
Further, the image composition unit 33 obtains the pixel (ui, vi) at which each point i of the 3D hand model is projected from the coordinate system of the virtual viewpoint P onto the base image, by calculating the following equation (4).
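Equation (4) appears only as an image as well; under the same pinhole assumption as above it would be the perspective projection:

u_i = f · X_Pi / Z_Pi,   v_i = f · Y_Pi / Z_Pi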
Here, according to equation (4), the depth of the pixel (ui, vi) in the depth of the base image becomes ZPi. Note that a plurality of points of the 3D hand model may be projected onto the same pixel of the base image; in that case, the point with the smallest depth among them is selected as the one projected onto the base image.
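Putting equations (2) to (4) together, a sketch of the hand-model projection might look like the following. It reuses the z-buffer produced when the base image was rendered (an assumption) so that model points behind already-drawn geometry are discarded, and it assumes (u_h, v_h) are measured from the image center, matching equation (2).

```python
import numpy as np

def composite_hand(base_image, base_zbuf, hand_points, hand_colors, u_h, v_h, z0, f):
    """Project the 3D hand model into the base image around the estimated hand center.
    hand_points are in the local coordinate system T (origin at the hand centroid)."""
    h, w = base_zbuf.shape
    # Equation (2): centroid of the model in the viewpoint-P coordinate system.
    t0 = np.array([u_h * z0 / f, v_h * z0 / f, z0])
    # Equation (3): translation only, since the axes of T and P are assumed parallel.
    pts_p = hand_points + t0
    # Equation (4): perspective projection; the nearest point wins per pixel.
    out, zbuf = base_image.copy(), base_zbuf.copy()
    for (x, y, z), c in zip(pts_p, hand_colors):
        if z <= 0:
            continue
        u, v = int(f * x / z + w / 2), int(f * y / z + h / 2)
        if 0 <= u < w and 0 <= v < h and z < zbuf[v, u]:
            zbuf[v, u] = z
            out[v, u] = c
    return out
```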
FIG. 11 is a flowchart explaining the image composition processing in which the image composition unit 33 composites the user's hand onto the base image.
In step S31, the image composition unit 33 estimates the depth Z0 at which the 3D hand model is to be placed, for example as described above with reference to FIG. 7.
In step S32, the image composition unit 33 estimates the center position of the missing hand region in the base image generated by the base image generation unit 31, that is, the center position (uh, vh) shown in FIG. 9 described above.
In step S33, the image composition unit 33 projects the 3D hand model onto the base image based on the depth Z0 estimated in step S31 and the center position (uh, vh) of the hand region in the base image estimated in step S32. In this way, the image composition unit 33 can composite the hand onto the base image in which part of the user's hand is missing, and generate an image in which the user's hand is visible.
As described above, the image composition unit 33 can perform image composition so that the user's hand is appropriately displayed, based on the depth at which the 3D hand model is placed and the center position of the missing hand region in the base image.
<Description of processing of image processing device>
The processing executed in the image processing device 14 will be described with reference to the flowchart shown in FIG. 12.
For example, the processing starts when the telepresence system 11 is activated. In step S41, the image acquisition unit 21 acquires the four images of the user captured from the viewpoints of the imaging devices 13a to 13d and supplies them to the depth generation unit 22.
In step S42, the depth generation unit 22 performs the depth generation processing (the flowchart of FIG. 3 described above) that generates the depths for the four images supplied from the image acquisition unit 21 in step S41. The depth generation unit 22 then supplies the four images and the depth corresponding to each image to the base image generation unit 31 and the image composition unit 33.
In step S43, the base image generation unit 31 obtains the distance from the display device 12 to the user's hand based on the depths supplied in step S42, and determines whether the user's hand is far enough away that it does not look unnatural to the partner-side user.
If the base image generation unit 31 determines in step S43 that the user's hand is sufficiently far away (state A in FIG. 4), the processing proceeds to step S44. In step S44, the base image generation unit 31 selects one of the four images captured by the imaging devices 13a to 13d as the image to be displayed on the display device 12 of the partner-side telepresence system 11.
On the other hand, if the base image generation unit 31 determines in step S43 that the user's hand is not sufficiently far away (state B or state C in FIG. 4), the processing proceeds to step S45. In step S45, the base image generation unit 31 performs the base image generation processing (the flowchart of FIG. 6 described above) that generates the base image using the four images captured by the imaging devices 13a to 13d and the depths generated by the depth generation unit 22 in step S42.
After the base image generation processing in step S45, the processing proceeds to step S46. In step S46, the image composition unit 33 determines, based on the distance from the display device 12 to the user's hand obtained from the depths generated by the depth generation unit 22 in step S42, whether the hand is so close that part of it cannot be imaged.
If the image composition unit 33 determines in step S46 that the hand is so close that part of it cannot be imaged (state C in FIG. 4), the processing proceeds to step S47. In step S47, the image composition unit 33 performs the image composition processing (the flowchart of FIG. 11 described above) that composites the user's hand onto the base image generated in step S45 based on the 3D hand model recorded in the data recording unit 32.
After the image composition processing in step S47, the processing proceeds to step S48, where the image composition unit 33 supplies the image in which the user's hand is composited onto the base image to the image transmission unit 24, and the image transmission unit 24 transmits that image. Meanwhile, after the processing of step S44, the processing proceeds to step S48, where the image composition unit 33 supplies the image selected by the base image generation unit 31 to the image transmission unit 24, and the image transmission unit 24 transmits that image. If it is determined in step S46 that the hand is not so close that part of it cannot be imaged (state B in FIG. 4), the processing proceeds to step S48, where the image composition unit 33 supplies the base image generated by the base image generation unit 31 in step S45 to the image transmission unit 24 as it is, and the image transmission unit 24 transmits that base image.
After the processing of step S48, the processing returns to step S41, and the same processing is repeated for the next images captured by the imaging devices 13a to 13d.
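Tying steps S41 to S48 together, one frame of the pipeline might be organized as follows. The helper functions correspond to the sketches above, the names written in capitals are assumed configuration values, and measure_hand_distance, fuse_point_cloud, estimate_missing_hand_center, and detect_body are assumed helpers rather than components named by the source.

```python
def process_frame(images, store, transmit):
    """One iteration of the loop in FIG. 12 (steps S41 to S48), using the sketches above."""
    depths = generate_depths(*images)                               # step S42
    hand_distance = measure_hand_distance(images, depths)           # used in steps S43 / S46
    state = classify_state(hand_distance)
    if state == "A":
        display_image = images[0]                                   # step S44: one captured image as-is
    else:
        points, colors = fuse_point_cloud(images, depths)           # step S45 (S21)
        display_image, zbuf = render_base_image(points, colors, VIEWPOINT_P, F, W, H)  # step S45 (S22)
        if state == "C":                                            # step S47
            u_h, v_h = estimate_missing_hand_center(display_image)  # equation (1)
            z0 = estimate_hand_depth(depths[0], detect_body(images[0]), store)  # FIG. 7
            display_image = composite_hand(display_image, zbuf,
                                           HAND_POINTS, HAND_COLORS, u_h, v_h, z0, F)
    transmit(display_image)                                          # step S48
```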
As described above, based on the images captured by the imaging devices 13a to 13d arranged around the display device 12, the image processing device 14 can realize image processing that allows the users to put their hands together without a sense of incongruity. The telepresence system 11 can thereby provide friendlier communication.
<Second configuration example of image processing apparatus>
FIG. 13 is a block diagram showing a configuration example of the second embodiment of the image processing device 14. In the image processing device 14A shown in FIG. 13, configurations common to the image processing device 14 of FIG. 2 are denoted by the same reference numerals, and their detailed description is omitted.
Like the image processing device 14 of FIG. 2, the image processing device 14A includes an image acquisition unit 21, a depth generation unit 22, and an image transmission unit 24, and its image generation unit 23A includes a base image generation unit 31, a data recording unit 32, and an image composition unit 33.
The image processing device 14A differs from the image processing device 14 of FIG. 2 in that, in the image generation unit 23A, the base image generation unit 31 supplies the global-coordinate-system point cloud to the data recording unit 32 to be recorded.
For example, in the image generation unit 23A, when the base image generation unit 31 determines that the user's hand is in state B, in which it appears at the periphery of the screen, it supplies the point cloud synthesized from the depths generated by the depth generation unit 22 for the four images to the data recording unit 32. Then, when the image composition unit 33 determines that the hand is in state C, in which it is so close that part of it cannot be imaged, it aligns the state B point cloud recorded in the data recording unit 32 with the state C point cloud and composites the user's hand onto the base image.
FIG. 14 shows an example of aligning two point clouds.
For example, in the state B point cloud shown at the upper left of FIG. 14, the user's hand is present, whereas in the state C point cloud shown at the lower left of FIG. 14, the part corresponding to the user's hand is missing.
Therefore, the image composition unit 33 aligns these two point clouds using, for example, a technique such as ICP (Iterative Closest Point), and obtains the part missing from the state C point cloud from the state B point cloud. In this case, the point clouds are aligned on the assumption that, when state B changes to state C, the hand hardly moves relative to the user's body and only the distance to the display device 12 changes.
Thereafter, as described above, the image composition unit 33 projects the state C point cloud, whose hand part has been completed from the state B point cloud, onto the base image of the user viewed from the virtual viewpoint P, thereby compositing the user's hand. For example, the image composition unit 33 can place the state B point cloud and the state C point cloud in the coordinate system of the virtual viewpoint P as shown in FIG. 5 and project the hand part of the point cloud onto the base image by calculating equation (4) described above.
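A minimal sketch of this completion step, assuming a basic point-to-point ICP (nearest neighbours via a k-d tree, rigid transform by SVD); libraries such as Open3D provide equivalent registration routines, and the hand segmentation of the state B cloud is assumed to be available from the earlier region detection.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_align(source, target, iterations=20):
    """Rigidly align `source` (state B cloud) to `target` (state C cloud)."""
    R, t = np.eye(3), np.zeros(3)
    src = source.copy()
    tree = cKDTree(target)
    for _ in range(iterations):
        _, idx = tree.query(src)
        matched = target[idx]
        mu_s, mu_t = src.mean(0), matched.mean(0)
        U, _, Vt = np.linalg.svd((src - mu_s).T @ (matched - mu_t))
        R_step = Vt.T @ U.T
        if np.linalg.det(R_step) < 0:      # keep a proper rotation
            Vt[-1] *= -1
            R_step = Vt.T @ U.T
        t_step = mu_t - R_step @ mu_s
        src = src @ R_step.T + t_step
        R, t = R_step @ R, R_step @ t + t_step
    return R, t

def complete_hand(cloud_c, cloud_b, hand_mask_b):
    """Fill the missing hand region of the state C cloud with the aligned
    hand points of the state B cloud."""
    R, t = icp_align(cloud_b, cloud_c)
    hand_points = cloud_b[hand_mask_b] @ R.T + t
    return np.vstack([cloud_c, hand_points])
```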
In this way, the image generation unit 23A does not use a pre-created 3D hand model but instead uses the state B point cloud from several frames earlier to composite the user's hand onto the base image, thereby realizing communication in which the users put their hands together.
<Third configuration example of image processing apparatus>
FIG. 15 is a block diagram showing a configuration example of the third embodiment of the image processing device 14. In the image processing device 14B shown in FIG. 15, configurations common to the image processing device 14 of FIG. 2 are denoted by the same reference numerals, and their detailed description is omitted.
That is, like the image processing device 14 of FIG. 2, the image processing device 14B includes an image acquisition unit 21, a depth generation unit 22, and an image transmission unit 24. The image generation unit 23B of the image processing device 14B includes a user recognition unit 34 in addition to the base image generation unit 31, the data recording unit 32, and the image composition unit 33.
The data recording unit 32 records a 3D hand model for each of a plurality of users, together with the characteristics of each user. A user ID (identification) for identifying each user is also set, and the data recording unit 32 supplies the 3D hand model corresponding to the user ID specified by the user recognition unit 34 to the image composition unit 33.
The user recognition unit 34 detects the user's characteristics based on the image obtained from the base image generation unit 31, refers to the user characteristics recorded in the data recording unit 32, and specifies the user ID corresponding to the detected characteristics to the data recording unit 32. For example, the user recognition unit 34 detects a face using a face detection method as the user characteristic, and recognizes the user by a face recognition method using the facial features shown in the image and the facial features of each user recorded in the data recording unit 32. The user recognition unit 34 can then specify, to the data recording unit 32, the user ID of the user recognized as the same person from the facial features. For the face detection and face recognition methods of the user recognition unit 34, a learning technique such as deep learning can be used, for example.
In this way, by recording the 3D hand models of a plurality of users in the data recording unit 32 in advance and recognizing the user with the user recognition unit 34, the image processing device 14B can composite the hand of each of the plurality of users onto the base image.
<Fourth configuration example of image processing apparatus>
FIG. 16 is a block diagram showing a configuration example of the fourth embodiment of the image processing device 14. In the image processing device 14C shown in FIG. 16, configurations common to the image processing device 14 of FIG. 2 are denoted by the same reference numerals, and their detailed description is omitted.
That is, like the image processing device 14 of FIG. 2, the image processing device 14C includes an image acquisition unit 21, a depth generation unit 22, an image generation unit 23, and an image transmission unit 24. The image processing device 14C further includes a hand-matching recognition unit 25.
Using any one of the depths generated by the depth generation unit 22 for the four images and the image corresponding to that depth, the hand-matching recognition unit 25 recognizes the user's intention to perform the communication of putting the hands together. The hand-matching recognition unit 25 then transmits the recognition result to the partner-side telepresence system 11 via a network (not shown).
For example, the hand-matching recognition unit 25 recognizes the region in which a hand appears in the image and extracts the depth of the recognized hand region. Then, referring to the extracted depth of the hand region, the hand-matching recognition unit 25 can recognize that the user intends to perform the hand-matching communication when it determines that the user's hand is within a predetermined distance.
Alternatively, for example, the hand-matching recognition unit 25 records the depth of the extracted hand region from several frames earlier, and determines whether the user's hand is approaching based on the depth of the hand region several frames earlier and the depth of the hand region in the current frame. When it determines that the user's hand is approaching, the hand-matching recognition unit 25 can recognize that the user intends to perform the hand-matching communication.
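A small sketch combining the two criteria above (absolute distance and approach over a few frames); the threshold value, the history length, and the hand-region detection feeding this class are assumptions.

```python
from collections import deque

class HandMatchingRecognizer:
    """Recognize the intention to put hands together from the hand-region depth."""
    def __init__(self, near_threshold_m=0.4, history=5):
        self.near_threshold_m = near_threshold_m
        self.recent = deque(maxlen=history)

    def update(self, hand_depth_m):
        """hand_depth_m: average depth of the detected hand region in the current frame."""
        approaching = bool(self.recent) and hand_depth_m < min(self.recent)
        self.recent.append(hand_depth_m)
        # Intention is recognized when the hand is close enough or steadily approaching.
        return hand_depth_m <= self.near_threshold_m or approaching
```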
In this way, the image processing device 14C can recognize whether or not the user intends to perform the communication of putting the hands together, and when it recognizes that intention, it can more reliably perform image processing that brings the users' hands together.
Furthermore, the image processing device 14 can give feedback to this communication, for example by outputting a specific sound or producing a specific visual effect at the moment the users' hands meet through the telepresence system 11.
In the present embodiment, communication in which the users put their hands together through the telepresence system 11 has been described as an example, but the telepresence system 11 can also perform image processing other than compositing the user's hand onto the base image. For example, when the user's body, face, or the like moves outside the angle of view that can be captured by the imaging devices 13a to 13d, the telepresence system 11 can perform image processing that composites a corresponding 3D model of the body, face, or the like onto the base image.
Note that the processes described with reference to the above flowcharts do not necessarily have to be performed in time series in the order described in the flowcharts, and they include processes executed in parallel or individually (for example, parallel processing or object-based processing). The program may be processed by a single CPU, or may be processed in a distributed manner by a plurality of CPUs.
The series of processes described above (the image processing method) can be executed by hardware or by software. When the series of processes is executed by software, the program constituting the software is installed, from a program recording medium on which the program is recorded, into a computer incorporated in dedicated hardware, or into, for example, a general-purpose personal computer capable of executing various functions by installing various programs.
FIG. 17 is a block diagram showing a configuration example of the hardware of a computer that executes the above-described series of processes by a program.
In the computer, a CPU (Central Processing Unit) 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, and an EEPROM (Electronically Erasable and Programmable Read Only Memory) 104 are connected to one another by a bus 105. An input/output interface 106 is further connected to the bus 105, and the input/output interface 106 is connected to the outside (for example, the imaging devices 13a to 13d in FIG. 1 or a communication device, not shown).
In the computer configured as described above, the CPU 101 loads, for example, the programs stored in the ROM 102 and the EEPROM 104 into the RAM 103 via the bus 105 and executes them, whereby the above-described series of processes is performed. The programs executed by the computer (CPU 101) can be written into the ROM 102 in advance, or can be installed into or updated in the EEPROM 104 from the outside via the input/output interface 106.
In this specification, a system refers to an entire apparatus composed of a plurality of devices.
 Note that the present technology can also take the following configurations.
(1)
An image processing device including:
an image acquisition unit that acquires a plurality of images of a user captured from a plurality of viewpoints;
an image generation unit that, on the basis of the plurality of images, generates a display image in which a part of the user is displayed appropriately when that part goes outside the angle of view at which the plurality of images are captured; and
an image transmission unit that transmits the display image generated by the image generation unit.
(2)
The image processing device according to (1), further including:
a depth generation unit that generates, on the basis of the plurality of images, a depth representing the depth in each of the images,
in which the image generation unit generates the display image on the basis of the plurality of images and the depths respectively corresponding to them.
(3)
The image processing device according to (2), in which the image generation unit has a base image generation unit that generates, on the basis of the plurality of images and the depths respectively corresponding to them, a base image that is an image of the user viewed from a virtual viewpoint different from the plurality of viewpoints.
(4)
The image processing device according to (3), in which the virtual viewpoint is set to the viewpoint of the counterpart user to whom the image transmission unit transmits the display image.
(5)
The image processing device according to (3) or (4), in which a plurality of imaging devices that capture the user are arranged around a display device that displays a display image transmitted from the counterpart to whom the image transmission unit transmits the display image, and
the base image generation unit generates the base image when the distance from the display device to the part of the user is less than a first distance at which the part of the user appears at approximately the center of each of the plurality of images.
(6)
The image processing device according to (5), in which, when the distance from the display device to the part of the user is equal to or greater than the first distance, the image transmission unit transmits, as the display image, any one of the plurality of images of the user captured from the plurality of viewpoints.
(7)
The image processing device according to (5) or (6), in which the image generation unit further has an image composition unit that, when the distance from the display device to the part of the user is less than a second distance at which the part of the user no longer appears in the plurality of images, composites onto the base image the part of the user that has gone outside the angle of view at which the plurality of images are captured, and
the image transmission unit transmits, as the display image, the image in which the part of the user has been composited onto the base image by the image composition unit.
(8)
The image processing device according to (7), in which, when the distance from the display device to the part of the user is less than the first distance and equal to or greater than the second distance, the image transmission unit transmits, as the display image, the base image generated by the base image generation unit.
(9)
The image processing device according to (7) or (8), in which the image generation unit further has a data recording unit that records a 3D model in which the part of the user is formed three-dimensionally, and
the image composition unit composites the part of the user onto the base image using the 3D model recorded in the data recording unit.
(10)
The image processing device according to any one of (7) to (9), in which the data recording unit records the point cloud of the user that the base image generation unit used to generate the base image when the distance from the display device to the part of the user is less than the first distance and equal to or greater than the second distance, and
the image composition unit composites the part of the user onto the base image using the point cloud recorded in the data recording unit when the distance from the display device to the part of the user becomes less than the second distance at which the part of the user no longer appears in the plurality of images.
(11)
The image processing device according to (9), in which the image generation unit further has a user recognition unit that recognizes the user shown in the images,
the data recording unit records the 3D model for each of a plurality of users, and
the image composition unit composites the part of the user onto the base image using the 3D model corresponding to the user recognized by the user recognition unit.
(12)
The image processing device according to any one of (1) to (11), further including a communication recognition unit that recognizes that communication is performed in which the hand of the user on the own side is brought together with the hand of the counterpart user displayed on a display device that displays a display image transmitted from the counterpart to whom the image transmission unit transmits the display image.
(13)
An image processing method including the steps of:
acquiring a plurality of images of a user captured from a plurality of viewpoints;
generating, on the basis of the plurality of images, a display image in which a part of the user is displayed appropriately when that part goes outside the angle of view at which the plurality of images are captured; and
transmitting the generated display image.
(14)
A program for causing a computer to execute processing including the steps of:
acquiring a plurality of images of a user captured from a plurality of viewpoints;
generating, on the basis of the plurality of images, a display image in which a part of the user is displayed appropriately when that part goes outside the angle of view at which the plurality of images are captured; and
transmitting the generated display image.
(15)
A telepresence system including:
a display device that displays a display image transmitted from a counterpart;
a plurality of imaging devices that are arranged around the display device and capture a user from a plurality of viewpoints;
an image acquisition unit that acquires the images captured by each of the plurality of imaging devices;
an image generation unit that, on the basis of the plurality of images, generates a display image in which a part of the user is displayed appropriately when that part goes outside the angle of view at which the plurality of images are captured; and
an image transmission unit that transmits the display image generated by the image generation unit.
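 Configurations (5) to (10) above amount to a three-way switch on the distance from the display device to the user's hand. The following Python sketch is a minimal illustration of that decision logic only; the function names, the dictionary used for recording, and the numeric thresholds are all assumptions introduced for explanation and are not defined by the disclosure.

```python
# Illustrative sketch of the distance-based switching in configurations (5) to (10).
# Every name and threshold value here is an assumption made for explanation only.

FIRST_DISTANCE = 0.50   # assumed value: hand appears near the center of every captured image
SECOND_DISTANCE = 0.10  # assumed value: hand is about to leave every camera's angle of view


def select_display_image(captured_images, depths, hand_distance,
                         generate_base_image, composite_hand, recorded_data):
    """Choose the display image to transmit, following configurations (5) to (10).

    captured_images -- images of the user taken from the multiple viewpoints
    depths          -- depth maps corresponding to the captured images
    hand_distance   -- distance from the display device to the user's hand
    """
    if hand_distance >= FIRST_DISTANCE:
        # (6) The hand is comfortably inside every camera's view:
        # transmit one of the captured images as the display image.
        return captured_images[0]

    # (5) The hand is close to the display device: generate a base image that
    # views the user from a virtual viewpoint (per configuration (4), set to
    # the counterpart user's viewpoint).
    base_image = generate_base_image(captured_images, depths)

    if hand_distance >= SECOND_DISTANCE:
        # (8) The hand is still visible: transmit the base image itself; per (10),
        # the data used to build it (standing in for the point cloud) is recorded.
        recorded_data["point_cloud"] = (captured_images, depths)
        return base_image

    # (7)/(10) The hand has gone outside the cameras' angle of view: composite the
    # missing part onto the base image from the recorded data.
    return composite_hand(base_image, recorded_data.get("point_cloud"))
```

In terms of the reference signs listed below, the generate_base_image and composite_hand callables correspond roughly to the base image generation unit 31 and the image composition unit 33.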
 Note that the present embodiment is not limited to the embodiment described above, and various modifications can be made without departing from the gist of the present disclosure.
 11 telepresence system, 12 display device, 13a to 13d imaging device, 14 image processing device, 15 depth camera, 21 image acquisition unit, 22 depth generation unit, 23 image generation unit, 24 image transmission unit, 25 hand-matching recognition unit, 31 base image generation unit, 32 data recording unit, 33 image composition unit, 34 user recognition unit

Claims (15)

  1.  An image processing device comprising:
     an image acquisition unit that acquires a plurality of images of a user captured from a plurality of viewpoints;
     an image generation unit that, on the basis of the plurality of images, generates a display image in which a part of the user is displayed appropriately when that part goes outside the angle of view at which the plurality of images are captured; and
     an image transmission unit that transmits the display image generated by the image generation unit.
  2.  The image processing device according to claim 1, further comprising:
     a depth generation unit that generates, on the basis of the plurality of images, a depth representing the depth in each of the images,
     wherein the image generation unit generates the display image on the basis of the plurality of images and the depths respectively corresponding to them.
  3.  The image processing device according to claim 2, wherein the image generation unit has a base image generation unit that generates, on the basis of the plurality of images and the depths respectively corresponding to them, a base image that is an image of the user viewed from a virtual viewpoint different from the plurality of viewpoints.
  4.  The image processing device according to claim 3, wherein the virtual viewpoint is set to the viewpoint of the counterpart user to whom the image transmission unit transmits the display image.
  5.  The image processing device according to claim 3, wherein a plurality of imaging devices that capture the user are arranged around a display device that displays a display image transmitted from the counterpart to whom the image transmission unit transmits the display image, and
     the base image generation unit generates the base image when the distance from the display device to the part of the user is less than a first distance at which the part of the user appears at approximately the center of each of the plurality of images.
  6.  The image processing device according to claim 5, wherein, when the distance from the display device to the part of the user is equal to or greater than the first distance, the image transmission unit transmits, as the display image, any one of the plurality of images of the user captured from the plurality of viewpoints.
  7.  The image processing device according to claim 5, wherein the image generation unit further has an image composition unit that, when the distance from the display device to the part of the user is less than a second distance at which the part of the user no longer appears in the plurality of images, composites onto the base image the part of the user that has gone outside the angle of view at which the plurality of images are captured, and
     the image transmission unit transmits, as the display image, the image in which the part of the user has been composited onto the base image by the image composition unit.
  8.  The image processing device according to claim 7, wherein, when the distance from the display device to the part of the user is less than the first distance and equal to or greater than the second distance, the image transmission unit transmits, as the display image, the base image generated by the base image generation unit.
  9.  The image processing device according to claim 7, wherein the image generation unit further has a data recording unit that records a 3D model in which the part of the user is formed three-dimensionally, and
     the image composition unit composites the part of the user onto the base image using the 3D model recorded in the data recording unit.
  10.  The image processing device according to claim 7, wherein the data recording unit records the point cloud of the user that the base image generation unit used to generate the base image when the distance from the display device to the part of the user is less than the first distance and equal to or greater than the second distance, and
     the image composition unit composites the part of the user onto the base image using the point cloud recorded in the data recording unit when the distance from the display device to the part of the user becomes less than the second distance at which the part of the user no longer appears in the plurality of images.
  11.  The image processing device according to claim 9, wherein the image generation unit further has a user recognition unit that recognizes the user shown in the images,
     the data recording unit records the 3D model for each of a plurality of users, and
     the image composition unit composites the part of the user onto the base image using the 3D model corresponding to the user recognized by the user recognition unit.
  12.  The image processing device according to claim 1, further comprising a communication recognition unit that recognizes that communication is performed in which the hand of the user on the own side is brought together with the hand of the counterpart user displayed on a display device that displays a display image transmitted from the counterpart to whom the image transmission unit transmits the display image.
  13.  An image processing method comprising the steps of:
     acquiring a plurality of images of a user captured from a plurality of viewpoints;
     generating, on the basis of the plurality of images, a display image in which a part of the user is displayed appropriately when that part goes outside the angle of view at which the plurality of images are captured; and
     transmitting the generated display image.
  14.  A program for causing a computer to execute processing comprising the steps of:
     acquiring a plurality of images of a user captured from a plurality of viewpoints;
     generating, on the basis of the plurality of images, a display image in which a part of the user is displayed appropriately when that part goes outside the angle of view at which the plurality of images are captured; and
     transmitting the generated display image.
  15.  A telepresence system comprising:
     a display device that displays a display image transmitted from a counterpart;
     a plurality of imaging devices that are arranged around the display device and capture a user from a plurality of viewpoints;
     an image acquisition unit that acquires the images captured by each of the plurality of imaging devices;
     an image generation unit that, on the basis of the plurality of images, generates a display image in which a part of the user is displayed appropriately when that part goes outside the angle of view at which the plurality of images are captured; and
     an image transmission unit that transmits the display image generated by the image generation unit.
PCT/JP2017/024571 2016-07-19 2017-07-05 Image processing device, image processing method, program, and telepresence system WO2018016316A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2016141553 2016-07-19
JP2016-141553 2016-07-19

Publications (1)

Publication Number Publication Date
WO2018016316A1 true WO2018016316A1 (en) 2018-01-25

Family

ID=60992195

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2017/024571 WO2018016316A1 (en) 2016-07-19 2017-07-05 Image processing device, image processing method, program, and telepresence system

Country Status (1)

Country Link
WO (1) WO2018016316A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08123938A (en) * 1994-10-20 1996-05-17 Ohbayashi Corp Monitor device for remote control
US20100149310A1 (en) * 2008-12-17 2010-06-17 Microsoft Corporation Visual feedback for natural head positioning
JP2011138314A (en) * 2009-12-28 2011-07-14 Sharp Corp Image processor
JP2013025528A (en) * 2011-07-20 2013-02-04 Nissan Motor Co Ltd Image generation device for vehicles and image generation method for vehicles
JP2014056466A (en) * 2012-09-13 2014-03-27 Canon Inc Image processing device and method


Similar Documents

Publication Publication Date Title
US11887234B2 (en) Avatar display device, avatar generating device, and program
JP4553362B2 (en) System, image processing apparatus, and information processing method
WO2017141511A1 (en) Information processing apparatus, information processing system, information processing method, and program
JP2011090400A (en) Image display device, method, and program
JP5237234B2 (en) Video communication system and video communication method
JP4144492B2 (en) Image display device
JP5833526B2 (en) Video communication system and video communication method
EP2661077A1 (en) System and method for eye alignment in video
JP2008140271A (en) Interactive device and method thereof
JP5478357B2 (en) Display device and display method
CN108702482A (en) Information processing equipment, information processing system, information processing method and program
JP5731462B2 (en) Video communication system and video communication method
KR20150009789A (en) Digilog space generator for tele-collaboration in an augmented reality environment and digilog space generation method using the same
JP2011097447A (en) Communication system
TW201021546A (en) Interactive 3D image display method and related 3D display apparatus
JP2011113206A (en) System and method for video image communication
JP5712737B2 (en) Display control apparatus, display control method, and program
JP5759439B2 (en) Video communication system and video communication method
WO2018016316A1 (en) Image processing device, image processing method, program, and telepresence system
JP5833525B2 (en) Video communication system and video communication method
JP5898036B2 (en) Video communication system and video communication method
WO2018173205A1 (en) Information processing system, method for controlling same, and program
JP2015184986A (en) Compound sense of reality sharing device
JP5647813B2 (en) Video presentation system, program, and recording medium
JP5485102B2 (en) COMMUNICATION DEVICE, COMMUNICATION METHOD, AND PROGRAM

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17830836

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17830836

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP