CN111124107A - Hand and object complex interaction scene reconstruction method and device

Hand and object complex interaction scene reconstruction method and device

Info

Publication number
CN111124107A
CN111124107A (application CN201911113777.8A)
Authority
CN
China
Prior art keywords
data
hand
segmentation
prediction
rgbd
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911113777.8A
Other languages
Chinese (zh)
Inventor
徐枫
张浩
薄子豪
杨东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201911113777.8A priority Critical patent/CN111124107A/en
Publication of CN111124107A publication Critical patent/CN111124107A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00Manipulating 3D models or images for computer graphics
    • G06T19/006Mixed reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and a device for reconstructing complex hand-object interaction scenes, wherein the method comprises the following steps: collecting an RGBD sequence of a hand-object interaction scene with a single RGBD camera to obtain an RGBD image; feeding the RGBD image into a gesture prediction neural network for prediction to obtain left-hand pose prediction data and right-hand pose prediction data; feeding the RGBD image into a segmentation recognition neural network to obtain left-hand data, right-hand data, and segmentation data of different objects; and fusing the segmented depth data and color data of the different objects into an object model to obtain a final object model. For complex interactions between human hands and objects, the method can reconstruct the three-dimensional information of the interaction process from a sequence captured by a single RGBD camera, recover the pose motion of the human hand and the surfaces of the objects, and effectively address the high complexity and strong ambiguity involved in reconstructing complex interaction processes from a single RGBD camera.

Description

Hand and object complex interaction scene reconstruction method and device
Technical Field
The invention relates to the technical field of three-dimensional reconstruction, in particular to a method and a device for reconstructing a complex interaction scene of a hand and an object.
Background
In daily life, using the hands to interact with different objects in the environment is one of the most common human activities. Reconstructing the interaction process between the human hand and objects is of great value for AR/VR, human-computer interaction, and intelligent robotics.
The hand-object interaction process contains rich information. In many applications, such as AR and intelligent robotics, the interaction process between a hand and an object must be reconstructed to obtain its three-dimensional information. In reality, the interaction process between hands and objects is complex. This complexity manifests as complex hand motions, complex interacting objects, and strong ambiguity caused by mutual occlusion between the hands and the objects during interaction.
Reconstruction with a single RGBD camera has the advantage of a simple system, but a single camera also restricts observation to a single viewpoint, which limits the amount of effective information that can be obtained and increases the difficulty of reconstruction. In summary, reconstructing the interaction process from a single RGBD camera is a very meaningful and at the same time very challenging task.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, one object of the present invention is to provide a method for reconstructing complex hand-object interaction scenes, which, for a complex interaction process between a human hand and an object, can reconstruct the three-dimensional information of the interaction process from a sequence captured by a single RGBD camera, recover the pose motion of the human hand and the surfaces of the objects, and effectively address the high complexity and strong ambiguity involved in reconstructing complex interaction processes from a single RGBD camera.
Another object of the present invention is to provide a hand and object complex interaction scene reconstruction apparatus.
In order to achieve the above object, an embodiment of the present invention provides a method for reconstructing a complex interaction scene between a hand and an object, including the following steps: collecting an RGBD sequence of a hand-object interaction scene with a single RGBD camera to obtain an RGBD image; feeding the RGBD image into a gesture prediction neural network for prediction to obtain left-hand pose prediction data and right-hand pose prediction data; feeding the RGBD image into a segmentation recognition neural network to obtain left-hand data, right-hand data, and segmentation data of different objects; and fusing the segmented depth data and color data of the different objects into an object model to obtain a final object model.
According to the method for reconstructing a complex hand-object interaction scene of the embodiment of the present invention, reconstruction is performed with a single RGBD camera, so the system is simple; not only the motion of the human hand but also the geometry and motion of the objects are reconstructed, so the reconstructed information is complete; complex interactions involving both hands and multiple objects can be handled; and by combining human hand pose estimation, object recognition and segmentation, unified energy optimization, and multi-object reconstruction into one reconstruction scheme for the complex interaction process, complete three-dimensional information of the interaction process is finally obtained. For complex interactions between human hands and objects, the method can thus reconstruct the three-dimensional information of the interaction process from a sequence captured by a single RGBD camera, recover the pose motion of the human hand and the surfaces of the objects, and effectively address the high complexity and strong ambiguity involved in reconstructing complex interaction processes from a single RGBD camera.
In addition, the hand and object complex interaction scene reconstruction method according to the above embodiment of the present invention may further have the following additional technical features:
Further, in an embodiment of the present invention, collecting the RGBD sequence includes: aligning the RGB information and the D information.
Further, in an embodiment of the present invention, feeding the RGBD image into the gesture prediction neural network for prediction includes: using the open-source library OpenPose, or training on the basis of that library, to obtain the left-hand pose prediction data and the right-hand pose prediction data.
Further, in an embodiment of the present invention, feeding the RGBD image into the segmentation recognition neural network includes: using Mask-RCNN, or training on the basis of Mask-RCNN, to obtain the left-hand data, the right-hand data, and the segmentation data of the different objects.
Further, in an embodiment of the present invention, fusing the segmented depth data and color data of the different objects into the object model includes: performing a unified optimization motion solution according to the left-hand pose prediction data, the right-hand pose prediction data, the left-hand data, the right-hand data, and the segmentation data of the different objects, so as to obtain a final object surface reconstruction result.
In order to achieve the above object, an embodiment of another aspect of the present invention provides a device for reconstructing a complex interaction scene between a hand and an object, including: the acquisition module is used for acquiring an RGBD sequence of a hand-object interaction scene by using a single RGBD camera to obtain an RGBD image; the prediction module is used for sending the RGBD image into a gesture prediction neural network for prediction to obtain left-hand gesture prediction data and right-hand gesture prediction data; the segmentation module is used for sending the RGBD image into a segmentation recognition neural network to obtain left hand data, right hand data and segmentation data of different objects; and the fusion module is used for fusing the depth data and the color data of the different objects obtained by segmentation into the object model to obtain the final object model.
The device for reconstructing a complex hand-object interaction scene provided by the embodiment of the present invention performs reconstruction with a single RGBD camera, so the system is simple; not only the motion of the human hand but also the geometry and motion of the objects are reconstructed, so the reconstructed information is complete; complex interactions involving both hands and multiple objects can be handled; and by combining human hand pose estimation, object recognition and segmentation, unified energy optimization, and multi-object reconstruction into one reconstruction scheme for the complex interaction process, complete three-dimensional information of the interaction process is finally obtained. For complex interactions between human hands and objects, the device can thus reconstruct the three-dimensional information of the interaction process from a sequence captured by a single RGBD camera, recover the pose motion of the human hand and the surfaces of the objects, and effectively address the high complexity and strong ambiguity involved in reconstructing complex interaction processes from a single RGBD camera.
In addition, the hand and object complex interaction scene reconstruction device according to the above embodiment of the present invention may further have the following additional technical features:
Further, in an embodiment of the present invention, the acquisition module is further configured to align the RGB information and the D information.
Further, in an embodiment of the present invention, the prediction module is further configured to use the open-source library OpenPose, or to train on the basis of that library, to obtain the left-hand pose prediction data and the right-hand pose prediction data.
Further, in an embodiment of the present invention, the segmentation module is further configured to use Mask-RCNN, or to train on the basis of Mask-RCNN, to obtain the left-hand data, the right-hand data, and the segmentation data of the different objects.
Further, in an embodiment of the present invention, the fusion module is further configured to perform a unified optimization motion solution according to left-hand pose prediction data, right-hand pose prediction data, left-hand data, right-hand data, and segmentation data of different objects, so as to obtain the final object surface reconstruction result.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flowchart of a method for reconstructing a complex interaction scene between a hand and an object according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for reconstructing a complex interaction scene of a hand and an object according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a hand-object complex interaction scene reconstruction apparatus according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
The method and the device for reconstructing a complex interaction scene between a hand and an object according to an embodiment of the present invention are described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a method for reconstructing a complex interaction scene between a hand and an object according to an embodiment of the present invention.
As shown in FIG. 1, the method for reconstructing the complex interaction scene of the hand and the object comprises the following steps:
In step S101, an RGBD sequence of a hand-object interaction scene is acquired with a single RGBD camera, so as to obtain an RGBD image.
In one embodiment of the present invention, acquiring the RGBD sequence includes aligning the RGB information and the D information.
It is understood that the single RGBD camera may be, for example, a Realsense SR300 camera; many types of single RGBD cameras exist, and this one is named only as an example to avoid redundancy, without limitation. As shown in fig. 2, an embodiment of the present invention may acquire the RGBD sequence of the hand-object interaction scene with a Realsense SR300 camera. It should be noted that, since the RGB information and the D information originate from two different sensors, the RGB and D data need to be aligned.
It should be noted that, in the embodiment of the present invention, an RGBD image with a resolution of 640 × 480 is used, but of course, images with other resolutions may also be used, and are not limited in particular.
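By way of a non-limiting illustration only, the capture and alignment of step S101 could be implemented with an off-the-shelf RGBD SDK. The following sketch assumes the pyrealsense2 library together with the Realsense SR300 camera and the 640 × 480 resolution mentioned above; the stream settings are assumptions, not features of the embodiment:

```python
# Illustrative sketch only: capture one aligned RGB-D frame with pyrealsense2.
# The SDK choice and stream settings are assumptions, not part of the patent.
import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)

# Depth and color come from two different sensors, so depth is reprojected
# into the color frame to align the D information with the RGB information.
align = rs.align(rs.stream.color)

try:
    frames = pipeline.wait_for_frames()
    aligned = align.process(frames)
    depth = np.asanyarray(aligned.get_depth_frame().get_data())  # uint16 depth
    color = np.asanyarray(aligned.get_color_frame().get_data())  # uint8 BGR
finally:
    pipeline.stop()
```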
In step S102, the RGBD image is sent to the gesture prediction neural network for prediction, so as to obtain left-hand pose prediction data and right-hand pose prediction data.
It can be understood that, as shown in fig. 2, the RGBD image is input into the gesture prediction neural network for prediction, so as to obtain the left-hand and right-hand pose prediction data.
Further, in an embodiment of the present invention, feeding the RGBD image into the gesture prediction neural network for prediction includes: using the open-source library OpenPose, or training on the basis of that library, to obtain the left-hand pose prediction data and the right-hand pose prediction data.
It can be understood that, in order to estimate hand pose information from the input RGBD image, the open-source library OpenPose may be used directly, or further training may be performed on its basis to adapt it to the input of the system and thereby obtain more accurate hand pose predictions.
Specifically, the embodiment of the present invention uses a deep neural network to train a network capable of predicting hand poses during interaction. Existing research has shown that deep neural networks perform well in human body and hand pose estimation and can produce feasible solutions close to the ground truth. The hand pose estimates obtained by the deep neural network provide a good initial solution for the hand parts visible to the single camera and a reasonable solution for the invisible parts.
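As a non-limiting sketch of what the OpenPose-based prediction step could look like through the library's Python bindings (the model path, parameters, and post-processing below are assumptions, and the patent further contemplates additional training on this basis):

```python
# Illustrative sketch: per-frame hand keypoint prediction with the OpenPose
# Python bindings (pyopenpose). Paths and parameters are assumptions.
import cv2
import pyopenpose as op

params = {"model_folder": "openpose/models/", "hand": True}
wrapper = op.WrapperPython()
wrapper.configure(params)
wrapper.start()

datum = op.Datum()
datum.cvInputData = cv2.imread("frame_color.png")  # aligned RGB frame
wrapper.emplaceAndPop(op.VectorDatum([datum]))

# handKeypoints is a pair [left, right]; each entry has shape
# (num_people, 21, 3) holding (x, y, confidence) per hand joint.
left_hand_pred, right_hand_pred = datum.handKeypoints
```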
In step S103, the RGBD image is sent to a segmentation recognition neural network, so as to obtain left-hand data, right-hand data, and segmentation data of different objects.
It can be understood that, as shown in fig. 2, the RGBD image is fed into the segmentation recognition neural network, and left-hand data, right-hand data, and segmentation data of different objects are obtained.
Further, in an embodiment of the present invention, feeding the RGBD image into the segmentation recognition neural network includes: using Mask-RCNN, or training on the basis of Mask-RCNN, to obtain the left-hand data, the right-hand data, and the segmentation data of the different objects.
It can be understood that, in order to obtain instance segmentation data from the input RGBD image, Mask-RCNN may be used directly, or further training may be performed on its basis for better results. The focus of this step is the acquisition of training data.
Specifically, the embodiment of the present invention uses a neural network to train a network capable of performing instance segmentation on the data acquired by the single RGBD camera. With this segmentation recognition network, the collected left-hand and right-hand data and the data of the multiple objects in the interaction can be separated, providing guidance both for the unified energy optimization that solves the motion of the human hands and the multiple objects and for the reconstruction of the multiple object surfaces.
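As a non-limiting sketch of the instance segmentation step, a COCO-pretrained Mask R-CNN from torchvision could stand in for the fine-tuned network described above; the score threshold and class handling below are assumptions:

```python
# Illustrative sketch: instance segmentation with torchvision's Mask R-CNN.
# A COCO-pretrained model stands in for the network the patent fine-tunes on
# its own left-hand / right-hand / object training data.
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)  # stand-in for an aligned 640x480 RGB frame
with torch.no_grad():
    pred = model([image])[0]  # dict with 'boxes', 'labels', 'scores', 'masks'

keep = pred["scores"] > 0.5             # assumed confidence threshold
masks = pred["masks"][keep, 0] > 0.5    # (N, H, W) boolean instance masks
labels = pred["labels"][keep]           # instance class ids
```

After fine-tuning on the training data discussed above, the label set would distinguish the left hand, the right hand, and the individual object instances.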
In step S104, the depth data and color data of the different objects obtained by the segmentation are fused into the object model to obtain a final object model.
In one embodiment of the present invention, fusing the segmented depth data and color data of the different objects into the object model includes: performing a unified optimization motion solution according to the left-hand pose prediction data, the right-hand pose prediction data, the left-hand data, the right-hand data, and the segmentation data of the different objects, so as to obtain a final object surface reconstruction result.
It can be understood that, as shown in fig. 2, the embodiment of the present invention feeds the predicted left-hand and right-hand poses, the segmented left-hand and right-hand depth data, and the depth and color data of the objects into a unified energy optimization framework for solving, so as to obtain the accurate hand poses and the motion of the multiple objects. On the basis of the accurate object motion, the depth data and color data of the different objects obtained by instance segmentation are fused into the object models, finally yielding complete object models.
Specifically, on the basis of the rough pose estimates of the hands in interaction and the preliminary data segmentation, unified energy optimization is used to obtain the accurate hand poses and the motion of each object. The segmented data of each object are then fused using the solved motion information. The pose motion of the human hands and the complete surfaces of the multiple objects in the interaction are finally reconstructed.
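The patent does not name a specific fusion data structure; as one plausible, non-limiting reading, each object could be represented by a truncated signed distance (TSDF) volume that is updated with that object's masked depth pixels once its motion has been solved. The following sketch, in which all names, the truncation value, and the averaging rule are assumptions, illustrates one such update:

```python
# Illustrative TSDF-style sketch: fuse one object's masked depth pixels into
# a signed-distance volume using the per-object motion from the unified
# optimization. Volume layout, truncation, and weighting are assumptions.
import numpy as np

def fuse_depth_into_volume(tsdf, weight, depth, mask, K, T_wc,
                           voxel_size=0.004, origin=np.zeros(3), trunc=0.02):
    """tsdf, weight: C-contiguous (X, Y, Z) arrays holding the running signed
    distance average and the per-voxel integration weights (updated in place).
    depth: (H, W) metric depth; mask: (H, W) bool for one object instance.
    K: 3x3 intrinsics; T_wc: 4x4 world-to-camera pose of this object."""
    X, Y, Z = tsdf.shape
    idx = np.stack(np.meshgrid(np.arange(X), np.arange(Y), np.arange(Z),
                               indexing="ij"), axis=-1).reshape(-1, 3)
    pts_w = origin + (idx + 0.5) * voxel_size            # voxel centers, world
    pts_c = (T_wc[:3, :3] @ pts_w.T + T_wc[:3, 3:4]).T   # camera frame
    z = pts_c[:, 2]
    uv = (K @ pts_c.T).T
    H, W = depth.shape
    valid = z > 1e-6
    u = np.zeros(z.shape, dtype=int)
    v = np.zeros(z.shape, dtype=int)
    u[valid] = np.round(uv[valid, 0] / z[valid]).astype(int)
    v[valid] = np.round(uv[valid, 1] / z[valid]).astype(int)
    valid &= (u >= 0) & (u < W) & (v >= 0) & (v < H)
    valid[valid] &= mask[v[valid], u[valid]]             # this object's pixels
    d = np.zeros_like(z)
    d[valid] = depth[v[valid], u[valid]]
    sdf = d - z                                          # + in front of surface
    upd = valid & (d > 0) & (sdf > -trunc)               # truncated update band
    new = np.clip(sdf / trunc, -1.0, 1.0)
    t, w = tsdf.reshape(-1), weight.reshape(-1)          # views into volumes
    t[upd] = (t[upd] * w[upd] + new[upd]) / (w[upd] + 1.0)
    w[upd] += 1.0
```

A full implementation would additionally accumulate per-voxel color and extract the final surface, for example with marching cubes; both are omitted here for brevity.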
In summary, the method for reconstructing a complex hand-object interaction scene provided by the embodiment of the present invention performs reconstruction with a single RGBD camera, so the system is simple; not only the motion of the human hand but also the geometry and motion of the objects are reconstructed, so the reconstructed information is complete; complex interactions involving both hands and multiple objects can be handled; and by combining human hand pose estimation, object recognition and segmentation, unified energy optimization, and multi-object reconstruction into one reconstruction scheme for the complex interaction process, complete three-dimensional information of the interaction process is finally obtained. For complex interactions between human hands and objects, the method can thus reconstruct the three-dimensional information of the interaction process from a sequence captured by a single RGBD camera, recover the pose motion of the human hand and the surfaces of the objects, and effectively address the high complexity and strong ambiguity involved in reconstructing complex interaction processes from a single RGBD camera.
The hand and object complex interaction scene reconstruction device proposed by the embodiment of the invention is described next with reference to the attached drawings.
Fig. 3 is a schematic structural diagram of a hand-object complex interaction scene reconstruction apparatus according to an embodiment of the present invention.
As shown in fig. 3, the hand and object complex interaction scene reconstruction apparatus 10 includes: an acquisition module 100, a prediction module 200, a segmentation module 300, and a fusion module 400.
The acquisition module 100 is configured to acquire an RGBD sequence of a hand-object interaction scene with a single RGBD camera to obtain an RGBD image. The prediction module 200 is configured to send the RGBD image to the gesture prediction neural network for prediction, so as to obtain left-hand pose prediction data and right-hand pose prediction data. The segmentation module 300 is configured to send the RGBD image to the segmentation recognition neural network, so as to obtain left-hand data, right-hand data, and segmentation data of different objects. The fusion module 400 is configured to fuse the segmented depth data and color data of the different objects into an object model to obtain a final object model. For complex interactions between human hands and objects, the device 10 provided by the embodiment of the present invention can reconstruct the three-dimensional information of the interaction process from a sequence captured by a single RGBD camera, recover the pose motion of the human hand and the surfaces of the objects, and effectively address the high complexity and strong ambiguity involved in reconstructing complex interaction processes from a single RGBD camera.
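Purely as a structural, non-limiting illustration of how the four modules could cooperate per frame (the class, method names, and data flow below are placeholders rather than a disclosed implementation):

```python
# Structural sketch of apparatus 10: four cooperating modules. The method
# names and per-frame flow are illustrative placeholders.
class HandObjectSceneReconstructor:
    def __init__(self, acquisition, prediction, segmentation, fusion):
        self.acquisition = acquisition    # module 100: capture + RGB-D alignment
        self.prediction = prediction      # module 200: left/right hand pose network
        self.segmentation = segmentation  # module 300: instance segmentation network
        self.fusion = fusion              # module 400: unified optimization + fusion

    def process_frame(self):
        color, depth = self.acquisition.next_frame()
        left_pose, right_pose = self.prediction.predict(color, depth)
        seg = self.segmentation.segment(color, depth)
        # Unified energy optimization and volumetric fusion produce the
        # updated hand poses and object models for this frame.
        return self.fusion.fuse(left_pose, right_pose, seg, depth, color)
```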
Further, in an embodiment of the present invention, the acquisition module 100 is further configured to align the RGB information and the D information.
Further, in an embodiment of the present invention, the prediction module 200 is further configured to use the open-source library OpenPose, or to train on the basis of that library, to obtain the left-hand pose prediction data and the right-hand pose prediction data.
Further, in an embodiment of the present invention, the segmentation module 300 is further configured to use Mask-RCNN, or to train on the basis of Mask-RCNN, to obtain the left-hand data, the right-hand data, and the segmentation data of the different objects.
Further, in an embodiment of the present invention, the fusion module 400 is further configured to perform a unified optimization motion solution according to the left-hand pose prediction data, the right-hand pose prediction data, the left-hand data, the right-hand data, and the segmentation data of the different objects, so as to obtain a final object surface reconstruction result.
It should be noted that the explanation of the foregoing embodiment of the method for reconstructing a complex interaction scene between a hand and an object also applies to the device for reconstructing a complex interaction scene between a hand and an object of this embodiment, and is not repeated here.
According to the device for reconstructing a complex hand-object interaction scene provided by the embodiment of the present invention, reconstruction is performed with a single RGBD camera, so the system is simple; not only the motion of the human hand but also the geometry and motion of the objects are reconstructed, so the reconstructed information is complete; complex interactions involving both hands and multiple objects can be handled; and by combining human hand pose estimation, object recognition and segmentation, unified energy optimization, and multi-object reconstruction into one reconstruction scheme for the complex interaction process, complete three-dimensional information of the interaction process is finally obtained. For complex interactions between human hands and objects, the device can thus reconstruct the three-dimensional information of the interaction process from a sequence captured by a single RGBD camera, recover the pose motion of the human hand and the surfaces of the objects, and effectively address the high complexity and strong ambiguity involved in reconstructing complex interaction processes from a single RGBD camera.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or N embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "N" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more N executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of implementing the embodiments of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or N wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the N steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A method for reconstructing a complex hand-object interaction scene, comprising the following steps:
collecting an RGBD sequence of a hand-object interaction scene with a single RGBD camera to obtain an RGBD image;
feeding the RGBD image into a gesture prediction neural network for prediction to obtain left-hand pose prediction data and right-hand pose prediction data;
feeding the RGBD image into a segmentation recognition neural network to obtain left-hand data, right-hand data, and segmentation data of different objects; and
fusing the segmented depth data and color data of the different objects into an object model to obtain a final object model.
2. The method according to claim 1, wherein collecting the RGBD sequence comprises:
aligning the RGB information and the D information.
3. The method according to claim 1, wherein feeding the RGBD image into the gesture prediction neural network for prediction comprises:
using the open-source library OpenPose, or training on the basis of the open-source library, to obtain the left-hand pose prediction data and the right-hand pose prediction data.
4. The method according to claim 1, wherein feeding the RGBD image into the segmentation recognition neural network comprises:
using Mask-RCNN, or training on the basis of Mask-RCNN, to obtain the left-hand data, the right-hand data, and the segmentation data of the different objects.
5. The method according to claim 1, wherein fusing the segmented depth data and color data of the different objects into the object model comprises:
performing a unified optimization motion solution according to the left-hand pose prediction data, the right-hand pose prediction data, the left-hand data, the right-hand data, and the segmentation data of the different objects to obtain a final object surface reconstruction result.
6. A device for reconstructing a complex hand-object interaction scene, comprising:
an acquisition module configured to collect an RGBD sequence of a hand-object interaction scene with a single RGBD camera to obtain an RGBD image;
a prediction module configured to feed the RGBD image into a gesture prediction neural network for prediction to obtain left-hand pose prediction data and right-hand pose prediction data;
a segmentation module configured to feed the RGBD image into a segmentation recognition neural network to obtain left-hand data, right-hand data, and segmentation data of different objects; and
a fusion module configured to fuse the segmented depth data and color data of the different objects into an object model to obtain a final object model.
7. The device according to claim 6, wherein the acquisition module is further configured to align the RGB information and the D information.
8. The device according to claim 6, wherein the prediction module is further configured to use the open-source library OpenPose, or to train on the basis of the open-source library, to obtain the left-hand pose prediction data and the right-hand pose prediction data.
9. The device according to claim 6, wherein the segmentation module is further configured to use Mask-RCNN, or to train on the basis of Mask-RCNN, to obtain the left-hand data, the right-hand data, and the segmentation data of the different objects.
10. The device according to claim 6, wherein the fusion module is further configured to perform a unified optimization motion solution according to the left-hand pose prediction data, the right-hand pose prediction data, the left-hand data, the right-hand data, and the segmentation data of the different objects to obtain the final object surface reconstruction result.
CN201911113777.8A 2019-11-14 2019-11-14 Hand and object complex interaction scene reconstruction method and device Pending CN111124107A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911113777.8A CN111124107A (en) 2019-11-14 2019-11-14 Hand and object complex interaction scene reconstruction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911113777.8A CN111124107A (en) 2019-11-14 2019-11-14 Hand and object complex interaction scene reconstruction method and device

Publications (1)

Publication Number Publication Date
CN111124107A 2020-05-08

Family

ID=70495647

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911113777.8A Pending CN111124107A (en) 2019-11-14 2019-11-14 Hand and object complex interaction scene reconstruction method and device

Country Status (1)

Country Link
CN (1) CN111124107A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150009415A1 (en) * 2013-07-04 2015-01-08 Canon Kabushiki Kaisha Projected user interface system for multiple users
CN107688391A (en) * 2017-09-01 2018-02-13 广州大学 A kind of gesture identification method and device based on monocular vision
CN109272513A (en) * 2018-09-30 2019-01-25 清华大学 Hand and object interactive segmentation method and device based on depth camera
CN109614882A (en) * 2018-11-19 2019-04-12 浙江大学 A violent behavior detection system and method based on human body pose estimation
CN109658412A (en) * 2018-11-30 2019-04-19 湖南视比特机器人有限公司 It is a kind of towards de-stacking sorting packing case quickly identify dividing method
CN110007754A (en) * 2019-03-06 2019-07-12 清华大学 The real-time reconstruction method and device of hand and object interactive process
CN110197156A (en) * 2019-05-30 2019-09-03 清华大学 Manpower movement and the shape similarity metric method and device of single image based on deep learning

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11335007B2 (en) * 2020-05-29 2022-05-17 Zebra Technologies Corporation Method to generate neural network training image annotations
CN112720504A (en) * 2021-01-20 2021-04-30 清华大学 Method and device for controlling learning of hand and object interactive motion from RGBD video

Similar Documents

Publication Publication Date Title
Kwon et al. H2o: Two hands manipulating objects for first person interaction recognition
Li et al. Connecting touch and vision via cross-modal prediction
Zhang et al. Adafuse: Adaptive multiview fusion for accurate human pose estimation in the wild
Zheng et al. Gimo: Gaze-informed human motion prediction in context
Jiang et al. Scaling up dynamic human-scene interaction modeling
Karunratanakul et al. A skeleton-driven neural occupancy representation for articulated hands
Sengan et al. Cost-effective and efficient 3D human model creation and re-identification application for human digital twins
Mei et al. Waymo open dataset: Panoramic video panoptic segmentation
Nazir et al. SemAttNet: Toward attention-based semantic aware guided depth completion
Chen et al. Mvhm: A large-scale multi-view hand mesh benchmark for accurate 3d hand pose estimation
Wang et al. Dual transfer learning for event-based end-task prediction via pluggable event to image translation
Reimat et al. Cwipc-sxr: Point cloud dynamic human dataset for social xr
Zanfir et al. Hum3dil: Semi-supervised multi-modal 3d humanpose estimation for autonomous driving
Zhou et al. Hemlets posh: Learning part-centric heatmap triplets for 3d human pose and shape estimation
Krejov et al. Combining discriminative and model based approaches for hand pose estimation
Li et al. MannequinChallenge: Learning the depths of moving people by watching frozen people
Jinka et al. Sharp: Shape-aware reconstruction of people in loose clothing
CN110007754B (en) Real-time reconstruction method and device for hand-object interaction process
Shimada et al. Decaf: Monocular deformation capture for face and hand interactions
CN111124107A (en) Hand and object complex interaction scene reconstruction method and device
Liu et al. Deep learning for 3d human pose estimation and mesh recovery: A survey
Choi et al. Handnerf: Learning to reconstruct hand-object interaction scene from a single rgb image
Kini et al. 3dmodt: Attention-guided affinities for joint detection & tracking in 3d point clouds
CN114973355A (en) Human face and mouth reconstruction method and device
KR101225644B1 (en) method for object recognition and pose estimation at robot

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200508)