CN115222809A - Target pose estimation method and device, computing equipment and storage medium - Google Patents

Target pose estimation method and device, computing equipment and storage medium

Info

Publication number
CN115222809A
Authority
CN
China
Prior art keywords
target
model
detection area
applying
pose estimation
Prior art date
Legal status
Granted
Application number
CN202110743454.8A
Other languages
Chinese (zh)
Other versions
CN115222809B (en)
Inventor
杨佳丽
杜国光
赵开勇
Current Assignee
Cloudminds Beijing Technologies Co Ltd
Original Assignee
Cloudminds Beijing Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Cloudminds Beijing Technologies Co Ltd
Priority to CN202110743454.8A (granted as CN115222809B)
Priority to PCT/CN2021/143442 (WO2023273272A1)
Publication of CN115222809A
Application granted
Publication of CN115222809B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30244 Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention relates to the technical field of computer vision, and discloses a target pose estimation method and apparatus, a computing device and a storage medium, wherein the method comprises the following steps: performing 2D detection according to the RGB image and the depth image to obtain a detection area of a target; obtaining a normalized model of the target from the RGB image in the detection area; acquiring size information of the target according to the depth image in the detection area; and fusing the size information and the normalized model to obtain a 3D model, and obtaining the pose information of the target by applying a PnP algorithm according to the 3D model. In this way, the embodiment of the invention can accurately acquire the pose information of the target object, which facilitates grasping the target object and improves the user experience.

Description

Target pose estimation method and device, computing equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of computer vision, in particular to a target pose estimation method, a target pose estimation device, a target pose estimation computing device and a storage medium.
Background
In addition to perceiving the surrounding world, intelligent robots must also be able to interact with the environment, and grasping is an indispensable capability for doing so. Robotic grasping has great application value in both industrial and household scenes, and the pose estimation of the object to be grasped is an important factor influencing grasping success. Existing pose estimation methods are generally classified into feature matching methods, template methods and deep learning-based methods. A feature matching method generally computes and matches feature points between a 3D model and a 2D image, and then calculates the pose using a PnP method. A template method generally models the 3D model of the object to be grasped from various viewing angles and estimates the pose by matching the acquired image against the templates. A deep learning-based method generally requires first acquiring a large number of color images and depth images of the object to be grasped in various pose states to create a data set, and then directly or indirectly estimating the pose of the object to be grasped by training a convolutional neural network.
However, current algorithms still have defects when grasping real objects. Feature matching methods usually require a large amount of computation, so the algorithm running time is long; moreover, the accuracy of pose estimation depends directly on the success or failure of feature point selection and matching, and accurate, stable results cannot be obtained for objects with few feature points. Template matching-based methods usually need a large number of templates to be made, and because pose estimation is essentially a regression problem, the accuracy of the algorithm is usually proportional to the number of templates, so a balance is difficult to achieve. Deep learning-based methods directly regress the pose of an object through a convolutional neural network, but most existing deep learning methods operate at the instance level and have poor generalization capability.
Disclosure of Invention
In view of the foregoing problems, embodiments of the present invention provide a target pose estimation method, apparatus, computing device and storage medium, which overcome or at least partially solve the above problems.
According to an aspect of an embodiment of the present invention, there is provided a target pose estimation method, including: performing 2D detection according to the RGB image and the depth image to obtain a detection area of a target; acquiring the RGB image in the detection area to obtain a normalized model of the target; acquiring size information of the target according to the depth image in the detection area; and fusing the size information and the normalized model to obtain a 3D model, and obtaining the pose information of the target by applying a PnP algorithm according to the 3D model.
In an optional manner, the performing 2D detection according to the RGB image and the depth image to obtain a detection area of the target includes: processing the RGB image by applying a pre-constructed first convolution neural network to obtain the detection area of the target in the RGB image; and acquiring the detection area of the target corresponding to the same RGB image in the depth image.
In an alternative mode, the obtaining of a normalized model map of the target from the RGB image in the detection area includes: processing the RGB image in the detection area by applying a first network structure to obtain the normalized model map of the target.
In an optional manner, the processing the RGB image in the detection area by using the first network structure to obtain the normalized model of the target includes:
applying a plurality of groups of convolution + downsampling combinations to downsample the RGB image in the detection area, and then performing a convolution operation on the lowest-resolution feature map; and restoring the RGB image in the detection area to its original resolution by applying a plurality of groups of upsampling + convolution combinations, and performing a preset number of convolution operations to obtain the normalized model of the target.
In an alternative mode, the obtaining size information of the target according to the depth image in the detection area includes: converting the depth image within the detection area to a point cloud; and processing the point cloud by applying a second network structure to acquire the size information of the target.
In an alternative manner, the fusing the size information with the normalized model to obtain a 3D model includes: calculating the 3D model from the dimensional information and the normalized model using the following relationship:
x’=x×w,
y’=y×l,
z’=z×h,
wherein (x, y, z) is a coordinate of the normalized model, (x', y', z') is a coordinate of the 3D model, (w, l, h) is size information of the object, and w, l, h respectively represent a width, a length, and a height of the object.
In an optional manner, the obtaining pose information of the target by applying the PnP algorithm according to the 3D model includes: and matching the coordinates of the 3D model with the coordinates of the 2D image by applying a PnP algorithm to acquire the pose information of the target.
According to another aspect of the embodiments of the present invention, there is provided an object pose estimation apparatus including: the 2D detection unit is used for carrying out 2D detection according to the RGB image and the depth image to obtain a detection area of a target;
the normalization unit is used for obtaining a normalized model of the target from the RGB image in the detection area; a size acquisition unit for acquiring size information of the target according to the depth image in the detection area; and the pose estimation unit is used for fusing the size information and the normalized model to obtain a 3D model and obtaining pose information of the target by applying a PnP algorithm according to the 3D model.
According to another aspect of embodiments of the present invention, there is provided a computing device including: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface are communicated with each other through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the steps of the target pose estimation method.
According to still another aspect of the embodiments of the present invention, there is provided a computer storage medium having at least one executable instruction stored therein, the executable instruction causing the processor to execute the steps of the above-mentioned target pose estimation method.
The target pose estimation method provided by the embodiment of the invention comprises the following steps: performing 2D detection according to the RGB image and the depth image to obtain a detection area of a target; acquiring the RGB image in the detection area to obtain a normalized model of the target; acquiring size information of the target according to the depth image in the detection area; and fusing the size information and the normalized model to obtain a 3D model, and obtaining the pose information of the target by applying a PnP algorithm according to the 3D model, so that the pose information of the target object can be accurately obtained, the target object can be conveniently grabbed, and the user experience is improved.
The foregoing description is only an overview of the technical solutions of the embodiments of the present invention, and the embodiments of the present invention can be implemented according to the content of the description in order to make the technical means of the embodiments of the present invention more clearly understood, and the detailed description of the present invention is provided below in order to make the foregoing and other objects, features, and advantages of the embodiments of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart of a target pose estimation method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a first convolutional neural network in a target pose estimation method according to an embodiment of the present invention;
fig. 3 is a schematic diagram illustrating a first network structure in the target pose estimation method according to the embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating size information acquisition in a target pose estimation method according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an object pose estimation apparatus provided by an embodiment of the present invention;
fig. 6 shows a schematic structural diagram of a computing device provided by an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Fig. 1 shows a schematic flow chart of a target pose estimation method provided by an embodiment of the present invention, and as shown in fig. 1, the target pose estimation method includes:
step S11: and carrying out 2D detection according to the RGB image and the depth image to obtain a detection area of the target.
In the embodiment of the present invention, optionally, a pre-constructed first convolutional neural network is applied to process the RGB image to obtain the detection area of the target in the RGB image; and the detection area of the target corresponding to the same region of the RGB image is acquired in the depth image. The first convolutional neural network is not limited to a specific detection or segmentation method; the detection area is the specific region of the target (the object to be grasped) in the image, which reduces background interference factors for the subsequent pose estimation.
Before the first convolutional neural network is applied, it needs to be constructed. First, a data set is constructed: RGB images of the object to be grasped are collected under different environment backgrounds, and an optimal bounding box (x, y, w, h) and an object type id are annotated for each RGB image. Second, a convolutional neural network (CNN) is trained on this large amount of RGB image data to obtain the first convolutional neural network model. The network structure of the first convolutional neural network is shown in fig. 2; the number of network layers is 31, and the image is scaled to a block of 448x448 pixels as the network input.
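As an illustration only, the following minimal sketch uses an off-the-shelf torchvision detector as a stand-in for the patent's own 31-layer first convolutional neural network (which takes 448x448 inputs); the score threshold and the helper names detect_target and crop_detection_area are assumptions for this example, not names defined by the patent.

```python
# Hedged sketch of step S11: 2D detection and cropping the same region
# from the RGB image and the depth image. The detector is a stand-in.
import torch
import torchvision


def detect_target(rgb, score_thresh=0.5):
    """Return the highest-scoring box (x1, y1, x2, y2) in an RGB image (H x W x 3, uint8)."""
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()
    tensor = torch.from_numpy(rgb).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        pred = model([tensor])[0]          # detections are sorted by score
    keep = pred["scores"] > score_thresh
    if keep.sum() == 0:
        return None
    x1, y1, x2, y2 = pred["boxes"][keep][0].round().int().tolist()
    return x1, y1, x2, y2


def crop_detection_area(rgb, depth, box):
    """Crop the same detection area from the RGB image and from the depth image."""
    x1, y1, x2, y2 = box
    return rgb[y1:y2, x1:x2], depth[y1:y2, x1:x2]
```

The patent's own network would instead be trained on the annotated bounding boxes described above and would receive the 448x448 scaled image block as input.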
Step S12: and acquiring the RGB image in the detection area to obtain a normalized model of the target.
In this embodiment of the present invention, optionally, a first network structure is applied to process the RGB image in the detection area to obtain a normalized model map of the target. The specific structure of the first network structure is shown in fig. 3: a plurality of groups of convolution + downsampling combinations are applied to downsample the RGB image in the detection area, a convolution operation is then performed on the lowest-resolution feature map, a plurality of groups of upsampling + convolution combinations restore the feature map to the original resolution, and a preset number of convolution operations produce the normalized model of the target. Preferably, after 4 groups of convolution + downsampling are applied to the RGB image in the detection area, one convolution operation is performed on the lowest-resolution feature map, 4 groups of upsampling + convolution are applied, and two consecutive convolution operations then output the normalized model of the target. The embodiment of the invention regresses the normalized model map with a U-Net-style network structure on the basis of the 2D detection, which can greatly improve the accuracy of the algorithm.
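A minimal PyTorch sketch of the first network structure as described (4 groups of convolution + downsampling, one convolution at the lowest resolution, 4 groups of upsampling + convolution, then two final convolutions outputting the normalized model map) is given below. The channel widths, kernel sizes and the sigmoid output range are assumptions, and the skip connections of a full U-Net are omitted for brevity.

```python
import torch
import torch.nn as nn


class NormalizedModelNet(nn.Module):
    """Sketch of the first network structure: encoder-decoder regressing a 3-channel normalized model map."""

    def __init__(self, widths=(32, 64, 128, 256)):
        super().__init__()
        self.downs = nn.ModuleList()
        in_ch = 3
        for w in widths:                          # 4 groups of convolution + downsampling
            self.downs.append(nn.Sequential(
                nn.Conv2d(in_ch, w, 3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool2d(2)))
            in_ch = w
        self.bottleneck = nn.Sequential(          # one convolution at the lowest resolution
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.ups = nn.ModuleList()
        for w in reversed(widths):                # 4 groups of upsampling + convolution
            self.ups.append(nn.Sequential(
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(in_ch, w, 3, padding=1), nn.ReLU(inplace=True)))
            in_ch = w
        self.head = nn.Sequential(                # two final convolutions -> normalized (x, y, z) per pixel
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, 3, 1), nn.Sigmoid())

    def forward(self, x):
        for down in self.downs:
            x = down(x)
        x = self.bottleneck(x)
        for up in self.ups:
            x = up(x)
        return self.head(x)                       # values in [0, 1]: one normalized coordinate per pixel
```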
Step S13: and acquiring the size information of the target according to the depth image in the detection area.
Optionally, as shown in fig. 4, the depth image within the detection area is converted to a point cloud. The following conversion formula is specifically applied for conversion:
X = (x' - cx) × D / fx,
Y = (y' - cy) × D / fy,
Z = D,
wherein (X, Y, Z) is a point cloud coordinate, (x', y') is an image coordinate, D is the depth value, fx and fy are the focal lengths, and cx, cy are the principal point offsets.
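A sketch of this back-projection in NumPy follows, assuming the depth image is stored in millimeters (the depth_scale factor and the function name are assumptions) and that the camera intrinsics fx, fy, cx, cy are known.

```python
import numpy as np


def depth_to_point_cloud(depth, fx, fy, cx, cy, depth_scale=0.001):
    """Back-project a depth image (H x W) into an N x 3 point cloud in meters."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # image coordinates (x', y')
    d = depth.astype(np.float32) * depth_scale       # depth value D
    X = (u - cx) * d / fx
    Y = (v - cy) * d / fy
    Z = d
    points = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # drop pixels with no valid depth
```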
Then, a second network structure is applied to process the point cloud to acquire the size information of the target. The second network structure preferably consists of a PointNet++ network followed by a convolution layer and a fully connected layer: point features are extracted by the PointNet++ network, and the size of the object, represented as S(w, l, h), is then recovered by the additional convolution layer and fully connected layer.
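The following is a simplified sketch of such a size-regression head. A PointNet-style shared MLP with global max pooling stands in for the PointNet++ backbone (an assumption made to keep the example self-contained), followed by the convolution layer and fully connected layer described above; all layer sizes are assumptions.

```python
import torch
import torch.nn as nn


class SizeNet(nn.Module):
    """Sketch of the second network structure: point features -> conv layer -> FC -> size (w, l, h)."""

    def __init__(self):
        super().__init__()
        self.point_mlp = nn.Sequential(           # per-point feature extraction (PointNet++ stand-in)
            nn.Conv1d(3, 64, 1), nn.ReLU(inplace=True),
            nn.Conv1d(64, 128, 1), nn.ReLU(inplace=True))
        self.conv = nn.Sequential(                # convolution layer on the pooled global feature
            nn.Conv1d(128, 256, 1), nn.ReLU(inplace=True))
        self.fc = nn.Linear(256, 3)               # fully connected layer -> (w, l, h)

    def forward(self, points):                    # points: (B, N, 3)
        x = self.point_mlp(points.transpose(1, 2))        # (B, 128, N)
        x = torch.max(x, dim=2, keepdim=True).values      # global max pooling -> (B, 128, 1)
        x = self.conv(x).squeeze(-1)                      # (B, 256)
        return self.fc(x)                                 # predicted size S(w, l, h)
```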
Step S14: and fusing the size information and the normalized model to obtain a 3D model, and obtaining the pose information of the target by applying a PnP algorithm according to the 3D model.
In the embodiment of the invention, the normalized model and the object size information are fused, so that the complete 3D information of the target (the object to be grabbed) can be obtained. Optionally, the 3D model is calculated from the size information and the normalized model applying the following relation:
x’=x×w,
y’=y×l,
z’=z×h,
wherein (x, y, z) is a coordinate of the normalized model, (x', y', z') is a coordinate of the 3D model, (w, l, h) is size information of the object, and w, l, h respectively represent a width, a length, and a height of the object.
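A one-function sketch of this fusion step, taking the flattened normalized model coordinates and the predicted size (the function name is illustrative):

```python
import numpy as np


def fuse_size_and_normalized_model(nocs_xyz, size_wlh):
    """nocs_xyz: (N, 3) normalized coordinates (x, y, z); size_wlh: (w, l, h)."""
    w, l, h = size_wlh
    return nocs_xyz * np.array([w, l, h], dtype=np.float32)   # (x', y', z') = (x*w, y*l, z*h)
```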
Then, a PnP algorithm is applied to match the coordinates of the 3D model with the coordinates of the 2D image to acquire the pose information of the target. The pose information of the target includes a rotation matrix R and a translation matrix T. The PnP algorithm may be any existing PnP algorithm that can implement the above function, and will not be described here again. In the embodiment of the invention, the size information is recovered from the depth map through PointNet++, which adds prior information for the post-processing algorithm, so that higher-precision results can be obtained.
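As one possible implementation of this step (the patent does not fix a particular PnP variant), OpenCV's solvePnP can be used to solve for the rotation and translation from the 3D-2D correspondences; the EPnP flag and the function name estimate_pose are choices made for this sketch.

```python
import cv2
import numpy as np


def estimate_pose(object_points, image_points, fx, fy, cx, cy):
    """Return the rotation matrix R and translation vector T of the target."""
    K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]], dtype=np.float64)
    ok, rvec, tvec = cv2.solvePnP(
        object_points.astype(np.float64),
        image_points.astype(np.float64),
        K, distCoeffs=None, flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        raise RuntimeError("PnP failed")
    R, _ = cv2.Rodrigues(rvec)    # convert the rotation vector to a rotation matrix
    return R, tvec
```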
In summary, the method obtains the object class, the segmentation result and the normalized model map from the RGB image through a convolutional neural network, obtains the object size information from the depth image and the segmentation result, obtains a 3D model by fusing the size information with the normalized model map, and finally obtains the pose information through PnP, where T(x, y, z) represents the position in three-dimensional space and the rotation matrix R represents the three-axis rotation in three-dimensional space. Using the normalized model map addresses the problems that objects of the same kind vary in size and that the exact size cannot be obtained from the image because of camera scaling, and supplementing it with the size recovered from the depth map addresses the problem that most current deep learning methods operate only at the instance level.
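Tying the steps together, a high-level sketch of the whole pipeline might look as follows. The helper functions and class names are the illustrative sketches given above rather than names defined by the patent, and constructing the 2D image points as the pixel coordinates of the detection area is one plausible reading of the correspondence step.

```python
import numpy as np
import torch


def estimate_target_pose(rgb, depth, fx, fy, cx, cy, nocs_net, size_net):
    # Step S11: 2D detection, then crop the same region from RGB and depth.
    box = detect_target(rgb)
    rgb_roi, depth_roi = crop_detection_area(rgb, depth, box)
    # Step S12: regress the normalized model map (in practice the crop would
    # first be resized to the network's fixed input resolution).
    roi = torch.from_numpy(rgb_roi).permute(2, 0, 1)[None].float() / 255.0
    nocs_map = nocs_net(roi)[0].permute(1, 2, 0).detach().numpy()    # (h, w, 3)
    # Step S13: size information from the depth image in the detection area
    # (the crop offset only translates the cloud; it does not affect the size).
    points = depth_to_point_cloud(depth_roi, fx, fy, cx, cy)
    size = size_net(torch.from_numpy(points).float()[None])[0].detach().numpy()
    # Step S14: fuse size and normalized model into 3D points, then solve PnP.
    h, w, _ = nocs_map.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    image_points = np.stack([u + box[0], v + box[1]], axis=-1).reshape(-1, 2)
    object_points = fuse_size_and_normalized_model(nocs_map.reshape(-1, 3), size)
    return estimate_pose(object_points, image_points, fx, fy, cx, cy)  # (R, T)
```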
The following description exemplifies the steps of applying the target pose estimation method of the embodiment of the present invention to a robot:
1) Preparing robot equipment which comprises a robot base, a mechanical arm, a depth camera and the like;
2) Placing an object on a desktop in front of a mechanical arm, and collecting an RGB (red, green and blue) image and a Depth image at the current position;
3) Aiming at the RGB image of the target object, obtaining the region of the object to be grabbed under the current grabbing visual angle by using a target detection method;
4) Using the normalized model generation network, generate a standard normalized model map of the object to be grasped;
5) Calculating size information of the object to be grabbed by using a size estimation network;
6) Fusing the size information and the normalized model graph, and calculating the position and orientation information of the object to be grabbed by using a PnP algorithm;
7) According to the pose, the robot arm performs the grasp.
The target pose estimation method provided by the embodiment of the invention comprises the following steps: performing 2D detection according to the RGB image and the depth image to obtain a detection area of a target; acquiring a normalization model of the target from the RGB image in the detection area; acquiring size information of the target according to the depth image in the detection area; and fusing the size information and the normalized model to obtain a 3D model, and obtaining the pose information of the target by applying a PnP algorithm according to the 3D model, so that the pose information of the target object can be accurately obtained, the target object can be conveniently grabbed, and the user experience is improved.
Fig. 5 is a schematic structural view of the target pose estimation apparatus according to the embodiment of the present invention, and as shown in fig. 5, the target pose estimation apparatus includes: a 2D detection unit 501, a normalization unit 502, a size acquisition unit 503, and a pose estimation unit 504.
The 2D detection unit 501 is configured to perform 2D detection according to the RGB image and the depth image to obtain a detection area of a target; the normalization unit 502 is configured to obtain a normalized model of the target from the RGB image in the detection area; the size obtaining unit 503 is configured to obtain size information of the target according to the depth image in the detection area; the pose estimation unit 504 is configured to fuse the size information and the normalized model to obtain a 3D model, and obtain pose information of the target by applying a PnP algorithm according to the 3D model.
In an alternative manner, the 2D detection unit 501 is configured to: processing the RGB image by applying a pre-constructed first convolution neural network to obtain the detection area of the target in the RGB image; and acquiring the detection area of the target corresponding to the same RGB image in the depth image.
In an optional manner, the normalization unit 502 is configured to: and processing the RGB image in the detection area by applying a first network structure to obtain a normalized model graph of the target.
In an optional manner, the normalization unit 502 is configured to: applying a plurality of groups of convolution + downsampling combinations to carry out downsampling on the RGB image in the detection area, and then carrying out convolution operation on the characteristic diagram with the lowest resolution; and restoring the resolution of the RGB image in the detection area after operation to the original size by applying a plurality of groups of up-sampling + convolution combinations, and performing a preset number of convolution operations to obtain a normalization model of the target.
In an alternative manner, the size obtaining unit 503 is configured to: converting the depth image within the detection area to a point cloud; and processing the point cloud by applying a second network structure to acquire the size information of the target.
In an optional manner, the pose estimation unit 504 is configured to: calculating the 3D model from the dimensional information and the normalized model using the following relationship:
x’=x×w,
y’=y×l,
z’=z×h,
wherein (x, y, z) is a coordinate of the normalized model, (x', y', z') is a coordinate of the 3D model, (w, l, h) is size information of the object, and w, l, h respectively represent a width, a length, and a height of the object.
In an alternative manner, the pose estimation unit 504 is configured to: and matching the coordinates of the 3D model with the coordinates of the 2D image by applying a PnP algorithm to acquire the pose information of the target.
The target pose estimation method provided by the embodiment of the invention comprises the following steps: performing 2D detection according to the RGB image and the depth image to obtain a detection area of a target; acquiring a normalization model of the target from the RGB image in the detection area; acquiring size information of the target according to the depth image in the detection area; and fusing the size information and the normalized model to obtain a 3D model, and obtaining the pose information of the target by applying a PnP algorithm according to the 3D model, so that the pose information of the target object can be accurately obtained, the target object can be conveniently grabbed, and the user experience is improved.
An embodiment of the present invention provides a non-volatile computer storage medium, where at least one executable instruction is stored in the computer storage medium, and the computer executable instruction may execute the target pose estimation method in any of the above method embodiments.
The executable instructions may be specifically configured to cause the processor to perform the following operations:
performing 2D detection according to the RGB image and the depth image to obtain a detection area of a target;
acquiring a normalization model of the target from the RGB image in the detection area;
acquiring size information of the target according to the depth image in the detection area;
and fusing the size information and the normalized model to obtain a 3D model, and obtaining the pose information of the target by applying a PnP algorithm according to the 3D model.
In an alternative, the executable instructions cause the processor to:
processing the RGB image by applying a pre-constructed first convolution neural network to obtain the detection area of the target in the RGB image;
and acquiring the detection area of the target corresponding to the same RGB image in the depth image.
In an alternative, the executable instructions cause the processor to:
and processing the RGB image in the detection area by applying a first network structure to obtain a normalized model diagram of the target.
In an alternative, the executable instructions cause the processor to:
applying a plurality of groups of convolution + downsampling combinations to carry out downsampling on the RGB image in the detection area, and then carrying out convolution operation on the characteristic diagram with the lowest resolution;
and restoring the RGB image in the detection area to its original resolution by applying a plurality of groups of upsampling + convolution combinations, and performing a preset number of convolution operations to obtain the normalized model of the target.
In an alternative, the executable instructions cause the processor to:
converting the depth image within the detection area to a point cloud;
and processing the point cloud by applying a second network structure to acquire the size information of the target.
In an alternative, the executable instructions cause the processor to:
calculating the 3D model from the dimensional information and the normalized model using the following relation:
x’=x×w,
y’=y×l,
z’=z×h,
wherein (x, y, z) is a coordinate of the normalized model, (x', y', z') is a coordinate of the 3D model, (w, l, h) is size information of the object, and w, l, h respectively represent a width, a length, and a height of the object.
In an alternative, the executable instructions cause the processor to:
and matching the coordinates of the 3D model with the coordinates of the 2D image by applying a PnP algorithm to acquire the pose information of the target.
The target pose estimation method provided by the embodiment of the invention comprises the following steps: performing 2D detection according to the RGB image and the depth image to obtain a detection area of a target; acquiring the RGB image in the detection area to obtain a normalized model of the target; acquiring size information of the target according to the depth image in the detection area; and fusing the size information and the normalized model to obtain a 3D model, and obtaining the pose information of the target by applying a PnP algorithm according to the 3D model, so that the pose information of the target object can be accurately obtained, the target object can be conveniently grabbed, and the user experience is improved.
Fig. 6 shows a schematic structural diagram of an embodiment of the apparatus according to the present invention, and the specific embodiment of the present invention does not limit the specific implementation of the apparatus.
As shown in fig. 6, the apparatus may include: a processor (processor) 602, a communication Interface 604, a memory 606, and a communication bus 608.
Wherein: the processor 602, communication interface 604, and memory 606 communicate with one another via a communication bus 608. A communication interface 604 for communicating with network elements of other devices, such as clients or other servers. The processor 602 is configured to execute the program 610, and may specifically perform the relevant steps in the above-described embodiment of the target pose estimation method.
In particular, program 610 may include program code comprising computer operating instructions.
The processor 602 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The device includes one or more processors, which may be the same type of processor, such as one or more CPUs, or different types of processors, such as one or more CPUs and one or more ASICs.
And a memory 606 for storing a program 610. Memory 606 may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The program 610 may specifically be configured to cause the processor 602 to perform the following operations:
performing 2D detection according to the RGB image and the depth image to obtain a detection area of a target;
acquiring the RGB image in the detection area to obtain a normalized model of the target;
acquiring size information of the target according to the depth image in the detection area;
and fusing the size information and the normalized model to obtain a 3D model, and obtaining the pose information of the target by applying a PnP algorithm according to the 3D model.
In an alternative, the program 610 causes the processor to:
processing the RGB image by applying a pre-constructed first convolution neural network to obtain the detection area of the target in the RGB image;
and acquiring the detection area of the target corresponding to the same RGB image in the depth image.
In an alternative, the program 610 causes the processor to:
and processing the RGB image in the detection area by applying a first network structure to obtain a normalized model diagram of the target.
In an alternative, the program 610 causes the processor to:
applying a plurality of groups of convolution + downsampling combinations to carry out downsampling on the RGB image in the detection area, and then carrying out convolution operation on the characteristic diagram with the lowest resolution;
and restoring the resolution of the RGB image in the detection area after operation to the original size by applying a plurality of groups of up-sampling + convolution combinations, and performing a preset number of convolution operations to obtain a normalization model of the target.
In an alternative, the program 610 causes the processor to:
converting the depth image within the detection area to a point cloud;
and processing the point cloud by applying a second network structure to acquire the size information of the target.
In an alternative, the program 610 causes the processor to:
calculating the 3D model from the dimensional information and the normalized model using the following relationship:
x’=x×w,
y’=y×l,
z’=z×h,
wherein (x, y, z) is coordinates of the normalized model, (x', y', z') is coordinates of the 3D model, (w, l, h) is size information of the object, and w, l, h respectively represent width, length, and height of the object.
In an alternative, the program 610 causes the processor to:
and matching the coordinates of the 3D model with the coordinates of the 2D image by applying a PnP algorithm to acquire the pose information of the target.
The target pose estimation method provided by the embodiment of the invention comprises the following steps: performing 2D detection according to the RGB image and the depth image to obtain a detection area of a target; acquiring the RGB image in the detection area to obtain a normalized model of the target; acquiring size information of the target according to the depth image in the detection area; and fusing the size information and the normalized model to obtain a 3D model, and obtaining the pose information of the target by applying a PnP algorithm according to the 3D model, so that the pose information of the target object can be accurately obtained, the target object can be conveniently grabbed, and the user experience is improved.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the embodiments are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means can be embodied by one and the same item of hardware. The usage of the words first, second, third, etc. does not indicate any ordering; these words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specified otherwise.

Claims (10)

1. A method of target pose estimation, the method comprising:
performing 2D detection according to the RGB image and the depth image to obtain a detection area of a target;
acquiring the RGB image in the detection area to obtain a normalized model of the target;
acquiring size information of the target according to the depth image in the detection area;
and fusing the size information and the normalized model to obtain a 3D model, and obtaining the pose information of the target by applying a PnP algorithm according to the 3D model.
2. The object pose estimation method according to claim 1, wherein the performing 2D detection based on the RGB image and the depth image to obtain a detection area of the object comprises:
processing the RGB image by applying a pre-constructed first convolution neural network to obtain the detection area of the target in the RGB image;
and acquiring the detection area of the target corresponding to the same RGB image in the depth image.
3. The object pose estimation method according to claim 1, wherein the obtaining of a normalized model map of the object from the RGB images within the detection area comprises:
and processing the RGB image in the detection area by applying a first network structure to obtain a normalized model diagram of the target.
4. The object pose estimation method according to claim 3, wherein the applying a first network structure to the RGB images in the detection area to obtain a normalized model of the object comprises:
applying a plurality of groups of convolution + downsampling combinations to downsample the RGB image in the detection area, and then performing convolution operation on the feature map with the lowest resolution;
and restoring the RGB image in the detection area to its original resolution by applying a plurality of groups of upsampling + convolution combinations, and performing a preset number of convolution operations to obtain a normalized model of the target.
5. The object pose estimation method according to claim 1, wherein the acquiring size information of the object from the depth image in the detection area includes:
converting the depth image within the detection area to a point cloud;
and processing the point cloud by applying a second network structure to acquire the size information of the target.
6. The object pose estimation method according to claim 1, wherein the fusing the size information with the normalized model to obtain a 3D model includes:
calculating the 3D model from the dimensional information and the normalized model using the following relationship:
x’=x×w,
y’=y×l,
z’=z×h,
wherein (x, y, z) is coordinates of the normalized model, (x', y', z') is coordinates of the 3D model, (w, l, h) is size information of the object, and w, l, h respectively represent width, length, and height of the object.
7. The pose estimation method of an object according to claim 6, wherein the applying the PnP algorithm to obtain pose information of the object based on the 3D model comprises:
and matching the coordinates of the 3D model with the coordinates of the 2D image by applying a PnP algorithm to acquire the pose information of the target.
8. An object pose estimation apparatus, characterized in that the apparatus comprises:
the 2D detection unit is used for carrying out 2D detection according to the RGB image and the depth image to obtain a detection area of a target;
the normalization unit is used for obtaining a normalized model of the target from the RGB image in the detection area;
a size acquisition unit for acquiring size information of the target according to the depth image in the detection area;
and the pose estimation unit is used for fusing the size information and the normalized model to obtain a 3D model and obtaining pose information of the target by applying a PnP algorithm according to the 3D model.
9. A computing device, comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform the steps of the object pose estimation method according to any one of claims 1-7.
10. A computer storage medium having stored therein at least one executable instruction that causes a processor to perform the steps of the object pose estimation method according to any one of claims 1-7.
CN202110743454.8A 2021-06-30 2021-06-30 Target pose estimation method, device, computing equipment and storage medium Active CN115222809B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110743454.8A CN115222809B (en) 2021-06-30 2021-06-30 Target pose estimation method, device, computing equipment and storage medium
PCT/CN2021/143442 WO2023273272A1 (en) 2021-06-30 2021-12-30 Target pose estimation method and apparatus, computing device, storage medium, and computer program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110743454.8A CN115222809B (en) 2021-06-30 2021-06-30 Target pose estimation method, device, computing equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115222809A (en) 2022-10-21
CN115222809B (en) 2023-04-25

Family

ID=83606059

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110743454.8A Active CN115222809B (en) 2021-06-30 2021-06-30 Target pose estimation method, device, computing equipment and storage medium

Country Status (2)

Country Link
CN (1) CN115222809B (en)
WO (1) WO2023273272A1 (en)


Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10699438B2 (en) * 2017-07-06 2020-06-30 Siemens Healthcare Gmbh Mobile device localization in complex, three-dimensional scenes
CN108171748B (en) * 2018-01-23 2021-12-07 哈工大机器人(合肥)国际创新研究院 Visual identification and positioning method for intelligent robot grabbing application
CN108555908B (en) * 2018-04-12 2020-07-28 同济大学 Stacked workpiece posture recognition and pickup method based on RGBD camera
CN109255813B (en) * 2018-09-06 2021-03-26 大连理工大学 Man-machine cooperation oriented hand-held object pose real-time detection method
CN110322512A (en) * 2019-06-28 2019-10-11 中国科学院自动化研究所 In conjunction with the segmentation of small sample example and three-dimensional matched object pose estimation method
CN110793441B (en) * 2019-11-05 2021-07-27 北京华捷艾米科技有限公司 High-precision object geometric dimension measuring method and device
CN112562001B (en) * 2020-12-28 2023-07-21 中山大学 Object 6D pose estimation method, device, equipment and medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111055281A (en) * 2019-12-19 2020-04-24 杭州电子科技大学 ROS-based autonomous mobile grabbing system and method
CN111968235A (en) * 2020-07-08 2020-11-20 杭州易现先进科技有限公司 Object attitude estimation method, device and system and computer equipment
CN112270249A (en) * 2020-10-26 2021-01-26 湖南大学 Target pose estimation method fusing RGB-D visual features
CN112233181A (en) * 2020-10-29 2021-01-15 深圳市广宁股份有限公司 6D pose recognition method and device and computer storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG HE et al.: "Normalized Object Coordinate Space for Category-Level 6D Object Pose and Size Estimation", 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition *

Also Published As

Publication number Publication date
WO2023273272A1 (en) 2023-01-05
CN115222809B (en) 2023-04-25

Similar Documents

Publication Publication Date Title
CN110135455B (en) Image matching method, device and computer readable storage medium
CN111862201B (en) Deep learning-based spatial non-cooperative target relative pose estimation method
CN110176032B (en) Three-dimensional reconstruction method and device
Azad et al. Stereo-based 6d object localization for grasping with humanoid robot systems
CN113223091B (en) Three-dimensional target detection method, three-dimensional target capture device and electronic equipment
CN111079565B (en) Construction method and identification method of view two-dimensional attitude template and positioning grabbing system
CN111401266B (en) Method, equipment, computer equipment and readable storage medium for positioning picture corner points
CN112651881B (en) Image synthesizing method, apparatus, device, storage medium, and program product
CN110097599B (en) Workpiece pose estimation method based on component model expression
JPH0773344A (en) Method and apparatus for three- dimensional point in two-dimensional graphic display
CN114022542A (en) Three-dimensional reconstruction-based 3D database manufacturing method
CN112184815A (en) Method and device for determining position and posture of panoramic image in three-dimensional model
US11189053B2 (en) Information processing apparatus, method of controlling information processing apparatus, and non-transitory computer-readable storage medium
CN115008454A (en) Robot online hand-eye calibration method based on multi-frame pseudo label data enhancement
JP2010205095A (en) Three-dimensional object recognition device, three-dimensional object recognition program, and computer readable recording medium having program recorded therein
CN115063485B (en) Three-dimensional reconstruction method, device and computer-readable storage medium
CN116469101A (en) Data labeling method, device, electronic equipment and storage medium
JP6198104B2 (en) 3D object recognition apparatus and 3D object recognition method
CN115222809B (en) Target pose estimation method, device, computing equipment and storage medium
CN115713547A (en) Motion trail generation method and device and processing equipment
CN115222810A (en) Target pose estimation method and device, computing equipment and storage medium
CN112634439A (en) 3D information display method and device
CN112652056A (en) 3D information display method and device
CN117095131B (en) Three-dimensional reconstruction method, equipment and storage medium for object motion key points
CN117011474B (en) Fisheye image sample generation method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant