CN117745825A - Monocular object-level pose estimation method based on cascade-refined correspondence maps - Google Patents

Monocular object-level pose estimation method based on cascade-refined correspondence maps

Info

Publication number
CN117745825A
Authority
CN
China
Prior art keywords
network
pose
cascade
target
refinement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311773270.1A
Other languages
Chinese (zh)
Inventor
谢越琛
谢晋
钱建军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202311773270.1A priority Critical patent/CN117745825A/en
Publication of CN117745825A publication Critical patent/CN117745825A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of image recognition and discloses a monocular object-level pose estimation method based on cascade-refined correspondence maps, comprising the following steps: S001, obtaining a visible light image of the target object scene; S002, cropping the target image block from the visible light image of the target object scene; S003, obtaining a coarse dense correspondence between the visible light image of the target object scene and the target model; S004, inputting the coarse dense correspondence into a cascade network and refining it to obtain a fine correspondence; S005, passing the fine correspondence through a pose regression network to obtain the pose of the target object. The method establishes an end-to-end pose estimation network, makes effective use of the available RGB information, and improves the accuracy of correspondence generation by extracting pooled features and adopting a cascaded multi-step generation scheme, thereby markedly improving 6D pose estimation accuracy and effectively solving the problem that high-precision requirements cannot be met.

Description

Monocular object-level pose estimation method based on cascade-refined correspondence maps
Technical Field
The invention belongs to the technical field of image recognition and particularly relates to a monocular object-level pose estimation method based on cascade-refined correspondence maps.
Background
With the continuous development of robotics, robot environment perception based on three-dimensional vision has penetrated into fields such as intelligent manufacturing and intelligent logistics owing to its wide range of applications. The task of 6D object pose estimation is to estimate the rigid transformation from the object coordinate system to the camera coordinate system, i.e. the rotation and translation between the two coordinate systems. Object pose estimation has strong practical significance; 6D object pose estimation has drawn increasing attention from researchers and is used ever more widely in practical applications such as robotic manipulation, autonomous driving and augmented reality.
Among methods that predict the pose of an object instance given the three-dimensional model of that object, the mainstream approaches can be divided into the following categories: correspondence-based methods, template-based methods, and so on. Correspondence-based methods search for a correspondence between the observation and the object model and then obtain the 6D pose of the object through a PnP algorithm. Template-based methods select, from templates annotated with 6D poses, the template closest to the current observation and take its 6D pose as the prediction.
Most current methods require a depth camera to obtain geometric information, which entails additional sensors, limits the application range and increases detection cost; methods based only on RGB therefore have great development prospects. Correspondence-based methods mainly learn the correspondence from RGB information with an encoder-decoder framework built on convolutional neural networks. However, the current mainstream methods obtain the correspondence with a single-step encode-decode pass; limited by the characteristics of the convolutional downsampling operation, such single-step generation often suffers from problems such as blurred edges and cannot produce a high-quality correspondence, which indirectly leads to low pose estimation accuracy and fails to meet high-precision requirements.
On this basis, a monocular object-level pose estimation method based on cascade-refined correspondence maps is researched and developed.
Disclosure of Invention
The invention provides a monocular object-level pose estimation method based on cascade-refined correspondence maps. It constructs an end-to-end pose estimation network, makes effective use of the available RGB information, and improves the accuracy of correspondence generation by extracting pooled features and adopting a cascaded multi-step generation scheme. This markedly improves 6D pose estimation accuracy and effectively solves the problems of single-step encode-decode generation, namely blurred edges, the inability to generate a high-quality correspondence, and the resulting failure to meet high-precision requirements.
A monocular object-level pose estimation method based on cascade-refined correspondence maps comprises the following steps:
S001, obtaining a visible light image of a target object scene;
S002, cropping the target image block from the visible light image of the target object scene to obtain a cropped picture i_0;
S003, scaling the picture i_0 to a resolution of 256×256, inputting it into a ResNet34 encoding network E_1, and inputting the resulting 8×8 feature map into a decoding network to obtain a coarse dense correspondence c_0 of size 64×64 between the visible light image of the target object scene and the target model;
S004, inputting the coarse dense correspondence c_0 obtained in step S003 into a cascade network and refining it to obtain a fine correspondence;
S005, passing the fine correspondence obtained in step S004 through a pose regression network to obtain the pose of the target object.
Optionally, in step S001, a visible light image of the target object scene is obtained by a visible light camera.
Optionally, in step S002, the target image block is cropped from the visible light image of the target object scene by the target detection algorithm.
Optionally, in step S005, the pose regression network is a pose regression network with a convolution structure.
Optionally, in step S004, inputting the coarse dense correspondence c_0 obtained in step S003 into the cascade network and refining it to obtain the fine correspondence comprises the following steps:
Step S0041, sampling the coarse dense correspondence c_0 obtained in step S003 to the same resolution as the cropped picture, denoted c_0_256, and inputting it together with the picture i_0 into a ResNet34 encoding network E_0 to obtain features f_0 at different scales, the computation step being expressed as:
{f_0_32, f_0_64, f_0_128} = E_0(i_0, c_0_256),
f_0 = {f_0_32, f_0_64, f_0_128},
f_0_32 is downsampled by pooling operations with different strides to resolutions 1, 2, 3 and 6 to obtain features at different scales, which are upsampled by bilinear interpolation to the common resolution 32×32 and concatenated to obtain a feature map f_0_32_1; the feature map f_0_32_1 is input to a decoding network D_1 to obtain the first refined correspondence c_1, c_1 being expressed as:
c_1 = D_1(pool(f_0_32))
Step S0042, sampling the first refined correspondence c_1 obtained in step S0041 by bilinear interpolation to the same resolution as the cropped picture, denoted c_1_256, and inputting it together with i_0 and the coarse dense correspondence c_0_256 obtained in step S0041 into a ResNet34 encoding network E_2 to obtain features f_1 at different scales, the computation step being expressed as:
{f_1_32, f_1_64, f_1_128} = E_2(i_0, c_0_256, c_1_256),
f_1 = {f_1_32, f_1_64, f_1_128},
f_1_32 is downsampled by pooling operations with different strides to resolutions 1, 2, 3 and 6 to obtain features at different scales, which are upsampled by bilinear interpolation to the common resolution 32×32 and concatenated to obtain a feature map f_1_32_1; the feature map, together with f_1_64, is input to a decoding network D_2 of pure convolutional architecture to obtain the second refined correspondence c_2, the computation step being expressed as:
c_2 = D_2(pool(f_1_32), f_1_64)
Step S0044, defining the gap L between the reconstructed correspondence and the true correspondence c as:
L = Σ_i ‖ M_vis ⊙ (c_i − c) ‖,
where M_vis denotes the pixels visible in the target image block during the reconstruction step, ⊙ denotes element-wise multiplication of matrices, c_i is the correspondence map obtained at the i-th refinement in step S004, and c_0 denotes the coarse map obtained in step S003.
Optionally, in step S0044, the gap L between the reconstructed correspondence and the true correspondence is set as the reconstruction loss, which is the sum of the norms of the differences between the output c_i of each step and the true correspondence c.
Optionally, step S0042 is followed by:
Step S0043, sampling the coarse dense correspondence c_0_256 obtained in step S003, the dense correspondence c_1 obtained in step S0041 and the dense correspondence c_2 obtained in step S0042 by bilinear interpolation to the same resolution as the cropped picture, and inputting them together with the picture i_0 into a ResNet34 encoding network E_3 to obtain features f_2 at different scales, the computation step being expressed as:
{f_2_32, f_2_64, f_2_128} = E_3(i_0, c_0_256, c_1_256, c_2_256),
f_2 = {f_2_32, f_2_64, f_2_128},
f_2_32 is downsampled by pooling operations with different strides to resolutions 1, 2, 3 and 6 to obtain features at different scales, which are upsampled by bilinear interpolation to the common resolution 32×32 and concatenated to obtain a feature map f_2_32_1; the feature map, together with f_2_64 and f_2_128, is input to a decoding network D_3 of pure convolutional architecture to obtain the third refined correspondence c_3, the computation step being expressed as:
c_3 = D_3(pool(f_2_32), f_2_64, f_2_128)
optionally, the ResNet34 encoding network is a ResNet34 architecture encoding network.
Optionally, in step S005, the cascade-refined correspondence is input into a pose regression network with a convolutional structure, or solved with a PnP algorithm, to obtain the pose parameters P; when the pose parameters are obtained by regression, they are represented by a 9-dimensional vector, of which 3 dimensions represent the translation parameters t_site:
t_site = (dx, dy, t_z),
where dx and dy denote the offset from the center of the target detection box to the center of the object, and t_z denotes the zoomed depth;
according to the target detection result obtained in step S002, i.e. the picture i_0, the position (C_X, C_Y) of the target detection box in the original image and the target image block size (h, w) can be obtained, and the respective parameters of t_site are expressed as follows,
where (O_X, O_Y) and (C_X, C_Y) are the center of the object and the center of the object image block within the target image block, (H, W) is the image size of the scene where the target object is located, and r = max(W, H)/max(w, h) is the scaling ratio.
Optionally, in step S005, the other 6 dimensions of the pose parameters P represent the rotation parameters; the first two columns of the rotation matrix R ∈ SO(3) are taken to represent the rotation, R_6d = [R_1 | R_2]; the predicted 6-dimensional vector is denoted [r_1 | r_2], and thus R = [R_1 | R_2 | R_3] can be calculated by the following formula,
where f denotes the vector normalization operation; the center offset, the predicted depth error and the average distance of the point cloud after the rotation transformation are used as supervisory signals to train the network, with the loss function expressed as follows,
where R denotes the rotation matrix, M denotes the mask of the target object in the picture i_0, and X denotes a point cloud representation of the object model, typically of shape N×3, where N is the number of points and 3 is the coordinates of a point in three-dimensional space.
Compared with the prior art, the invention has the beneficial effects that:
according to the technical scheme, an end-to-end pose estimation network is established, existing RGB information is effectively utilized, pooling features are extracted, a cascading multi-step generation mode is adopted to improve corresponding generation precision, and therefore 6D pose estimation accuracy is remarkably improved, the problem that the corresponding is obtained only by adopting a single-step encoding and decoding generation mode and the characteristic of convolution downsampling operation is limited is effectively solved, the problem that edge blurring and the like often exist in the single-step generation method, and high-quality corresponding relation cannot be generated, so that pose estimation precision is low and high precision requirements cannot be met is indirectly caused.
Drawings
FIG. 1 shows the high-precision correspondence maps generated in the simulation experiment of Example 1;
FIG. 2 is a schematic block flow diagram of the method described herein;
FIG. 3 is a schematic block diagram of the logic flow of the method described in Example 1.
Detailed Description of the Embodiments
the following further details the invention with reference to examples for the purpose of making the objects, technical solutions and advantages of the invention more clear, the illustrative embodiments of the invention and their description are only for the purpose of explaining the invention and are not to be construed as limiting the invention.
Example 1:
As shown in FIGS. 1-3, a monocular object-level pose estimation method based on cascade-refined correspondence maps comprises the following steps:
S001, obtaining a visible light image of a target object scene;
In step S001, the visible light image of the target object scene is obtained by a visible light camera.
S002, cropping the target image block from the visible light image of the target object scene to obtain a cropped picture i_0;
In step S002, the target image block is cropped from the visible light image of the target object scene by a target detection algorithm.
S003, scaling the picture i_0 to a resolution of 256×256, inputting it into a ResNet34 encoding network E_1, and inputting the resulting 8×8 feature map into a decoding network to obtain a coarse dense correspondence c_0 of size 64×64 between the visible light image of the target object scene and the target model.
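As an illustration of step S003 only, the following PyTorch-style sketch mirrors the shapes described above (a 256×256 crop encoded down to an 8×8 feature map and decoded to a 64×64 dense correspondence c_0). The small strided-convolution encoder is only a stand-in for the ResNet34 encoder E_1, and the assumption that the correspondence map has 3 channels (e.g. normalized object coordinates per pixel) is not stated above.

# Hypothetical sketch of step S003 (coarse correspondence stage), assuming PyTorch.
import torch
import torch.nn as nn

class CoarseCorrespondenceNet(nn.Module):
    def __init__(self, feat_dim=512, corr_channels=3):
        super().__init__()
        # Stand-in for the ResNet34 encoder E_1: 256x256 -> 8x8 (overall stride 32).
        layers, ch_in = [], 3
        for ch_out in (64, 128, 256, 512, feat_dim):
            layers += [nn.Conv2d(ch_in, ch_out, 3, stride=2, padding=1),
                       nn.BatchNorm2d(ch_out), nn.ReLU(inplace=True)]
            ch_in = ch_out
        self.encoder = nn.Sequential(*layers)
        # Decoder: 8x8 feature map -> 64x64 coarse correspondence map c_0.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat_dim, 256, 4, stride=2, padding=1),  # 8 -> 16
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),       # 16 -> 32
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),        # 32 -> 64
            nn.ReLU(inplace=True),
            nn.Conv2d(64, corr_channels, 3, padding=1),
        )

    def forward(self, i0):              # i0: (B, 3, 256, 256) cropped image
        feat_8x8 = self.encoder(i0)     # (B, feat_dim, 8, 8)
        c0 = self.decoder(feat_8x8)     # (B, corr_channels, 64, 64)
        return c0

# Usage: c0 = CoarseCorrespondenceNet()(torch.randn(1, 3, 256, 256))  # -> (1, 3, 64, 64)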
S004, inputting the coarse dense correspondence c_0 into a cascade network and refining it to obtain a fine correspondence;
In step S004, inputting the coarse dense correspondence c_0 obtained in step S003 into the cascade network and refining it to obtain the fine correspondence comprises the following steps:
Step S0041, sampling the coarse dense correspondence c_0 obtained in step S003 to the same resolution as the cropped picture, denoted c_0_256, and inputting it together with the picture i_0 into a ResNet34 encoding network E_0 to obtain features f_0 at different scales, the computation step being expressed as:
{f_0_32, f_0_64, f_0_128} = E_0(i_0, c_0_256),
f_0 = {f_0_32, f_0_64, f_0_128},
f_0_32 is downsampled by pooling operations with different strides to resolutions 1, 2, 3 and 6 to obtain features at different scales, which are upsampled by bilinear interpolation to the common resolution 32×32 and concatenated to obtain a feature map f_0_32_1; the feature map f_0_32_1 is input to a decoding network D_1 to obtain the first refined correspondence c_1, c_1 being expressed as:
c_1 = D_1(pool(f_0_32))
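A minimal PyTorch-style sketch of the pooling operation and the first refinement decoder of step S0041 follows. pool() mirrors the description (pooling f_0_32 to resolutions 1, 2, 3 and 6, bilinear upsampling back to 32×32 and concatenation into f_0_32_1); whether the original 32×32 map is included in the concatenation, the channel sizes, the internal layers of D_1 and the 64×64 output resolution of c_1 are all assumptions.

# Hypothetical sketch of the pooling refinement in step S0041, assuming PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

def pyramid_pool(f_32, bins=(1, 2, 3, 6)):
    """f_32: (B, C, 32, 32) -> concatenation of multi-scale pooled features (f_0_32_1)."""
    h, w = f_32.shape[-2:]
    pooled = [f_32]                                            # original map kept (assumption)
    for b in bins:
        p = F.adaptive_avg_pool2d(f_32, output_size=b)         # (B, C, b, b)
        p = F.interpolate(p, size=(h, w), mode='bilinear',
                          align_corners=False)                 # back to 32x32
        pooled.append(p)
    return torch.cat(pooled, dim=1)

class RefineDecoderD1(nn.Module):
    """Assumed convolutional decoder D_1: pooled 32x32 features -> 64x64 correspondence c_1."""
    def __init__(self, in_ch, corr_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),   # 32 -> 64
            nn.ReLU(inplace=True),
            nn.Conv2d(128, corr_channels, 3, padding=1),
        )

    def forward(self, f_0_32):
        return self.net(pyramid_pool(f_0_32))                   # c_1

# Usage: c1 = RefineDecoderD1(in_ch=256 * 5)(torch.randn(1, 256, 32, 32))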
Step S0042, sampling the first refined correspondence c_1 obtained in step S0041 by bilinear interpolation to the same resolution as the cropped picture, denoted c_1_256, and inputting it together with i_0 and the coarse dense correspondence c_0_256 obtained in step S0041 into a ResNet34 encoding network E_2 to obtain features f_1 at different scales, the computation step being expressed as:
{f_1_32, f_1_64, f_1_128} = E_2(i_0, c_0_256, c_1_256),
f_1 = {f_1_32, f_1_64, f_1_128},
f_1_32 is downsampled by pooling operations with different strides to resolutions 1, 2, 3 and 6 to obtain features at different scales, which are upsampled by bilinear interpolation to the common resolution 32×32 and concatenated to obtain a feature map f_1_32_1; the feature map, together with f_1_64, is input to a decoding network D_2 of pure convolutional architecture to obtain the second refined correspondence c_2, the computation step being expressed as:
c_2 = D_2(pool(f_1_32), f_1_64)
Step S0044, defining the gap L between the reconstructed correspondence and the true correspondence c as:
L = Σ_i ‖ M_vis ⊙ (c_i − c) ‖,
where M_vis denotes the pixels visible in the target image block during the reconstruction step, ⊙ denotes element-wise multiplication of matrices, c_i is the correspondence map obtained at the i-th refinement in step S004, and c_0 denotes the coarse map obtained in step S003;
Step S0043, sampling the coarse dense correspondence c_0_256 obtained in step S003, the dense correspondence c_1 obtained in step S0041 and the dense correspondence c_2 obtained in step S0042 by bilinear interpolation to the same resolution as the cropped picture, and inputting them together with the picture i_0 into a ResNet34 encoding network E_3 to obtain features f_2 at different scales, the computation step being expressed as:
{f_2_32, f_2_64, f_2_128} = E_3(i_0, c_0_256, c_1_256, c_2_256),
f_2 = {f_2_32, f_2_64, f_2_128},
f_2_32 is downsampled by pooling operations with different strides to resolutions 1, 2, 3 and 6 to obtain features at different scales, which are upsampled by bilinear interpolation to the common resolution 32×32 and concatenated to obtain a feature map f_2_32_1; the feature map, together with f_2_64 and f_2_128, is input to a decoding network D_3 of pure convolutional architecture to obtain the third refined correspondence c_3, the computation step being expressed as:
c_3 = D_3(pool(f_2_32), f_2_64, f_2_128)
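The three refinement steps share the same pattern and can be written as a single cascade loop. The outline below is hypothetical; it assumes PyTorch and the pyramid_pool helper from the previous sketch, and the encoders E_k and decoders D_k are passed in as callables with assumed interfaces (the patent uses ResNet34 encoders and pure convolutional decoders). The skip-connection schedule follows the description: no skip at the first refinement, f_64 at the second, f_64 and f_128 at the third.

# Hypothetical sketch of the cascade of steps S0041-S0043, assuming PyTorch.
import torch
import torch.nn.functional as F

def cascade_refine(i0, c0, encoders, decoders, crop_size=256):
    """
    i0: (B, 3, 256, 256) cropped image; c0: (B, 3, 64, 64) coarse correspondence.
    encoders[k](x) is assumed to return a dict {'32': f_32, '64': f_64, '128': f_128}.
    decoders[k](pooled_32, skips) is assumed to return the refined correspondence c_{k+1}.
    """
    correspondences = [c0]
    for k, (E_k, D_k) in enumerate(zip(encoders, decoders)):
        # Upsample every correspondence produced so far to the crop resolution.
        ups = [F.interpolate(c, size=(crop_size, crop_size), mode='bilinear',
                             align_corners=False) for c in correspondences]
        x = torch.cat([i0] + ups, dim=1)            # image + all previous correspondence maps
        feats = E_k(x)                              # multi-scale features f_k
        pooled = pyramid_pool(feats['32'])          # pooled 32x32 features (previous sketch)
        skips = [feats['64'], feats['128']][:k]     # stage 1: none, stage 2: f_64, stage 3: f_64 + f_128
        c_next = D_k(pooled, skips)
        correspondences.append(c_next)
    return correspondences                          # [c_0, c_1, c_2, c_3]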
Here, the ResNet34 encoding networks are encoding networks with the ResNet34 architecture.
In step S0044, the gap L between the reconstructed correspondence and the true correspondence is set as the reconstruction loss, which is the sum of the norms of the differences between the output c_i of each step and the true correspondence c.
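A minimal sketch of this reconstruction loss, assuming PyTorch, follows. The description only states a sum of norms of the masked differences; the choice of an L1 norm and the assumption that all correspondence maps are compared at a common resolution are the sketch's own.

# Hypothetical sketch of the reconstruction loss of step S0044, assuming PyTorch.
import torch

def reconstruction_loss(pred_corrs, gt_corr, vis_mask):
    """
    pred_corrs: list of (B, C, H, W) predicted correspondence maps [c_1, c_2, c_3, ...]
    gt_corr:    (B, C, H, W) ground-truth correspondence c (common resolution assumed)
    vis_mask:   (B, 1, H, W) mask M_vis of pixels visible in the target image block
    """
    loss = 0.0
    for c_i in pred_corrs:
        loss = loss + (vis_mask * (c_i - gt_corr)).abs().sum()  # element-wise mask, L1 norm assumed
    return loss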
S005, passing the fine correspondence obtained in step S004 through a pose regression network to obtain the pose of the target object.
In step S005, the pose regression network is a pose regression network with a convolution structure.
In step S005, the cascade-refined correspondence is input into a pose regression network with a convolutional structure, or solved with a PnP algorithm, to obtain the pose parameters P; when the pose parameters are obtained by regression, they are represented by a 9-dimensional vector, of which 3 dimensions represent the translation parameters t_site:
t_site = (dx, dy, t_z),
where dx and dy denote the offset from the center of the target detection box to the center of the object, and t_z denotes the zoomed depth;
according to the target detection result obtained in step S002, i.e. the picture i_0, the position (C_X, C_Y) of the target detection box in the original image and the target image block size (h, w) can be obtained, and the respective parameters of t_site are expressed as follows,
where (O_X, O_Y) and (C_X, C_Y) are the center of the object and the center of the object image block within the target image block, (H, W) is the image size of the scene where the target object is located, and r = max(W, H)/max(w, h) is the scaling ratio.
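The concrete formulas for dx, dy and t_z are given as equation images in the original and are not reproduced above. The sketch below therefore shows only one common scale-invariant parameterization consistent with the description (normalized offsets from the detection-box center, a depth rescaled by the ratio r, and a standard pinhole back-projection using camera intrinsics, which the patent text does not mention); it is an assumption, not the patent's exact formula.

# Hypothetical recovery of the 3D translation from (dx, dy, t_z); all normalizations are assumed.
def recover_translation(dx, dy, t_z_scaled, box_center, box_size, img_size, K):
    C_X, C_Y = box_center          # detection-box center in the original image
    w, h = box_size                # target image block size
    W, H = img_size                # scene image size
    r = max(W, H) / max(w, h)      # scaling ratio as described above
    O_X = C_X + dx * w             # object center in the original image (assumed normalization)
    O_Y = C_Y + dy * h
    t_z = t_z_scaled * r           # undo the depth scaling (direction of the scaling is assumed)
    fx, fy = K[0][0], K[1][1]      # camera intrinsics: focal lengths
    px, py = K[0][2], K[1][2]      # camera intrinsics: principal point
    t_x = (O_X - px) * t_z / fx    # back-project the recovered 2D object center to 3D
    t_y = (O_Y - py) * t_z / fy
    return t_x, t_y, t_z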
In step S005, the other 6 dimensions of the pose parameters P represent the rotation parameters; the first two columns of the rotation matrix R ∈ SO(3) are taken to represent the rotation, R_6d = [R_1 | R_2]; the predicted 6-dimensional vector is denoted [r_1 | r_2], and thus R = [R_1 | R_2 | R_3] can be calculated by the following formula,
where f denotes the vector normalization operation; the center offset, the predicted depth error and the average distance of the point cloud after the rotation transformation are used as supervisory signals to train the network, with the loss function expressed as follows,
where R denotes the rotation matrix, M denotes the mask of the target object in the picture i_0, and X denotes a point cloud representation of the object model, typically of shape N×3, where N is the number of points and 3 is the coordinates of a point in three-dimensional space.
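The rotation-recovery and loss formulas are likewise given as images in the original. The sketch below, assuming PyTorch, shows the usual Gram-Schmidt-style reading of the 6-dimensional representation (Zhou et al., "On the Continuity of Rotation Representations in Neural Networks"), in which f is vector normalization, together with a point-matching loss built from the three supervisory signals named above; the loss weights and the exact term definitions are assumptions.

# Hypothetical sketch of 6D rotation recovery and a point-matching pose loss, assuming PyTorch.
import torch
import torch.nn.functional as F

def rotation_from_6d(r6d):
    """r6d: (B, 6) predicted vector [r_1 | r_2] -> (B, 3, 3) rotation matrix."""
    r1, r2 = r6d[:, 0:3], r6d[:, 3:6]
    R1 = F.normalize(r1, dim=-1)                                   # f: vector normalization
    R2 = F.normalize(r2 - (R1 * r2).sum(-1, keepdim=True) * R1, dim=-1)
    R3 = torch.cross(R1, R2, dim=-1)
    return torch.stack([R1, R2, R3], dim=-1)                       # columns [R_1 | R_2 | R_3]

def pose_loss(R_pred, t_pred, R_gt, t_gt, X, w_rot=1.0, w_center=1.0, w_z=1.0):
    """X: (N, 3) object model point cloud; the three terms mirror the description:
    average distance of the rotated point cloud, center offset, and depth error."""
    pts_pred = X @ R_pred.transpose(-1, -2)                        # rotate the model points
    pts_gt = X @ R_gt.transpose(-1, -2)
    loss_rot = (pts_pred - pts_gt).norm(dim=-1).mean()             # average point distance
    loss_center = (t_pred[..., :2] - t_gt[..., :2]).abs().mean()   # center offset
    loss_z = (t_pred[..., 2] - t_gt[..., 2]).abs().mean()          # depth error
    return w_rot * loss_rot + w_center * loss_center + w_z * loss_z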
In this embodiment, the adopted correspondence-based approach learns the correspondence from RGB information mainly through an encoder-decoder framework built on convolutional neural networks. It establishes an end-to-end pose estimation network, makes effective use of the available RGB information, and improves the accuracy of correspondence generation by extracting pooled features and adopting a cascaded multi-step generation scheme, so that a high-quality correspondence is generated, the 6D pose estimation accuracy is markedly improved, and high-precision requirements are met.
Here, the convolutional layer is a common layer type in neural networks. A convolutional layer uses convolution operations to extract features from the input data, progressively abstracting and compressing the information so as to analyze and process data such as images effectively.
The following describes a simulation experiment on the effect achieved by this embodiment in practical application.
Simulation conditions:
the simulation experiment of this embodiment adopts a mainstream actual data set:
LINEMOD data set
The LINEMOD dataset is a dataset for 6D object pose estimation, proposed by Hinterstonisiser et al in 2012 at ACCV meeting. The dataset contained 15 home items with no texture or insignificant color, such as coffee cups, soy sauce bottles, cans, etc. Each item has its corresponding 3D model and rendered image, as well as real images taken from different angles. Each image is annotated with a 6D pose of the object including a rotation matrix and translation vector, a 2D bounding box and a 2D binary mask.
The simulation content:
the present embodiment uses a real dataset to verify the effect of the method of the present invention. In order to test the performance of the algorithm, the proposed monocular object grade and pose estimation method of the cascade refined corresponding graph is compared with the currently internationally popular visible light image object grade and pose estimation method algorithm. The comparative method includes the method described in the paper GDR-Net published by Gu Wang et al, international top-level visual conference CVPR 2021.
The evaluation indexes adopted by the invention are ADD indexes and translational rotation metrics, wherein ADD is a metric for evaluating the performance of a 6D object attitude estimation algorithm, and is proposed by Hiterstonisser et al in 2012 at ACCV meeting. The ADD index means Average Distance, i.e., average Distance. The calculation method is that the real object posture (rotation matrix and translation vector) and the predicted object posture are known, the 3D model point cloud of the object is respectively transformed under a camera coordinate system by the two postures, the Euclidean distance between each corresponding point is calculated, and the average value of all the point distances is calculated. If this average is less than some set threshold, typically 10% of the object diameter, denoted ADD0.1D, the predicted pose is considered correct. Another evaluation index is that the transformed relative translation and rotation are smaller than the specified values, and the predicted gesture is considered to be correct.
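For reference only, a short NumPy sketch of the ADD computation described above; the function name and return convention are assumptions.

# Hypothetical sketch of the ADD metric, assuming NumPy.
import numpy as np

def add_metric(R_gt, t_gt, R_pred, t_pred, model_points, diameter, threshold_ratio=0.1):
    """model_points: (N, 3) object model point cloud; diameter: object diameter."""
    pts_gt = model_points @ R_gt.T + t_gt            # transform model points with the real pose
    pts_pred = model_points @ R_pred.T + t_pred      # transform model points with the predicted pose
    add = np.linalg.norm(pts_gt - pts_pred, axis=1).mean()  # average point-wise distance
    return add, add < threshold_ratio * diameter     # ADD value and ADD0.1D-style correctness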
Simulation experiment result analysis:
evaluation index on MOD dataset
The comparison in the index value table 1 shows that the method has better index value, whether the neural network is adopted to return the pose parameter or the traditional PnP algorithm is adopted to solve the pose parameter, the method is obviously superior to other methods, namely the method can generate finer correspondence, and further the precision of pose estimation is improved. It can be seen intuitively in fig. 1 that the corresponding map generated by the method described in this embodiment 1 has higher accuracy.
The foregoing description of the embodiments is intended to illustrate the general principles of the invention and is not meant to limit the scope of the invention to the particular embodiments; any modifications, equivalents, improvements and the like made within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (10)

1. A monocular object-level pose estimation method based on cascade-refined correspondence maps, characterized by comprising the following steps:
S001, obtaining a visible light image of a target object scene;
S002, cropping the target image block from the visible light image of the target object scene to obtain a cropped picture i_0;
S003, scaling the picture i_0 to a resolution of 256×256, inputting it into a ResNet34 encoding network E_1, and inputting the resulting 8×8 feature map into a decoding network to obtain a coarse dense correspondence c_0 of size 64×64 between the visible light image of the target object scene and the target model;
S004, inputting the coarse dense correspondence c_0 obtained in step S003 into a cascade network and refining it to obtain a fine correspondence;
S005, passing the fine correspondence obtained in step S004 through a pose regression network to obtain the pose of the target object.
2. The monocular object-level pose estimation method based on cascade-refined correspondence maps according to claim 1, characterized in that: in step S001, the visible light image of the target object scene is obtained by a visible light camera.
3. The monocular object-level pose estimation method based on cascade-refined correspondence maps according to claim 1, characterized in that: in step S002, the target image block is cropped from the visible light image of the target object scene by a target detection algorithm.
4. The monocular object-level pose estimation method based on cascade-refined correspondence maps according to claim 1, characterized in that: in step S005, the pose regression network is a pose regression network with a convolutional structure.
5. The monocular object-level pose estimation method based on cascade-refined correspondence maps according to claim 1, characterized in that: in step S004, inputting the coarse dense correspondence c_0 obtained in step S003 into the cascade network and refining it to obtain the fine correspondence comprises the following steps:
step S0041, sampling the coarse dense correspondence c_0 obtained in step S003 to the same resolution as the cropped picture, denoted c_0_256, and inputting it together with the picture i_0 into a ResNet34 encoding network E_0 to obtain features f_0 at different scales, the computation step being expressed as:
{f_0_32, f_0_64, f_0_128} = E_0(i_0, c_0_256),
f_0 = {f_0_32, f_0_64, f_0_128},
f_0_32 is downsampled by pooling operations with different strides to resolutions 1, 2, 3 and 6 to obtain features at different scales, which are upsampled by bilinear interpolation to the common resolution 32×32 and concatenated to obtain a feature map f_0_32_1; the feature map f_0_32_1 is input to a decoding network D_1 to obtain the first refined correspondence c_1, c_1 being expressed as:
c_1 = D_1(pool(f_0_32))
step S0042, sampling the first refined correspondence c_1 obtained in step S0041 by bilinear interpolation to the same resolution as the cropped picture, denoted c_1_256, and inputting it together with i_0 and the coarse dense correspondence c_0_256 obtained in step S0041 into a ResNet34 encoding network E_2 to obtain features f_1 at different scales, the computation step being expressed as:
{f_1_32, f_1_64, f_1_128} = E_2(i_0, c_0_256, c_1_256),
f_1 = {f_1_32, f_1_64, f_1_128},
f_1_32 is downsampled by pooling operations with different strides to resolutions 1, 2, 3 and 6 to obtain features at different scales, which are upsampled by bilinear interpolation to the common resolution 32×32 and concatenated to obtain a feature map f_1_32_1; the feature map, together with f_1_64, is input to a decoding network D_2 of pure convolutional architecture to obtain the second refined correspondence c_2, the computation step being expressed as:
c_2 = D_2(pool(f_1_32), f_1_64)
step S0044, defining the gap L between the reconstructed correspondence and the true correspondence c as:
L = Σ_i ‖ M_vis ⊙ (c_i − c) ‖,
where M_vis denotes the pixels visible in the target image block during the reconstruction step, ⊙ denotes element-wise multiplication of matrices, c_i is the correspondence map obtained at the i-th refinement in step S004, and c_0 denotes the coarse map obtained in step S003.
6. The monocular object-level pose estimation method based on cascade-refined correspondence maps according to claim 5, characterized in that: in step S0044, the gap L between the reconstructed correspondence and the true correspondence is set as the reconstruction loss, which is the sum of the norms of the differences between the output c_i of each step and the true correspondence c.
7. The monocular object-level pose estimation method based on cascade-refined correspondence maps according to claim 5, characterized in that step S0042 is followed by:
step S0043, sampling the coarse dense correspondence c_0_256 obtained in step S003, the dense correspondence c_1 obtained in step S0041 and the dense correspondence c_2 obtained in step S0042 by bilinear interpolation to the same resolution as the cropped picture, and inputting them together with the picture i_0 into a ResNet34 encoding network E_3 to obtain features f_2 at different scales, the computation step being expressed as:
{f_2_32, f_2_64, f_2_128} = E_3(i_0, c_0_256, c_1_256, c_2_256),
f_2 = {f_2_32, f_2_64, f_2_128},
f_2_32 is downsampled by pooling operations with different strides to resolutions 1, 2, 3 and 6 to obtain features at different scales, which are upsampled by bilinear interpolation to the common resolution 32×32 and concatenated to obtain a feature map f_2_32_1; the feature map, together with f_2_64 and f_2_128, is input to a decoding network D_3 of pure convolutional architecture to obtain the third refined correspondence c_3, the computation step being expressed as:
c_3 = D_3(pool(f_2_32), f_2_64, f_2_128).
8. The monocular object-level pose estimation method based on cascade-refined correspondence maps according to claim 5, characterized in that: the ResNet34 encoding networks are encoding networks with the ResNet34 architecture.
9. The monocular object-level pose estimation method based on cascade-refined correspondence maps according to claim 1, characterized in that: in step S005, the cascade-refined correspondence is input into a pose regression network with a convolutional structure, or solved with a PnP algorithm, to obtain the pose parameters P; when the pose parameters are obtained by regression, they are represented by a 9-dimensional vector, of which 3 dimensions represent the translation parameters t_site:
t_site = (dx, dy, t_z),
where dx and dy denote the offset from the center of the target detection box to the center of the object, and t_z denotes the zoomed depth;
according to the target detection result obtained in step S002, i.e. the picture i_0, the position (C_X, C_Y) of the target detection box in the original image and the target image block size (h, w) can be obtained, and the respective parameters of t_site are expressed as follows,
where (O_X, O_Y) and (C_X, C_Y) are the center of the object and the center of the object image block within the target image block, (H, W) is the image size of the scene where the target object is located, and r = max(W, H)/max(w, h) is the scaling ratio.
10. The monocular object-level pose estimation method based on cascade-refined correspondence maps according to claim 9, characterized in that: in step S005, the other 6 dimensions of the pose parameters P represent the rotation parameters; the first two columns of the rotation matrix R ∈ SO(3) are taken to represent the rotation, R_6d = [R_1 | R_2]; the predicted 6-dimensional vector is denoted [r_1 | r_2], and thus R = [R_1 | R_2 | R_3] can be calculated by the following formula,
where f denotes the vector normalization operation; the center offset, the predicted depth error and the average distance of the point cloud after the rotation transformation are used as supervisory signals to train the network, with the loss function expressed as follows,
where R denotes the rotation matrix, M denotes the mask of the target object in the picture i_0, and X denotes a point cloud representation of the object model, typically of shape N×3, where N is the number of points and 3 is the coordinates of a point in three-dimensional space.
CN202311773270.1A 2023-12-21 2023-12-21 Monocular object-level pose estimation method based on cascade-refined correspondence maps Pending CN117745825A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311773270.1A CN117745825A (en) 2023-12-21 2023-12-21 Monocular object-level pose estimation method based on cascade-refined correspondence maps

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311773270.1A CN117745825A (en) 2023-12-21 2023-12-21 Monocular object-level pose estimation method based on cascade-refined correspondence maps

Publications (1)

Publication Number Publication Date
CN117745825A true CN117745825A (en) 2024-03-22

Family

ID=90254316

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311773270.1A Pending CN117745825A (en) 2023-12-21 2023-12-21 Monocular object grade pose estimation method for cascade refinement corresponding diagram

Country Status (1)

Country Link
CN (1) CN117745825A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination