CN117745825A - Monocular object-level pose estimation method based on cascade-refined correspondence maps - Google Patents

Monocular object-level pose estimation method based on cascade-refined correspondence maps

Info

Publication number
CN117745825A
Authority
CN
China
Prior art keywords
network
pose
cascade
target
refinement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311773270.1A
Other languages
Chinese (zh)
Inventor
谢越琛
谢晋
钱建军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202311773270.1A priority Critical patent/CN117745825A/en
Publication of CN117745825A publication Critical patent/CN117745825A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of image recognition and discloses a monocular object-level pose estimation method based on cascade-refined correspondence maps, comprising the following steps: S001, obtaining a visible light image of the target object scene; S002, cropping the target image block from the visible light image of the target object scene; S003, obtaining a coarse dense correspondence between the visible light image of the target object scene and the target model; S004, inputting the coarse dense correspondence into a cascade network and refining it to obtain a fine correspondence; S005, passing the fine correspondence through a pose regression network to obtain the pose of the target object. The method establishes an end-to-end pose estimation network, makes effective use of the available RGB information, and improves the accuracy of correspondence generation by extracting pooled features and adopting a cascaded multi-step generation scheme, thereby markedly improving 6D pose estimation accuracy and effectively solving the problem that high-precision requirements cannot be met.

Description

Monocular object-level pose estimation method based on cascade-refined correspondence maps
Technical Field
The invention belongs to the technical field of image recognition and particularly relates to a monocular object-level pose estimation method based on cascade-refined correspondence maps.
Background
With the continuous development of robotics, robot environment perception based on three-dimensional vision has penetrated into fields such as intelligent manufacturing and intelligent logistics owing to its wide range of applications. The task of 6D object pose estimation is to estimate the rigid transformation from the object coordinate system to the camera coordinate system, i.e. the rotation and translation between the two coordinate systems. Object pose estimation has strong practical significance; 6D object pose estimation has drawn increasing attention from researchers and is used ever more widely in practical applications such as robotic manipulation, autonomous driving and augmented reality.
Among methods that predict the pose of an object instance given the three-dimensional model of that object, the mainstream approaches can be divided into the following categories: correspondence-based methods, template-based methods, and so on. Correspondence-based methods search for a correspondence between the observation and the object model and then obtain the 6D pose of the object through a PnP algorithm. Template-based methods select, from templates annotated with 6D poses, the template closest to the current observation and take its 6D pose as the prediction.
Most current methods require a depth camera to obtain geometric information, which entails additional sensors, limits the application range and increases detection cost; methods based only on RGB therefore have great development prospects. Correspondence-based methods mainly learn the correspondence from RGB information with an encoder-decoder framework built on convolutional neural networks. However, the current mainstream methods obtain the correspondence with a single-step encode-decode pass; limited by the characteristics of the convolutional downsampling operation, such single-step generation often suffers from problems such as blurred edges and cannot produce a high-quality correspondence, which indirectly leads to low pose estimation accuracy and fails to meet high-precision requirements.
On this basis, a monocular object-level pose estimation method based on cascade-refined correspondence maps is researched and developed.
Disclosure of Invention
The invention provides a monocular object-level pose estimation method based on cascade-refined correspondence maps. It constructs an end-to-end pose estimation network, makes effective use of the available RGB information, and improves the accuracy of correspondence generation by extracting pooled features and adopting a cascaded multi-step generation scheme. This markedly improves 6D pose estimation accuracy and effectively solves the problems of single-step encode-decode generation, namely blurred edges, the inability to generate a high-quality correspondence, and the resulting failure to meet high-precision requirements.
A monocular object-level pose estimation method based on cascade-refined correspondence maps comprises the following steps:
S001, obtaining a visible light image of a target object scene;
S002, cropping the target image block from the visible light image of the target object scene to obtain a cropped picture i_0;
S003, scaling the picture i_0 to a resolution of 256×256, inputting it into a ResNet34 encoding network E_1, and inputting the resulting 8×8 feature map into a decoding network to obtain a coarse dense correspondence c_0 of size 64×64 between the visible light image of the target object scene and the target model;
S004, inputting the coarse dense correspondence c_0 obtained in step S003 into a cascade network and refining it to obtain a fine correspondence;
S005, passing the fine correspondence obtained in step S004 through a pose regression network to obtain the pose of the target object.
Optionally, in step S001, a visible light image of the target object scene is obtained by a visible light camera.
Optionally, in step S002, the target image block is cropped from the visible light image of the target object scene by the target detection algorithm.
Optionally, in step S005, the pose regression network is a pose regression network with a convolution structure.
Optionally, in step S004, inputting the coarse dense correspondence c_0 obtained in step S003 into the cascade network and refining it to obtain the fine correspondence comprises the following steps:
Step S0041, sampling the coarse dense correspondence c_0 obtained in step S003 to the same resolution as the cropped picture, denoted c_0_256, and inputting it together with the picture i_0 into a ResNet34 encoding network E_0 to obtain features f_0 at different scales, the computation step being expressed as:
{f_0_32, f_0_64, f_0_128} = E_0(i_0, c_0_256),
f_0 = {f_0_32, f_0_64, f_0_128},
f_0_32 is downsampled by pooling operations with different strides to resolutions 1, 2, 3 and 6 to obtain features at different scales, which are upsampled by bilinear interpolation to the common resolution 32×32 and concatenated to obtain a feature map f_0_32_1; the feature map f_0_32_1 is input to a decoding network D_1 to obtain the first refined correspondence c_1, c_1 being expressed as:
c_1 = D_1(pool(f_0_32))
Step S0042, sampling the first refined correspondence c_1 obtained in step S0041 by bilinear interpolation to the same resolution as the cropped picture, denoted c_1_256, and inputting it together with i_0 and the coarse dense correspondence c_0_256 obtained in step S0041 into a ResNet34 encoding network E_2 to obtain features f_1 at different scales, the computation step being expressed as:
{f_1_32, f_1_64, f_1_128} = E_2(i_0, c_0_256, c_1_256),
f_1 = {f_1_32, f_1_64, f_1_128},
f_1_32 is downsampled by pooling operations with different strides to resolutions 1, 2, 3 and 6 to obtain features at different scales, which are upsampled by bilinear interpolation to the common resolution 32×32 and concatenated to obtain a feature map f_1_32_1; the feature map, together with f_1_64, is input to a decoding network D_2 of pure convolutional architecture to obtain the second refined correspondence c_2, the computation step being expressed as:
c_2 = D_2(pool(f_1_32), f_1_64)
Step S0044, defining the gap L between the reconstructed correspondence and the true correspondence c as:
L = Σ_i ‖ M_vis ⊙ (c_i − c) ‖,
where M_vis denotes the pixels visible in the target image block during the reconstruction step, ⊙ denotes element-wise multiplication of matrices, c_i is the correspondence map obtained at the i-th refinement in step S004, and c_0 denotes the coarse map obtained in step S003.
Optionally, in step S0044, the gap L between the reconstructed correspondence and the true correspondence is set as the reconstruction loss, which is the sum of the norms of the differences between the output c_i of each step and the true correspondence c.
Optionally, step S0042 is followed by:
Step S0043, sampling the coarse dense correspondence c_0_256 obtained in step S003, the dense correspondence c_1 obtained in step S0041 and the dense correspondence c_2 obtained in step S0042 by bilinear interpolation to the same resolution as the cropped picture, and inputting them together with the picture i_0 into a ResNet34 encoding network E_3 to obtain features f_2 at different scales, the computation step being expressed as:
{f_2_32, f_2_64, f_2_128} = E_3(i_0, c_0_256, c_1_256, c_2_256),
f_2 = {f_2_32, f_2_64, f_2_128},
f_2_32 is downsampled by pooling operations with different strides to resolutions 1, 2, 3 and 6 to obtain features at different scales, which are upsampled by bilinear interpolation to the common resolution 32×32 and concatenated to obtain a feature map f_2_32_1; the feature map, together with f_2_64 and f_2_128, is input to a decoding network D_3 of pure convolutional architecture to obtain the third refined correspondence c_3, the computation step being expressed as:
c_3 = D_3(pool(f_2_32), f_2_64, f_2_128)
optionally, the ResNet34 encoding network is a ResNet34 architecture encoding network.
Optionally, in step S005, the cascade-refined correspondence is input into a pose regression network with a convolutional structure, or solved with a PnP algorithm, to obtain the pose parameters P; when the pose parameters are obtained by regression, they are represented by a 9-dimensional vector, of which 3 dimensions represent the translation parameters t_site:
t_site = (dx, dy, t_z),
where dx and dy denote the offset from the center of the target detection box to the center of the object, and t_z denotes the zoomed depth;
according to the target detection result obtained in step S002, i.e. the picture i_0, the position (C_X, C_Y) of the target detection box in the original image and the target image block size (h, w) can be obtained, and the respective parameters of t_site are expressed as follows,
where (O_X, O_Y) and (C_X, C_Y) are the center of the object and the center of the object image block within the target image block, (H, W) is the image size of the scene where the target object is located, and r = max(W, H)/max(w, h) is the scaling ratio.
Optionally, in step S005, the other 6 dimensions of the pose parameters P represent the rotation parameters; the first two columns of the rotation matrix R ∈ SO(3) are taken to represent the rotation, R_6d = [R_1 | R_2]; the predicted 6-dimensional vector is denoted [r_1 | r_2], and thus R = [R_1 | R_2 | R_3] can be calculated by the following formula,
where f denotes the vector normalization operation; the center offset, the predicted depth error and the average distance of the point cloud after the rotation transformation are used as supervisory signals to train the network, with the loss function expressed as follows,
where R denotes the rotation matrix, M denotes the mask of the target object in the picture i_0, and X denotes a point cloud representation of the object model, typically of shape N×3, where N is the number of points and 3 is the coordinates of a point in three-dimensional space.
Compared with the prior art, the invention has the beneficial effects that:
according to the technical scheme, an end-to-end pose estimation network is established, existing RGB information is effectively utilized, pooling features are extracted, a cascading multi-step generation mode is adopted to improve corresponding generation precision, and therefore 6D pose estimation accuracy is remarkably improved, the problem that the corresponding is obtained only by adopting a single-step encoding and decoding generation mode and the characteristic of convolution downsampling operation is limited is effectively solved, the problem that edge blurring and the like often exist in the single-step generation method, and high-quality corresponding relation cannot be generated, so that pose estimation precision is low and high precision requirements cannot be met is indirectly caused.
Drawings
FIG. 1 shows the high-precision correspondence maps generated in the simulation experiment of Example 1;
FIG. 2 is a schematic block flow diagram of the method described herein;
FIG. 3 is a schematic block diagram of the logic flow of the method described in Example 1.
Detailed Description of the Embodiments
the following further details the invention with reference to examples for the purpose of making the objects, technical solutions and advantages of the invention more clear, the illustrative embodiments of the invention and their description are only for the purpose of explaining the invention and are not to be construed as limiting the invention.
Example 1:
As shown in FIGS. 1-3, a monocular object-level pose estimation method based on cascade-refined correspondence maps comprises the following steps:
S001, obtaining a visible light image of a target object scene;
In step S001, the visible light image of the target object scene is obtained by a visible light camera.
S002, cropping the target image block from the visible light image of the target object scene to obtain a cropped picture i_0;
In step S002, the target image block is cropped from the visible light image of the target object scene by a target detection algorithm.
S003, scaling the picture i_0 to a resolution of 256×256, inputting it into a ResNet34 encoding network E_1, and inputting the resulting 8×8 feature map into a decoding network to obtain a coarse dense correspondence c_0 of size 64×64 between the visible light image of the target object scene and the target model.
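As an illustration of step S003 only, the following PyTorch-style sketch mirrors the shapes described above (a 256×256 crop encoded down to an 8×8 feature map and decoded to a 64×64 dense correspondence c_0). The small strided-convolution encoder is only a stand-in for the ResNet34 encoder E_1, and the assumption that the correspondence map has 3 channels (e.g. normalized object coordinates per pixel) is not stated above.

# Hypothetical sketch of step S003 (coarse correspondence stage), assuming PyTorch.
import torch
import torch.nn as nn

class CoarseCorrespondenceNet(nn.Module):
    def __init__(self, feat_dim=512, corr_channels=3):
        super().__init__()
        # Stand-in for the ResNet34 encoder E_1: 256x256 -> 8x8 (overall stride 32).
        layers, ch_in = [], 3
        for ch_out in (64, 128, 256, 512, feat_dim):
            layers += [nn.Conv2d(ch_in, ch_out, 3, stride=2, padding=1),
                       nn.BatchNorm2d(ch_out), nn.ReLU(inplace=True)]
            ch_in = ch_out
        self.encoder = nn.Sequential(*layers)
        # Decoder: 8x8 feature map -> 64x64 coarse correspondence map c_0.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat_dim, 256, 4, stride=2, padding=1),  # 8 -> 16
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),       # 16 -> 32
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),        # 32 -> 64
            nn.ReLU(inplace=True),
            nn.Conv2d(64, corr_channels, 3, padding=1),
        )

    def forward(self, i0):              # i0: (B, 3, 256, 256) cropped image
        feat_8x8 = self.encoder(i0)     # (B, feat_dim, 8, 8)
        c0 = self.decoder(feat_8x8)     # (B, corr_channels, 64, 64)
        return c0

# Usage: c0 = CoarseCorrespondenceNet()(torch.randn(1, 3, 256, 256))  # -> (1, 3, 64, 64)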
S004, inputting the coarse dense correspondence c_0 into a cascade network and refining it to obtain a fine correspondence;
In step S004, inputting the coarse dense correspondence c_0 obtained in step S003 into the cascade network and refining it to obtain the fine correspondence comprises the following steps:
Step S0041, sampling the coarse dense correspondence c_0 obtained in step S003 to the same resolution as the cropped picture, denoted c_0_256, and inputting it together with the picture i_0 into a ResNet34 encoding network E_0 to obtain features f_0 at different scales, the computation step being expressed as:
{f_0_32, f_0_64, f_0_128} = E_0(i_0, c_0_256),
f_0 = {f_0_32, f_0_64, f_0_128},
f_0_32 is downsampled by pooling operations with different strides to resolutions 1, 2, 3 and 6 to obtain features at different scales, which are upsampled by bilinear interpolation to the common resolution 32×32 and concatenated to obtain a feature map f_0_32_1; the feature map f_0_32_1 is input to a decoding network D_1 to obtain the first refined correspondence c_1, c_1 being expressed as:
c_1 = D_1(pool(f_0_32))
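A minimal PyTorch-style sketch of the pooling operation and the first refinement decoder of step S0041 follows. pool() mirrors the description (pooling f_0_32 to resolutions 1, 2, 3 and 6, bilinear upsampling back to 32×32 and concatenation into f_0_32_1); whether the original 32×32 map is included in the concatenation, the channel sizes, the internal layers of D_1 and the 64×64 output resolution of c_1 are all assumptions.

# Hypothetical sketch of the pooling refinement in step S0041, assuming PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

def pyramid_pool(f_32, bins=(1, 2, 3, 6)):
    """f_32: (B, C, 32, 32) -> concatenation of multi-scale pooled features (f_0_32_1)."""
    h, w = f_32.shape[-2:]
    pooled = [f_32]                                            # original map kept (assumption)
    for b in bins:
        p = F.adaptive_avg_pool2d(f_32, output_size=b)         # (B, C, b, b)
        p = F.interpolate(p, size=(h, w), mode='bilinear',
                          align_corners=False)                 # back to 32x32
        pooled.append(p)
    return torch.cat(pooled, dim=1)

class RefineDecoderD1(nn.Module):
    """Assumed convolutional decoder D_1: pooled 32x32 features -> 64x64 correspondence c_1."""
    def __init__(self, in_ch, corr_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),   # 32 -> 64
            nn.ReLU(inplace=True),
            nn.Conv2d(128, corr_channels, 3, padding=1),
        )

    def forward(self, f_0_32):
        return self.net(pyramid_pool(f_0_32))                   # c_1

# Usage: c1 = RefineDecoderD1(in_ch=256 * 5)(torch.randn(1, 256, 32, 32))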
Step S0042, sampling the first refined correspondence c_1 obtained in step S0041 by bilinear interpolation to the same resolution as the cropped picture, denoted c_1_256, and inputting it together with i_0 and the coarse dense correspondence c_0_256 obtained in step S0041 into a ResNet34 encoding network E_2 to obtain features f_1 at different scales, the computation step being expressed as:
{f_1_32, f_1_64, f_1_128} = E_2(i_0, c_0_256, c_1_256),
f_1 = {f_1_32, f_1_64, f_1_128},
f_1_32 is downsampled by pooling operations with different strides to resolutions 1, 2, 3 and 6 to obtain features at different scales, which are upsampled by bilinear interpolation to the common resolution 32×32 and concatenated to obtain a feature map f_1_32_1; the feature map, together with f_1_64, is input to a decoding network D_2 of pure convolutional architecture to obtain the second refined correspondence c_2, the computation step being expressed as:
c_2 = D_2(pool(f_1_32), f_1_64)
Step S0044, defining the gap L between the reconstructed correspondence and the true correspondence c as:
L = Σ_i ‖ M_vis ⊙ (c_i − c) ‖,
where M_vis denotes the pixels visible in the target image block during the reconstruction step, ⊙ denotes element-wise multiplication of matrices, c_i is the correspondence map obtained at the i-th refinement in step S004, and c_0 denotes the coarse map obtained in step S003;
Step S0043, sampling the coarse dense correspondence c_0_256 obtained in step S003, the dense correspondence c_1 obtained in step S0041 and the dense correspondence c_2 obtained in step S0042 by bilinear interpolation to the same resolution as the cropped picture, and inputting them together with the picture i_0 into a ResNet34 encoding network E_3 to obtain features f_2 at different scales, the computation step being expressed as:
{f_2_32, f_2_64, f_2_128} = E_3(i_0, c_0_256, c_1_256, c_2_256),
f_2 = {f_2_32, f_2_64, f_2_128},
f_2_32 is downsampled by pooling operations with different strides to resolutions 1, 2, 3 and 6 to obtain features at different scales, which are upsampled by bilinear interpolation to the common resolution 32×32 and concatenated to obtain a feature map f_2_32_1; the feature map, together with f_2_64 and f_2_128, is input to a decoding network D_3 of pure convolutional architecture to obtain the third refined correspondence c_3, the computation step being expressed as:
c_3 = D_3(pool(f_2_32), f_2_64, f_2_128)
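The three refinement steps share the same pattern and can be written as a single cascade loop. The outline below is hypothetical; it assumes PyTorch and the pyramid_pool helper from the previous sketch, and the encoders E_k and decoders D_k are passed in as callables with assumed interfaces (the patent uses ResNet34 encoders and pure convolutional decoders). The skip-connection schedule follows the description: no skip at the first refinement, f_64 at the second, f_64 and f_128 at the third.

# Hypothetical sketch of the cascade of steps S0041-S0043, assuming PyTorch.
import torch
import torch.nn.functional as F

def cascade_refine(i0, c0, encoders, decoders, crop_size=256):
    """
    i0: (B, 3, 256, 256) cropped image; c0: (B, 3, 64, 64) coarse correspondence.
    encoders[k](x) is assumed to return a dict {'32': f_32, '64': f_64, '128': f_128}.
    decoders[k](pooled_32, skips) is assumed to return the refined correspondence c_{k+1}.
    """
    correspondences = [c0]
    for k, (E_k, D_k) in enumerate(zip(encoders, decoders)):
        # Upsample every correspondence produced so far to the crop resolution.
        ups = [F.interpolate(c, size=(crop_size, crop_size), mode='bilinear',
                             align_corners=False) for c in correspondences]
        x = torch.cat([i0] + ups, dim=1)            # image + all previous correspondence maps
        feats = E_k(x)                              # multi-scale features f_k
        pooled = pyramid_pool(feats['32'])          # pooled 32x32 features (previous sketch)
        skips = [feats['64'], feats['128']][:k]     # stage 1: none, stage 2: f_64, stage 3: f_64 + f_128
        c_next = D_k(pooled, skips)
        correspondences.append(c_next)
    return correspondences                          # [c_0, c_1, c_2, c_3]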
Here, the ResNet34 encoding networks are encoding networks with the ResNet34 architecture.
In step S0044, the gap L between the reconstructed correspondence and the true correspondence is set as the reconstruction loss, which is the sum of the norms of the differences between the output c_i of each step and the true correspondence c.
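A minimal sketch of this reconstruction loss, assuming PyTorch, follows. The description only states a sum of norms of the masked differences; the choice of an L1 norm and the assumption that all correspondence maps are compared at a common resolution are the sketch's own.

# Hypothetical sketch of the reconstruction loss of step S0044, assuming PyTorch.
import torch

def reconstruction_loss(pred_corrs, gt_corr, vis_mask):
    """
    pred_corrs: list of (B, C, H, W) predicted correspondence maps [c_1, c_2, c_3, ...]
    gt_corr:    (B, C, H, W) ground-truth correspondence c (common resolution assumed)
    vis_mask:   (B, 1, H, W) mask M_vis of pixels visible in the target image block
    """
    loss = 0.0
    for c_i in pred_corrs:
        loss = loss + (vis_mask * (c_i - gt_corr)).abs().sum()  # element-wise mask, L1 norm assumed
    return loss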
S005, passing the fine correspondence obtained in step S004 through a pose regression network to obtain the pose of the target object.
In step S005, the pose regression network is a pose regression network with a convolution structure.
In step S005, the cascade-refined correspondence is input into a pose regression network with a convolutional structure, or solved with a PnP algorithm, to obtain the pose parameters P; when the pose parameters are obtained by regression, they are represented by a 9-dimensional vector, of which 3 dimensions represent the translation parameters t_site:
t_site = (dx, dy, t_z),
where dx and dy denote the offset from the center of the target detection box to the center of the object, and t_z denotes the zoomed depth;
according to the target detection result obtained in step S002, i.e. the picture i_0, the position (C_X, C_Y) of the target detection box in the original image and the target image block size (h, w) can be obtained, and the respective parameters of t_site are expressed as follows,
where (O_X, O_Y) and (C_X, C_Y) are the center of the object and the center of the object image block within the target image block, (H, W) is the image size of the scene where the target object is located, and r = max(W, H)/max(w, h) is the scaling ratio.
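The concrete formulas for dx, dy and t_z are given as equation images in the original and are not reproduced above. The sketch below therefore shows only one common scale-invariant parameterization consistent with the description (normalized offsets from the detection-box center, a depth rescaled by the ratio r, and a standard pinhole back-projection using camera intrinsics, which the patent text does not mention); it is an assumption, not the patent's exact formula.

# Hypothetical recovery of the 3D translation from (dx, dy, t_z); all normalizations are assumed.
def recover_translation(dx, dy, t_z_scaled, box_center, box_size, img_size, K):
    C_X, C_Y = box_center          # detection-box center in the original image
    w, h = box_size                # target image block size
    W, H = img_size                # scene image size
    r = max(W, H) / max(w, h)      # scaling ratio as described above
    O_X = C_X + dx * w             # object center in the original image (assumed normalization)
    O_Y = C_Y + dy * h
    t_z = t_z_scaled * r           # undo the depth scaling (direction of the scaling is assumed)
    fx, fy = K[0][0], K[1][1]      # camera intrinsics: focal lengths
    px, py = K[0][2], K[1][2]      # camera intrinsics: principal point
    t_x = (O_X - px) * t_z / fx    # back-project the recovered 2D object center to 3D
    t_y = (O_Y - py) * t_z / fy
    return t_x, t_y, t_z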
In step S005, the other 6 dimensions of the pose parameters P represent the rotation parameters; the first two columns of the rotation matrix R ∈ SO(3) are taken to represent the rotation, R_6d = [R_1 | R_2]; the predicted 6-dimensional vector is denoted [r_1 | r_2], and thus R = [R_1 | R_2 | R_3] can be calculated by the following formula,
where f denotes the vector normalization operation; the center offset, the predicted depth error and the average distance of the point cloud after the rotation transformation are used as supervisory signals to train the network, with the loss function expressed as follows,
where R denotes the rotation matrix, M denotes the mask of the target object in the picture i_0, and X denotes a point cloud representation of the object model, typically of shape N×3, where N is the number of points and 3 is the coordinates of a point in three-dimensional space.
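The rotation-recovery and loss formulas are likewise given as images in the original. The sketch below, assuming PyTorch, shows the usual Gram-Schmidt-style reading of the 6-dimensional representation (Zhou et al., "On the Continuity of Rotation Representations in Neural Networks"), in which f is vector normalization, together with a point-matching loss built from the three supervisory signals named above; the loss weights and the exact term definitions are assumptions.

# Hypothetical sketch of 6D rotation recovery and a point-matching pose loss, assuming PyTorch.
import torch
import torch.nn.functional as F

def rotation_from_6d(r6d):
    """r6d: (B, 6) predicted vector [r_1 | r_2] -> (B, 3, 3) rotation matrix."""
    r1, r2 = r6d[:, 0:3], r6d[:, 3:6]
    R1 = F.normalize(r1, dim=-1)                                   # f: vector normalization
    R2 = F.normalize(r2 - (R1 * r2).sum(-1, keepdim=True) * R1, dim=-1)
    R3 = torch.cross(R1, R2, dim=-1)
    return torch.stack([R1, R2, R3], dim=-1)                       # columns [R_1 | R_2 | R_3]

def pose_loss(R_pred, t_pred, R_gt, t_gt, X, w_rot=1.0, w_center=1.0, w_z=1.0):
    """X: (N, 3) object model point cloud; the three terms mirror the description:
    average distance of the rotated point cloud, center offset, and depth error."""
    pts_pred = X @ R_pred.transpose(-1, -2)                        # rotate the model points
    pts_gt = X @ R_gt.transpose(-1, -2)
    loss_rot = (pts_pred - pts_gt).norm(dim=-1).mean()             # average point distance
    loss_center = (t_pred[..., :2] - t_gt[..., :2]).abs().mean()   # center offset
    loss_z = (t_pred[..., 2] - t_gt[..., 2]).abs().mean()          # depth error
    return w_rot * loss_rot + w_center * loss_center + w_z * loss_z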
In this embodiment, the adopted correspondence-based approach learns the correspondence from RGB information mainly through an encoder-decoder framework built on convolutional neural networks. It establishes an end-to-end pose estimation network, makes effective use of the available RGB information, and improves the accuracy of correspondence generation by extracting pooled features and adopting a cascaded multi-step generation scheme, so that a high-quality correspondence is generated, the 6D pose estimation accuracy is markedly improved, and high-precision requirements are met.
Here, the convolutional layer is a common layer type in neural networks. A convolutional layer uses convolution operations to extract features from the input data, progressively abstracting and compressing the information so as to analyze and process data such as images effectively.
The following describes a simulation experiment on the effect achieved by this embodiment in practical application.
Simulation conditions:
the simulation experiment of this embodiment adopts a mainstream actual data set:
LINEMOD data set
The LINEMOD dataset is a dataset for 6D object pose estimation, proposed by Hinterstonisiser et al in 2012 at ACCV meeting. The dataset contained 15 home items with no texture or insignificant color, such as coffee cups, soy sauce bottles, cans, etc. Each item has its corresponding 3D model and rendered image, as well as real images taken from different angles. Each image is annotated with a 6D pose of the object including a rotation matrix and translation vector, a 2D bounding box and a 2D binary mask.
The simulation content:
the present embodiment uses a real dataset to verify the effect of the method of the present invention. In order to test the performance of the algorithm, the proposed monocular object grade and pose estimation method of the cascade refined corresponding graph is compared with the currently internationally popular visible light image object grade and pose estimation method algorithm. The comparative method includes the method described in the paper GDR-Net published by Gu Wang et al, international top-level visual conference CVPR 2021.
The evaluation indexes adopted by the invention are ADD indexes and translational rotation metrics, wherein ADD is a metric for evaluating the performance of a 6D object attitude estimation algorithm, and is proposed by Hiterstonisser et al in 2012 at ACCV meeting. The ADD index means Average Distance, i.e., average Distance. The calculation method is that the real object posture (rotation matrix and translation vector) and the predicted object posture are known, the 3D model point cloud of the object is respectively transformed under a camera coordinate system by the two postures, the Euclidean distance between each corresponding point is calculated, and the average value of all the point distances is calculated. If this average is less than some set threshold, typically 10% of the object diameter, denoted ADD0.1D, the predicted pose is considered correct. Another evaluation index is that the transformed relative translation and rotation are smaller than the specified values, and the predicted gesture is considered to be correct.
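For reference only, a short NumPy sketch of the ADD computation described above; the function name and return convention are assumptions.

# Hypothetical sketch of the ADD metric, assuming NumPy.
import numpy as np

def add_metric(R_gt, t_gt, R_pred, t_pred, model_points, diameter, threshold_ratio=0.1):
    """model_points: (N, 3) object model point cloud; diameter: object diameter."""
    pts_gt = model_points @ R_gt.T + t_gt            # transform model points with the real pose
    pts_pred = model_points @ R_pred.T + t_pred      # transform model points with the predicted pose
    add = np.linalg.norm(pts_gt - pts_pred, axis=1).mean()  # average point-wise distance
    return add, add < threshold_ratio * diameter     # ADD value and ADD0.1D-style correctness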
Simulation experiment result analysis:
evaluation index on MOD dataset
The comparison in the index value table 1 shows that the method has better index value, whether the neural network is adopted to return the pose parameter or the traditional PnP algorithm is adopted to solve the pose parameter, the method is obviously superior to other methods, namely the method can generate finer correspondence, and further the precision of pose estimation is improved. It can be seen intuitively in fig. 1 that the corresponding map generated by the method described in this embodiment 1 has higher accuracy.
The foregoing description of the embodiments is intended to illustrate the general principles of the invention and is not meant to limit the scope of the invention to the particular embodiments; any modifications, equivalents, improvements and the like made within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (10)

1. A monocular object-level pose estimation method based on cascade-refined correspondence maps, characterized by comprising the following steps:
S001, obtaining a visible light image of a target object scene;
S002, cropping the target image block from the visible light image of the target object scene to obtain a cropped picture i_0;
S003, scaling the picture i_0 to a resolution of 256×256, inputting it into a ResNet34 encoding network E_1, and inputting the resulting 8×8 feature map into a decoding network to obtain a coarse dense correspondence c_0 of size 64×64 between the visible light image of the target object scene and the target model;
S004, inputting the coarse dense correspondence c_0 obtained in step S003 into a cascade network and refining it to obtain a fine correspondence;
S005, passing the fine correspondence obtained in step S004 through a pose regression network to obtain the pose of the target object.
2. The monocular object-level pose estimation method based on cascade-refined correspondence maps according to claim 1, characterized in that: in step S001, the visible light image of the target object scene is obtained by a visible light camera.
3. The monocular object-level pose estimation method based on cascade-refined correspondence maps according to claim 1, characterized in that: in step S002, the target image block is cropped from the visible light image of the target object scene by a target detection algorithm.
4. The monocular object-level pose estimation method based on cascade-refined correspondence maps according to claim 1, characterized in that: in step S005, the pose regression network is a pose regression network with a convolutional structure.
5. The monocular object-level pose estimation method based on cascade-refined correspondence maps according to claim 1, characterized in that: in step S004, inputting the coarse dense correspondence c_0 obtained in step S003 into the cascade network and refining it to obtain the fine correspondence comprises the following steps:
step S0041, sampling the coarse dense correspondence c_0 obtained in step S003 to the same resolution as the cropped picture, denoted c_0_256, and inputting it together with the picture i_0 into a ResNet34 encoding network E_0 to obtain features f_0 at different scales, the computation step being expressed as:
{f_0_32, f_0_64, f_0_128} = E_0(i_0, c_0_256),
f_0 = {f_0_32, f_0_64, f_0_128},
f_0_32 is downsampled by pooling operations with different strides to resolutions 1, 2, 3 and 6 to obtain features at different scales, which are upsampled by bilinear interpolation to the common resolution 32×32 and concatenated to obtain a feature map f_0_32_1; the feature map f_0_32_1 is input to a decoding network D_1 to obtain the first refined correspondence c_1, c_1 being expressed as:
c_1 = D_1(pool(f_0_32))
step S0042, sampling the first refined correspondence c_1 obtained in step S0041 by bilinear interpolation to the same resolution as the cropped picture, denoted c_1_256, and inputting it together with i_0 and the coarse dense correspondence c_0_256 obtained in step S0041 into a ResNet34 encoding network E_2 to obtain features f_1 at different scales, the computation step being expressed as:
{f_1_32, f_1_64, f_1_128} = E_2(i_0, c_0_256, c_1_256),
f_1 = {f_1_32, f_1_64, f_1_128},
f_1_32 is downsampled by pooling operations with different strides to resolutions 1, 2, 3 and 6 to obtain features at different scales, which are upsampled by bilinear interpolation to the common resolution 32×32 and concatenated to obtain a feature map f_1_32_1; the feature map, together with f_1_64, is input to a decoding network D_2 of pure convolutional architecture to obtain the second refined correspondence c_2, the computation step being expressed as:
c_2 = D_2(pool(f_1_32), f_1_64)
step S0044, defining the gap L between the reconstructed correspondence and the true correspondence c as:
L = Σ_i ‖ M_vis ⊙ (c_i − c) ‖,
where M_vis denotes the pixels visible in the target image block during the reconstruction step, ⊙ denotes element-wise multiplication of matrices, c_i is the correspondence map obtained at the i-th refinement in step S004, and c_0 denotes the coarse map obtained in step S003.
6. The monocular object-level pose estimation method based on cascade-refined correspondence maps according to claim 5, characterized in that: in step S0044, the gap L between the reconstructed correspondence and the true correspondence is set as the reconstruction loss, which is the sum of the norms of the differences between the output c_i of each step and the true correspondence c.
7. The monocular object-level pose estimation method based on cascade-refined correspondence maps according to claim 5, characterized in that step S0042 is followed by:
step S0043, sampling the coarse dense correspondence c_0_256 obtained in step S003, the dense correspondence c_1 obtained in step S0041 and the dense correspondence c_2 obtained in step S0042 by bilinear interpolation to the same resolution as the cropped picture, and inputting them together with the picture i_0 into a ResNet34 encoding network E_3 to obtain features f_2 at different scales, the computation step being expressed as:
{f_2_32, f_2_64, f_2_128} = E_3(i_0, c_0_256, c_1_256, c_2_256),
f_2 = {f_2_32, f_2_64, f_2_128},
f_2_32 is downsampled by pooling operations with different strides to resolutions 1, 2, 3 and 6 to obtain features at different scales, which are upsampled by bilinear interpolation to the common resolution 32×32 and concatenated to obtain a feature map f_2_32_1; the feature map, together with f_2_64 and f_2_128, is input to a decoding network D_3 of pure convolutional architecture to obtain the third refined correspondence c_3, the computation step being expressed as:
c_3 = D_3(pool(f_2_32), f_2_64, f_2_128).
8. The monocular object-level pose estimation method based on cascade-refined correspondence maps according to claim 5, characterized in that: the ResNet34 encoding networks are encoding networks with the ResNet34 architecture.
9. The monocular object-level pose estimation method based on cascade-refined correspondence maps according to claim 1, characterized in that: in step S005, the cascade-refined correspondence is input into a pose regression network with a convolutional structure, or solved with a PnP algorithm, to obtain the pose parameters P; when the pose parameters are obtained by regression, they are represented by a 9-dimensional vector, of which 3 dimensions represent the translation parameters t_site:
t_site = (dx, dy, t_z),
where dx and dy denote the offset from the center of the target detection box to the center of the object, and t_z denotes the zoomed depth;
according to the target detection result obtained in step S002, i.e. the picture i_0, the position (C_X, C_Y) of the target detection box in the original image and the target image block size (h, w) can be obtained, and the respective parameters of t_site are expressed as follows,
where (O_X, O_Y) and (C_X, C_Y) are the center of the object and the center of the object image block within the target image block, (H, W) is the image size of the scene where the target object is located, and r = max(W, H)/max(w, h) is the scaling ratio.
10. The monocular object-level pose estimation method based on cascade-refined correspondence maps according to claim 9, characterized in that: in step S005, the other 6 dimensions of the pose parameters P represent the rotation parameters; the first two columns of the rotation matrix R ∈ SO(3) are taken to represent the rotation, R_6d = [R_1 | R_2]; the predicted 6-dimensional vector is denoted [r_1 | r_2], and thus R = [R_1 | R_2 | R_3] can be calculated by the following formula,
where f denotes the vector normalization operation; the center offset, the predicted depth error and the average distance of the point cloud after the rotation transformation are used as supervisory signals to train the network, with the loss function expressed as follows,
where R denotes the rotation matrix, M denotes the mask of the target object in the picture i_0, and X denotes a point cloud representation of the object model, typically of shape N×3, where N is the number of points and 3 is the coordinates of a point in three-dimensional space.
CN202311773270.1A 2023-12-21 2023-12-21 Monocular object-level pose estimation method based on cascade-refined correspondence maps Pending CN117745825A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311773270.1A CN117745825A (en) 2023-12-21 2023-12-21 Monocular object-level pose estimation method based on cascade-refined correspondence maps

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311773270.1A CN117745825A (en) 2023-12-21 2023-12-21 Monocular object-level pose estimation method based on cascade-refined correspondence maps

Publications (1)

Publication Number Publication Date
CN117745825A true CN117745825A (en) 2024-03-22

Family

ID=90254316

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311773270.1A Pending CN117745825A (en) 2023-12-21 2023-12-21 Monocular object grade pose estimation method for cascade refinement corresponding diagram

Country Status (1)

Country Link
CN (1) CN117745825A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination