CN115861418A - Single-view pose estimation method and system based on multi-modal input and attention mechanism

Single-view pose estimation method and system based on multi-modal input and attention mechanism

Info

Publication number
CN115861418A
Authority
CN
China
Prior art keywords
pose
network
module
prediction
equal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211380719.3A
Other languages
Chinese (zh)
Inventor
史金龙
张文睿
钱强
欧镇
白素琴
钱萍
田朝晖
邓权耀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University of Science and Technology
Original Assignee
Jiangsu University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University of Science and Technology filed Critical Jiangsu University of Science and Technology
Priority to CN202211380719.3A priority Critical patent/CN115861418A/en
Publication of CN115861418A publication Critical patent/CN115861418A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a single-view pose estimation method based on multi-modal input and an attention mechanism. A single-view pose estimation system comprising a prediction module and a pose regression module is constructed; it combines multi-modal input with attention-based feature enhancement to learn multiple intermediate representations of an object from a two-dimensional image and then regresses the object's 6D pose. The prediction module adopts ResNet-18 as its backbone network and adds a channel attention mechanism to estimate several intermediate representations of the object's 6D pose, including keypoints, edge vectors between the keypoints, and symmetry correspondences between pixels. The pose regression module regresses the object's 6D pose from these intermediate representations with the EPnP algorithm and singular value decomposition. The invention provides an accurate and convenient technique for quickly estimating the 6D pose of an object from a single view.

Description

Single-view pose estimation method and system based on multi-modal input and attention mechanism
Technical Field
The invention belongs to the technical field of estimating the 6D pose of objects from a single view, and relates to a single-view pose estimation method and system based on multi-modal input and an attention mechanism.
Background
In applications such as robotic grasping and augmented reality, estimating the 6D pose of an object from an RGB image is an important task. Although introducing depth images can significantly improve this task, depth images are not always readily available: most cell phones, tablet computers, and industrial cameras do not provide depth data. Much research is therefore devoted to estimating the 6D pose of a known object using only RGB images. Traditional methods match RGB image features against a 3D model of the object; they rely on hand-crafted features and lack robustness to illumination changes, background clutter, and low-texture objects. The development of deep learning has accelerated research on estimating an object's 6D pose from RGB images. A currently popular keypoint approach trains a model with keypoints as intermediate supervision signals and then combines the 2D keypoints predicted by a neural network with a PnP algorithm to estimate the object's 6D pose, for example PVNet [Peng S, et al. PVNet: Pixel-wise voting network for 6DoF pose estimation. IEEE/CVF CVPR 2019] and the like.
The performance of the keypoint approach, however, relies on two assumptions: 1) the deep learning model can accurately predict the positions of the two-dimensional keypoints; 2) the predicted two-dimensional keypoints provide sufficient constraints to regress the object's 6D pose. In practice, keypoint predictions become inaccurate under partial occlusion and similar factors, so these two assumptions often fail in real-world environments.
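For illustration, a minimal sketch of the keypoint-plus-PnP baseline described above, using OpenCV's solvePnP; the predictor producing the 2D keypoints, the model keypoints, and the camera matrix are placeholders assumed by this sketch, not part of the patent:

```python
# Sketch of the keypoint + PnP baseline (assumed interfaces; illustrative only).
import cv2
import numpy as np

def pose_from_keypoints(kpts_2d, kpts_3d, K):
    """Recover (R, t) from predicted 2D keypoints and known 3D model keypoints.

    kpts_2d: (N, 2) keypoints predicted by a network such as PVNet.
    kpts_3d: (N, 3) corresponding model keypoints in the object frame.
    K:       (3, 3) camera intrinsic matrix.
    """
    ok, rvec, tvec = cv2.solvePnP(
        kpts_3d.astype(np.float64),
        kpts_2d.astype(np.float64),
        K.astype(np.float64),
        None,                       # no distortion coefficients
        flags=cv2.SOLVEPNP_EPNP,    # EPnP, the solver also used later in this patent
    )
    assert ok, "PnP failed"
    R, _ = cv2.Rodrigues(rvec)      # axis-angle -> rotation matrix
    return R, tvec.reshape(3)
```

When occlusion corrupts even a few of the N keypoints, this single-representation pipeline degrades quickly, which motivates the additional constraints introduced below.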
Disclosure of Invention
The invention aims to overcome the defects of existing single-view 6D pose estimation methods and provides a single-view pose estimation method and system based on multi-modal input and an attention mechanism.
In order to solve the above technical problems, the invention adopts the following technical scheme.
The single-view pose estimation method based on multi-modal input and an attention mechanism constructs a single-view pose estimation system comprising a prediction module and a pose regression module, combines multi-modal input with attention-based feature enhancement to learn multiple intermediate representations of an object from a two-dimensional image, and then regresses the object's 6D pose. It comprises the following steps:
Step 1: the prediction module expresses the geometric information in the RGB image with multiple intermediate representations and introduces an attention mechanism to improve network training efficiency. The intermediate representations comprise the keypoints $\kappa$, the edge vectors $\varepsilon$, and the dense pixel-wise symmetry correspondences $S$ of the RGB image.

The prediction module comprises a first prediction network $F_\kappa$, a second prediction network $F_\varepsilon$, and a third prediction network $F_S$; a channel attention module is embedded between the down-sampling modules of each of the three prediction networks. $F_\kappa$ uses PVNet as its backbone; it is a keypoint-based pose estimation network that predicts the k keypoints, visible and invisible, with a voting method. The prediction networks $F_\varepsilon$ and $F_S$ are used to refine the object pose.

Step 2: the pose regression module takes the intermediate representations obtained by the prediction module, combines the keypoint, edge-vector, and dense pixel-wise correspondence information, and regresses the object's 6D pose from them through EPnP computation and singular value decomposition.

The pose regression module takes the intermediate representations $\{\kappa, \varepsilon, S\}$ predicted by the networks of the prediction module as input and outputs the 6D pose of object $I$: $R_I \in SO(3)$, $t_I \in \mathbb{R}^3$.
Specifically, step 1 comprises:

A fully connected graph $\varepsilon$ is constructed with the keypoints as nodes. $F_\varepsilon$ is the network that predicts the edge vectors along the edges of this graph, and adopts ResNet-18 as its backbone. $\varepsilon$ explicitly expresses the displacement between each pair of keypoints, and $|\varepsilon|$ denotes the number of edges in the predefined graph; since the graph is fully connected, $|\varepsilon| = |\kappa|(|\kappa|-1)/2$.

$F_S$ predicts a third intermediate representation $S$ that records the pixel symmetry correspondences and reflects the latent reflection symmetry of the object. $F_S$ extends the FlowNet architecture, fusing the dense pixel flow predicted by FlowNet with the mask map predicted by PVNet, and predicts the symmetry correspondence of every pixel inside the mask region.

The loss $l_1$ of $F_\kappa$, the loss $l_2$ of $F_\varepsilon$, and the loss $l_3$ of $F_S$ are all trained with the smooth-$L_1$ loss of Fast R-CNN. To reflect how strongly the different intermediate representations affect the pose estimation network, the losses of the three intermediate representations are weighted by parameters x, y, and z respectively, where x + y + z = 1, so the total loss is:

L = x·l_1 + y·l_2 + z·l_3    (1)
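A minimal PyTorch sketch of the weighted total loss in equation (1); the tensor names and the particular weight values are illustrative assumptions, not values fixed by the patent:

```python
# Sketch of the total loss L = x*l1 + y*l2 + z*l3 (equation (1)).
# x, y, z are assumed hyperparameters satisfying x + y + z = 1.
import torch
import torch.nn.functional as F

def total_loss(pred_kpts, gt_kpts, pred_edges, gt_edges, pred_sym, gt_sym,
               x=0.4, y=0.3, z=0.3):
    # Each intermediate representation is trained with the smooth-L1
    # (Huber) loss of Fast R-CNN.
    l1 = F.smooth_l1_loss(pred_kpts, gt_kpts)    # keypoint loss
    l2 = F.smooth_l1_loss(pred_edges, gt_edges)  # edge-vector loss
    l3 = F.smooth_l1_loss(pred_sym, gt_sym)      # symmetry-correspondence loss
    return x * l1 + y * l2 + z * l3
```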
Channel attention module: the output $F \in \mathbb{R}^{C\times H\times W}$ of each residual block of $F_\kappa$ and $F_\varepsilon$ is taken as the input of the attention module. An average pooling operation is performed, and the channel attention module then generates channel weights of dimension $\mathbb{R}^{C\times 1\times 1}$ by a one-dimensional convolution with kernel size n = 5. The weights pass through an activation function and are restored to the input dimensions to obtain the result F′, which serves as the input of the next residual block.
Specifically, step 2 comprises:

The ground-truth three-dimensional keypoint coordinates in the canonical coordinate system are written as $P_k \in \mathbb{R}^3$, $1 \le k \le |\kappa|$, and the ground-truth edge vectors as $V_e$, $1 \le e \le |\varepsilon|$. The keypoint coordinates output by the prediction module are written as $p_k$, $1 \le k \le |\kappa|$, the edge vectors as $\nu_e$, $1 \le e \le |\varepsilon|$, and the symmetry correspondences as $(q_{s,1}, q_{s,2})$, $1 \le s \le |S|$. For ease of calculation, homogeneous coordinates $\bar{p}_k$, $\bar{\nu}_e$, $\bar{q}_{s,1}$, and $\bar{q}_{s,2}$ corresponding to $p_k$, $\nu_e$, $q_{s,1}$, and $q_{s,2}$ are used; these homogeneous coordinates are normalized by the known camera parameters.
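As a concrete reading of the normalization step, a pixel coordinate can be lifted to a homogeneous vector and normalized by the camera intrinsics; a small sketch, assuming the intrinsic matrix K is known:

```python
import numpy as np

def normalize_homogeneous(pts_2d, K):
    """Lift (N, 2) pixel coordinates to normalized homogeneous coordinates K^{-1} [u, v, 1]^T."""
    n = pts_2d.shape[0]
    pts_h = np.concatenate([pts_2d, np.ones((n, 1))], axis=1)  # (N, 3) homogeneous
    return (np.linalg.inv(K) @ pts_h.T).T                      # normalized by intrinsics
```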
The 6D pose of the object is computed with the EPnP algorithm combined with the constraints of the intermediate representations. First, the following difference vectors are introduced for the three prediction elements:

$r_k^{\kappa} = \bar{p}_k \times (R P_k + t), \quad 1 \le k \le |\kappa|$    (2)

$r_e^{\varepsilon} = \bar{\nu}_e \times R\,(P_{e_s} - P_{e_t}), \quad 1 \le e \le |\varepsilon|$    (3)

$r_s^{S} = (\bar{q}_{s,1} \times \bar{q}_{s,2})^{\top} R\, n, \quad 1 \le s \le |S|$    (4)

where $e_s$ and $e_t$ are the endpoints of edge e, and $n$ is the normal of the reflection symmetry plane in the canonical coordinate system.

Next, equation (2) is rewritten in the forms $A_1 x$, $A_2 x$, $A_3 x$; similarly, equation (3) is rewritten as $A_4 x$, $A_5 x$, $A_6 x$, and equation (4) as $A_7 x$. $A_1, \dots, A_7$ are stacked to form A. To describe the relationship between the predicted and ground-truth values, a linear system of the form Ax = 0 is introduced, where A is a matrix of dimension $(3|\kappa| + 3|\varepsilon| + |S|) \times 12$, and x is a vector containing the parameters of the rotation matrix R and the translation vector t in affine space.

Next, x is computed using the EPnP algorithm:

$x = \sum_{i=1}^{N} \lambda_i v_i$    (5)

where $v_i$ is the right singular vector corresponding to the i-th smallest singular value of A. Ideally, when the prediction elements are noise-free, N = 1 and $x = v_1$ is the optimal solution; the same N = 4 as in EPnP is selected. To compute the optimal x, the hidden variables $\lambda_i$ and the rotation matrix R are optimized in an alternating optimization procedure with the following objective function:

$\min_{\{\lambda_i\}, R} \Big\| \sum_{i=1}^{N} \lambda_i \bar{v}_i - \mathrm{vec}(R) \Big\|^2$    (6)

where $\bar{v}_i$ contains the first 9 elements of $v_i$. After the optimal $\lambda_i$ are obtained, SVD is applied to project $\sum_i \lambda_i \bar{v}_i$ (reshaped to $3\times 3$) onto SO(3), giving the rotation matrix $R = U\,\mathrm{diag}(1, 1, \det(UV^{\top}))\,V^{\top}$. Finally, the corresponding translation vector t is obtained from Ax = 0:

$t = -\big(A_2^{\top} A_2\big)^{-1} A_2^{\top} A_1\, \tilde{r}$    (7)

where $A_1 = A_{[:,1:9]}$, $A_2 = A_{[:,10:12]}$, and $\tilde{r} \in \mathbb{R}^9$ is obtained by flattening R.
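To make the construction of the linear system Ax = 0 concrete, the sketch below stacks the keypoint rows of equation (2) into A; the edge and symmetry rows of equations (3)-(4) follow the same pattern. The cross-product-as-matrix trick and the row-major ordering of vec(R) are assumptions of this sketch:

```python
import numpy as np

def skew(v):
    """Cross-product matrix: skew(v) @ u == np.cross(v, u)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def keypoint_rows(p_bar, P):
    """3 rows of A from one keypoint constraint  p_bar x (R P + t) = 0,
    with x = (r, t) and r the row-major flattening of R."""
    S = skew(p_bar)
    block_R = S @ np.kron(np.eye(3), P.reshape(1, 3))  # 3x9 block acting on vec(R)
    return np.hstack([block_R, S])                     # 3x12 row block of A

def build_A(kpts_h, kpts_3d):
    """Stack one 3x12 block per keypoint; edge/symmetry blocks would be appended similarly."""
    return np.vstack([keypoint_rows(p, P) for p, P in zip(kpts_h, kpts_3d)])
```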
The single-view pose estimation system based on multi-modal input and an attention mechanism according to the invention comprises:

a prediction module, for expressing the geometric information in the RGB image with multiple intermediate representations and introducing an attention mechanism to improve network training efficiency; it comprises a first prediction network $F_\kappa$, a second prediction network $F_\varepsilon$, and a third prediction network $F_S$, with a channel attention module embedded between the down-sampling modules of each of the three prediction networks; $F_\kappa$ uses PVNet as its backbone and is a keypoint-based pose estimation network that predicts the k keypoints, visible and invisible, with a voting method; the prediction networks $F_\varepsilon$ and $F_S$ are used to refine the object pose;

a pose regression module, for taking the intermediate representations obtained by the prediction module, combining the keypoint, edge-vector, and dense pixel-wise correspondence information, and regressing the object's 6D pose from them through EPnP computation and singular value decomposition; the pose regression module takes the intermediate representations $\{\kappa, \varepsilon, S\}$ predicted by the networks of the prediction module as input and outputs the 6D pose of object $I$: $R_I \in SO(3)$, $t_I \in \mathbb{R}^3$.
Further, in the prediction module:

$F_\kappa$ uses PVNet as its backbone and is a keypoint-based pose estimation network. A fully connected graph is constructed with the keypoints as nodes, and $F_\varepsilon$ is the network that predicts the edge vectors along the edges of this graph, adopting ResNet-18 as its backbone. $\varepsilon$ explicitly expresses the displacement between each pair of keypoints, and $|\varepsilon|$ denotes the number of edges in the predefined graph; since $\varepsilon$ is a fully connected graph, $|\varepsilon| = |\kappa|(|\kappa|-1)/2$.

$F_S$ predicts a third intermediate representation $S$ that records the pixel symmetry correspondences and reflects the latent reflection symmetry of the object. $F_S$ extends the FlowNet architecture, fusing the dense pixel flow predicted by FlowNet with the mask map predicted by PVNet, and predicts the symmetry correspondence of every pixel inside the mask region.

The loss $l_1$ of $F_\kappa$, the loss $l_2$ of $F_\varepsilon$, and the loss $l_3$ of $F_S$ are all trained with the smooth-$L_1$ loss of Fast R-CNN. To reflect how strongly the different intermediate representations affect the pose estimation network, the losses of the three intermediate representations are weighted by parameters x, y, and z respectively, where x + y + z = 1; the total loss is therefore:

L = x·l_1 + y·l_2 + z·l_3    (1)

Channel attention module: the output $F \in \mathbb{R}^{C\times H\times W}$ of each residual block of $F_\kappa$ and $F_\varepsilon$ is taken as the input of the attention module. An average pooling operation is performed, and the channel attention module then generates channel weights of dimension $\mathbb{R}^{C\times 1\times 1}$ by a one-dimensional convolution with kernel size n = 5. The weights pass through an activation function and are restored to the input dimensions to obtain the result F′, which serves as the input of the next residual block.
Further, in the pose regression module:

The ground-truth three-dimensional keypoint coordinates in the canonical coordinate system are written as $P_k \in \mathbb{R}^3$, $1 \le k \le |\kappa|$, and the ground-truth edge vectors as $V_e$, $1 \le e \le |\varepsilon|$. The keypoint coordinates output by the prediction module are written as $p_k$, $1 \le k \le |\kappa|$, the edge vectors as $\nu_e$, $1 \le e \le |\varepsilon|$, and the symmetry correspondences as $(q_{s,1}, q_{s,2})$, $1 \le s \le |S|$. For ease of calculation, homogeneous coordinates $\bar{p}_k$, $\bar{\nu}_e$, $\bar{q}_{s,1}$, and $\bar{q}_{s,2}$ corresponding to $p_k$, $\nu_e$, $q_{s,1}$, and $q_{s,2}$ are used; these homogeneous coordinates are normalized by the known camera parameters.

The 6D pose of the object is computed with the EPnP algorithm combined with the constraints of the intermediate representations. First, the following difference vectors are introduced for the three prediction elements:

$r_k^{\kappa} = \bar{p}_k \times (R P_k + t), \quad 1 \le k \le |\kappa|$    (2)

$r_e^{\varepsilon} = \bar{\nu}_e \times R\,(P_{e_s} - P_{e_t}), \quad 1 \le e \le |\varepsilon|$    (3)

$r_s^{S} = (\bar{q}_{s,1} \times \bar{q}_{s,2})^{\top} R\, n, \quad 1 \le s \le |S|$    (4)

where $e_s$ and $e_t$ are the endpoints of edge e, and $n$ is the normal of the reflection symmetry plane in the canonical coordinate system.

Next, equation (2) is rewritten in the forms $A_1 x$, $A_2 x$, $A_3 x$; similarly, equation (3) is rewritten as $A_4 x$, $A_5 x$, $A_6 x$, and equation (4) as $A_7 x$. $A_1, \dots, A_7$ are stacked to form A. To describe the relationship between the predicted and ground-truth values, a linear system of the form Ax = 0 is introduced, where A is a matrix of dimension $(3|\kappa| + 3|\varepsilon| + |S|) \times 12$, and x is a vector containing the parameters of the rotation matrix R and the translation vector t in affine space.

Next, x is computed using the EPnP algorithm:

$x = \sum_{i=1}^{N} \lambda_i v_i$    (5)

where $v_i$ is the right singular vector corresponding to the i-th smallest singular value of A. Ideally, when the prediction elements are noise-free, N = 1 and $x = v_1$ is the optimal solution; the same N = 4 as in EPnP is selected. To compute the optimal x, the hidden variables $\lambda_i$ and the rotation matrix R are optimized in an alternating optimization procedure with the following objective function:

$\min_{\{\lambda_i\}, R} \Big\| \sum_{i=1}^{N} \lambda_i \bar{v}_i - \mathrm{vec}(R) \Big\|^2$    (6)

where $\bar{v}_i$ contains the first 9 elements of $v_i$. After the optimal $\lambda_i$ are obtained, SVD is applied to project $\sum_i \lambda_i \bar{v}_i$ (reshaped to $3\times 3$) onto SO(3), giving the rotation matrix $R = U\,\mathrm{diag}(1, 1, \det(UV^{\top}))\,V^{\top}$. Finally, the corresponding translation vector t is obtained from Ax = 0:

$t = -\big(A_2^{\top} A_2\big)^{-1} A_2^{\top} A_1\, \tilde{r}$    (7)

where $A_1 = A_{[:,1:9]}$, $A_2 = A_{[:,10:12]}$, and $\tilde{r} \in \mathbb{R}^9$ is obtained by flattening R.
Compared with the prior art, the invention has the following advantages and beneficial effects:

1. The invention provides a new 6D pose estimation method centered on a channel-attention-based 6D pose estimation network. Convolutional neural network (CNN) models have certain limitations in predicting the 6D pose, owing to the size of their convolution kernels and the pooling of features. When the target region occupies only a small fraction of the network's input image, the background information generates a large amount of noise over multiple iterated convolutions and interferes with feature extraction in the target region. To overcome this limitation of convolutional networks and improve training efficiency and accuracy, the invention introduces a channel attention module, which better extracts detailed features in the image and yields more accurate pose estimates.

2. The method uses multiple intermediate representations — keypoints, edge vectors between the keypoints, and dense pixel-wise symmetry correspondences — to express different geometric information in the input image, maximizing the accuracy of the resulting 6D pose estimates. Adding symmetry-correspondence constraints on top of the keypoints improves the model's performance in estimating the rotation matrix; adding edge vectors to the keypoints and symmetry correspondences provides more constraints on translation and rotation. Edge vectors provide more translational constraints than keypoints, since they represent the displacements between neighboring keypoints and provide gradient information for the regression. Unlike the symmetry correspondences, the edge vectors constrain 3 degrees of freedom of the rotation parameters, which further improves rotation estimation.

3. Even in complex environments and under occlusion, the method predicts the object's 6D pose quite accurately. The 6D pose estimation network of the invention is evaluated on two popular benchmark datasets, Linemod and Occlusion Linemod, and compared with current state-of-the-art 6D pose estimation algorithms (such as Pix2Pose, PVNet, PoseCNN, and CDPN). The average ADD(-S) accuracy of the invention is 92.8 on the Linemod dataset and 48.0 on the Occlusion Linemod dataset. On the occlusion problem, the method improves on Pix2Pose by 16%, clearly demonstrating its advantage in predicting the 6D pose of occluded objects.
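For reference, the ADD and ADD-S metrics quoted above are commonly computed as in the sketch below; the 10%-of-diameter correctness threshold is the convention used by these benchmarks, and the variable names are illustrative:

```python
import numpy as np

def add_metric(R_gt, t_gt, R_pred, t_pred, model_pts):
    """ADD: mean distance between model points under the GT and predicted poses."""
    gt = model_pts @ R_gt.T + t_gt
    pred = model_pts @ R_pred.T + t_pred
    return np.mean(np.linalg.norm(gt - pred, axis=1))

def add_s_metric(R_gt, t_gt, R_pred, t_pred, model_pts):
    """ADD-S: for symmetric objects, mean distance to the closest predicted point."""
    gt = model_pts @ R_gt.T + t_gt
    pred = model_pts @ R_pred.T + t_pred
    d = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)  # pairwise distances
    return np.mean(d.min(axis=1))

def is_correct(add_value, diameter, ratio=0.1):
    """A pose is counted correct if ADD(-S) is below 10% of the object diameter."""
    return add_value < ratio * diameter
```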
Drawings
FIG. 1 is a system block diagram of one embodiment of the present invention.
FIG. 2 is a channel attention module diagram of one embodiment of the present invention.
FIG. 3 is a diagram of the intermediate representation results and pose estimation results of different objects in the experiment of the present invention.
FIG. 4 is a diagram of pose estimation results under occlusion of different proportions in the experiment of the invention.
FIG. 5 is a comparison of network training loss before and after the attention module is added to the model in the experiments of the invention.
FIG. 6 is a comparison of network training loss before and after the attention module is added to the cat model in the experiments of the invention.
Detailed Description
The invention discloses a single-view pose estimation method and system based on multi-modal input and an attention mechanism. The constructed system comprises a prediction module and a pose regression module and combines multi-modal input with attention-based feature enhancement to learn multiple intermediate representations of an object from a two-dimensional image and then regress the object's 6D pose. The prediction module adopts ResNet-18 as its backbone network and adds a channel attention mechanism to estimate several intermediate representations of the object's 6D pose, including keypoints, edge vectors between the keypoints, and symmetry correspondences between pixels; the pose regression module regresses the object's 6D pose from these intermediate representations with the EPnP algorithm and singular value decomposition. The invention provides an accurate and convenient technique for quickly estimating the 6D pose of an object from a single view.
The present invention will be described in further detail with reference to the accompanying drawings.
Fig. 1 is a system block diagram of an embodiment of the invention and also embodies its technical principle: estimating the 6D pose of an object from a single RGB image through the following steps:

Step 1: the prediction module expresses the geometric information in the RGB image with multiple intermediate representations and introduces an attention mechanism to improve network training efficiency;

Step 2: the pose regression module takes the intermediate representations obtained by the prediction module, combines the keypoint, edge-vector, and dense pixel-wise correspondence information, and regresses the object's 6D pose from them through EPnP computation and singular value decomposition.

The prediction module comprises three prediction networks — a first prediction network $F_\kappa$, a second prediction network $F_\varepsilon$, and a third prediction network $F_S$ — together with the fully connected graph built over the keypoints; a channel attention module is embedded between the down-sampling modules of each of the three prediction networks. $F_\kappa$ uses PVNet as its backbone; it is a keypoint-based pose estimation network that predicts the k keypoints, visible and invisible, with a voting method. To address the large pose errors that arise when the keypoint predictions are inaccurate, the other two prediction networks, $F_\varepsilon$ and $F_S$, are introduced to refine the object pose.
The invention constructs a fully connected graph with the keypoints as nodes. $F_\varepsilon$ is the network that predicts the vectors along the edges of this graph, adopting ResNet-18 as its backbone. $\varepsilon$ explicitly expresses the displacement between each pair of keypoints, and $|\varepsilon|$ denotes the number of edges in the predefined graph; since $\varepsilon$ is a fully connected graph in the invention, $|\varepsilon| = |\kappa|(|\kappa|-1)/2$.

$F_S$ predicts a third intermediate representation $S$ that records the pixel symmetry correspondences and reflects the latent reflection symmetry of the object. $F_S$ extends the FlowNet architecture, fusing the dense pixel flow predicted by FlowNet with the mask map predicted by PVNet, and predicts the symmetry correspondence of every pixel inside the mask region. Compared with the first two intermediate representations, the symmetry correspondences carry a larger amount of data and constrain the pose estimate more strongly; in particular, they provide rich constraints for occluded objects and thus handle the occlusion problem better.
In the invention, the loss $l_1$ of $F_\kappa$, the loss $l_2$ of $F_\varepsilon$, and the loss $l_3$ of $F_S$ are all trained with the smooth-$L_1$ loss of Fast R-CNN. To reflect how strongly the different intermediate representations affect the pose estimation network, the losses of the three intermediate representations are weighted by parameters x, y, and z, where x + y + z = 1. The total loss is therefore:

L = x·l_1 + y·l_2 + z·l_3    (1)
As shown in Fig. 2, the channel attention module takes the output $F \in \mathbb{R}^{C\times H\times W}$ of each residual block of $F_\kappa$ and $F_\varepsilon$ as its input, performs an average pooling operation, and generates channel weights of dimension $\mathbb{R}^{C\times 1\times 1}$ by a one-dimensional convolution with kernel size n = 5. The weights pass through an activation function and are restored to the input dimensions to obtain the result F′, which serves as the input of the next residual block.

Pose regression module: it takes the intermediate representations $\{\kappa, \varepsilon, S\}$ predicted by the networks as input and outputs the 6D pose of object $I$ ($R_I \in SO(3)$, $t_I \in \mathbb{R}^3$).
The invention writes the ground-truth three-dimensional keypoint coordinates in the canonical coordinate system as $P_k \in \mathbb{R}^3$, $1 \le k \le |\kappa|$, and the ground-truth edge vectors as $V_e$, $1 \le e \le |\varepsilon|$. The keypoint coordinates output by the prediction module are written as $p_k$, $1 \le k \le |\kappa|$, the edge vectors as $\nu_e$, $1 \le e \le |\varepsilon|$, and the symmetry correspondences as $(q_{s,1}, q_{s,2})$, $1 \le s \le |S|$. For ease of calculation, the invention uses homogeneous coordinates $\bar{p}_k$, $\bar{\nu}_e$, $\bar{q}_{s,1}$, and $\bar{q}_{s,2}$ corresponding to $p_k$, $\nu_e$, $q_{s,1}$, and $q_{s,2}$; these homogeneous coordinates are normalized by the known camera parameters.
The invention computes the object's 6D pose using the EPnP algorithm combined with the intermediate representation constraints. First, the following difference vectors are introduced for the three prediction elements:

$r_k^{\kappa} = \bar{p}_k \times (R P_k + t), \quad 1 \le k \le |\kappa|$    (2)

$r_e^{\varepsilon} = \bar{\nu}_e \times R\,(P_{e_s} - P_{e_t}), \quad 1 \le e \le |\varepsilon|$    (3)

$r_s^{S} = (\bar{q}_{s,1} \times \bar{q}_{s,2})^{\top} R\, n, \quad 1 \le s \le |S|$    (4)

where $e_s$ and $e_t$ are the endpoints of edge e, and $n$ is the normal of the reflection symmetry plane in the canonical coordinate system.

Next, equation (2) is rewritten in the forms $A_1 x$, $A_2 x$, $A_3 x$; similarly, equation (3) is rewritten as $A_4 x$, $A_5 x$, $A_6 x$, and equation (4) as $A_7 x$. $A_1, \dots, A_7$ are stacked to form A. To describe the relationship between the predicted and ground-truth values, the invention introduces a linear system of the form Ax = 0, where A is a matrix with dimensions $(3|\kappa| + 3|\varepsilon| + |S|) \times 12$, and x is a vector containing the rotation matrix R and translation vector t parameters in affine space.
Next, x is computed using the EPnP algorithm:

$x = \sum_{i=1}^{N} \lambda_i v_i$    (5)

where $v_i$ is the right singular vector corresponding to the i-th smallest singular value of A. Ideally, when the prediction elements are noise-free, N = 1 and $x = v_1$ is the optimal solution. In real-world situations, however, that choice performs poorly, so the invention selects the same N = 4 as EPnP. To compute the optimal x, the invention optimizes the hidden variables $\lambda_i$ and the rotation matrix R in an alternating optimization procedure with the following objective function:

$\min_{\{\lambda_i\}, R} \Big\| \sum_{i=1}^{N} \lambda_i \bar{v}_i - \mathrm{vec}(R) \Big\|^2$    (6)

where $\bar{v}_i$ contains the first 9 elements of $v_i$. After the optimal $\lambda_i$ are obtained, SVD is applied to project $\sum_i \lambda_i \bar{v}_i$ (reshaped to $3\times 3$) onto SO(3), giving the rotation matrix $R = U\,\mathrm{diag}(1, 1, \det(UV^{\top}))\,V^{\top}$. Finally, the corresponding translation vector t is obtained from Ax = 0:

$t = -\big(A_2^{\top} A_2\big)^{-1} A_2^{\top} A_1\, \tilde{r}$    (7)

where $A_1 = A_{[:,1:9]}$, $A_2 = A_{[:,10:12]}$, and $\tilde{r} \in \mathbb{R}^9$ is obtained by flattening R.
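A sketch of the regression step in equations (5)-(7): take the N = 4 right singular vectors of A with the smallest singular values, alternately solve for the coefficients λ_i and the rotation R, then recover t from Ax = 0. The alternation schedule, the initialization, the det-sign correction in the SO(3) projection, and the row-major vec(R) ordering are assumptions of this sketch:

```python
import numpy as np

def regress_pose(A, n_bases=4, n_iters=10):
    """Recover (R, t) from the linear system A x = 0 (equations (5)-(7))."""
    _, _, Vt = np.linalg.svd(A)
    V = Vt[-n_bases:][::-1]        # right singular vectors of the N smallest singular values
    Vr = V[:, :9]                  # first 9 elements of each v_i (rotation part), shape (N, 9)

    lam = np.zeros(n_bases)
    lam[0] = 1.0                   # initialize with the smallest singular vector (the noise-free solution)
    for _ in range(n_iters):
        # Fix lambda: project sum_i lambda_i * v_i onto SO(3) via SVD.
        M = (lam @ Vr).reshape(3, 3)
        U, _, Wt = np.linalg.svd(M)
        R = U @ np.diag([1.0, 1.0, np.linalg.det(U @ Wt)]) @ Wt
        # Fix R: solve the least-squares problem of equation (6) for lambda.
        lam, *_ = np.linalg.lstsq(Vr.T, R.reshape(9), rcond=None)

    # Equation (7): t = -(A2^T A2)^{-1} A2^T A1 r, with A1 = A[:, :9], A2 = A[:, 9:].
    A1, A2 = A[:, :9], A[:, 9:]
    t = -np.linalg.solve(A2.T @ A2, A2.T @ A1 @ R.reshape(9))
    return R, t
```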
Experimental verification of the invention shows the following.

Fig. 3 shows the three intermediate representation results of the pose estimation and the 6D pose estimation results for single target objects: (a) shows the input image; (b), (c), and (d) show the predicted keypoints, the edge vectors, and the corresponding symmetry maps respectively; (e) shows the 6D pose computed by regression, where the green box marks the ground-truth 6D pose of the object and the blue box marks the 6D pose predicted by the invention. Fig. 4 shows the 6D pose estimation results under occlusions of different proportions. The method predicts the object's 6D pose accurately even in complex environments and under occlusion. The invention evaluates its 6D pose estimation network on two popular benchmark datasets, Linemod and Occlusion Linemod, and compares it with current state-of-the-art 6D pose estimation algorithms (such as Pix2Pose, PVNet, PoseCNN, and CDPN). As shown in Tables 1 and 2, the average ADD(-S) accuracy of the invention is 92.8 on the Linemod dataset and 48.0 on the Occlusion Linemod dataset.

As can be seen from Figs. 3 and 4 and Tables 1 and 2, the invention achieves accurate pose estimation, and its results on Linemod and Occlusion Linemod surpass most state-of-the-art methods. The method outperforms its keypoint-predicting backbone model PVNet, with improved performance on every object class, showing that using multiple intermediate representations is clearly superior to using keypoints alone. On the occlusion problem, the method improves on Pix2Pose by 16%, clearly demonstrating the advantage of multiple intermediate representations in predicting the 6D pose of occluded objects.
TABLE 1. ADD(-S) performance comparison on the Linemod dataset (table reproduced as an image in the original publication).

TABLE 2. ADD(-S) performance comparison on the Occlusion Linemod dataset (table reproduced as an image in the original publication).
To verify the performance of the attention module, the invention tests the training behavior before and after the attention module is added, on the same dataset; Fig. 5 shows the overall loss curves before and after the improvement.

As can be seen from Fig. 5, adding the attention module to the backbone network improves the final training result of the object pose estimation: during training the loss converges faster and reaches a lower value. The experimental results show that with the improved ResNet-18 as the base neural network, the ADD(-S) score improves by 3.1% on average on the Linemod dataset and by 1.4% on average on the Occlusion Linemod dataset, allowing the object's 6D pose to be estimated more accurately.

In summary, the invention provides a single-view pose estimation method and system based on multi-modal input and an attention mechanism, where the method comprises two modules. The first module expresses the geometric information in the RGB image with multiple intermediate representations and introduces an attention mechanism to improve network training efficiency. Besides predicting keypoints, this module predicts the edge vectors between neighboring keypoints; in addition, the invention predicts dense pixel-wise correspondences to capture the underlying symmetry between pixels. Compared with methods that use only a keypoint representation, a pose estimation method that fuses multiple intermediate representations provides more constraints and achieves accurate pose prediction under occlusion, shadow, and similar conditions. The second module takes the intermediate representations obtained by the first module, combines the keypoint, edge-vector, and dense pixel-wise correspondence information, and regresses the object's 6D pose from them through EPnP computation and singular value decomposition. The technique is applicable to many fields, including integrated ship support, automatic sorting, autonomous driving, medical treatment, virtual reality, augmented reality, and industrial manufacturing; it can accurately acquire an object's 6D pose from an image, provides technical support for the intelligentization and informatization of industrial manufacturing, and has broad market prospects.

Claims (6)

1. A single-view pose estimation method based on multi-modal input and an attention mechanism, characterized in that a single-view pose estimation system comprising a prediction module and a pose regression module is constructed, and multi-modal input is combined with attention-based feature enhancement to learn multiple intermediate representations of an object from a two-dimensional image and then regress the object's 6D pose, comprising the following steps:

step 1, the prediction module expresses the geometric information in the RGB image with multiple intermediate representations and introduces an attention mechanism to improve network training efficiency; the intermediate representations comprise the keypoints $\kappa$, the edge vectors $\varepsilon$, and the dense pixel-wise symmetry correspondences $S$ of the RGB image;

the prediction module comprises a first prediction network $F_\kappa$, a second prediction network $F_\varepsilon$, and a third prediction network $F_S$; a channel attention module is embedded between the down-sampling modules of each of the three prediction networks; $F_\kappa$ uses PVNet as its backbone and is a keypoint-based pose estimation network that predicts the k keypoints, visible and invisible, with a voting method; the prediction networks $F_\varepsilon$ and $F_S$ are used to refine the object pose;

step 2, the pose regression module takes the intermediate representations obtained by the prediction module, combines the keypoint, edge-vector, and dense pixel-wise correspondence information, and regresses the object's 6D pose from them through EPnP computation and singular value decomposition;

the pose regression module takes the intermediate representations $\{\kappa, \varepsilon, S\}$ predicted by the networks of the prediction module as input and outputs the 6D pose of object $I$: $R_I \in SO(3)$, $t_I \in \mathbb{R}^3$.
2. The single-view pose estimation method based on multi-modal input and an attention mechanism according to claim 1, wherein step 1 comprises:

constructing a fully connected graph $\varepsilon$ with the keypoints as nodes; $F_\varepsilon$ is the network that predicts the edge vectors along the edges of this graph, adopting ResNet-18 as its backbone; $\varepsilon$ explicitly expresses the displacement between each pair of keypoints, and $|\varepsilon|$ denotes the number of edges in the predefined graph; since the graph is fully connected, $|\varepsilon| = |\kappa|(|\kappa|-1)/2$;

$F_S$ predicts a third intermediate representation $S$ that records the pixel symmetry correspondences and reflects the latent reflection symmetry of the object; $F_S$ extends the FlowNet architecture, fusing the dense pixel flow predicted by FlowNet with the mask map predicted by PVNet, and predicts the symmetry correspondence of every pixel inside the mask region;

the loss $l_1$ of $F_\kappa$, the loss $l_2$ of $F_\varepsilon$, and the loss $l_3$ of $F_S$ are all trained with the smooth-$L_1$ loss of Fast R-CNN; to reflect how strongly the different intermediate representations affect the pose estimation network, the losses of the three intermediate representations are weighted by parameters x, y, and z, where x + y + z = 1, so the total loss is:

L = x·l_1 + y·l_2 + z·l_3    (1)

channel attention module: the output $F \in \mathbb{R}^{C\times H\times W}$ of each residual block of $F_\kappa$ and $F_\varepsilon$ is taken as the input of the attention module; an average pooling operation is performed, and the channel attention module generates channel weights of dimension $\mathbb{R}^{C\times 1\times 1}$ by a one-dimensional convolution with kernel size n = 5; the weights pass through an activation function and are restored to the input dimensions to obtain the result F′, which serves as the input of the next residual block.
3. The single-view pose estimation method based on multi-modal input and an attention mechanism according to claim 1, wherein step 2 comprises:

expressing the ground-truth three-dimensional keypoint coordinates in the canonical coordinate system as $P_k \in \mathbb{R}^3$, $1 \le k \le |\kappa|$, and the ground-truth edge vectors as $V_e$, $1 \le e \le |\varepsilon|$; expressing the keypoint coordinates output by the prediction module as $p_k$, $1 \le k \le |\kappa|$, the edge vectors as $\nu_e$, $1 \le e \le |\varepsilon|$, and the symmetry correspondences as $(q_{s,1}, q_{s,2})$, $1 \le s \le |S|$; for ease of calculation, homogeneous coordinates $\bar{p}_k$, $\bar{\nu}_e$, $\bar{q}_{s,1}$, and $\bar{q}_{s,2}$ corresponding to $p_k$, $\nu_e$, $q_{s,1}$, and $q_{s,2}$ are used; these homogeneous coordinates are normalized by the known camera parameters;

computing the object's 6D pose with the EPnP algorithm combined with the constraints of the intermediate representations; first, the following difference vectors are introduced for the three prediction elements:

$r_k^{\kappa} = \bar{p}_k \times (R P_k + t), \quad 1 \le k \le |\kappa|$    (2)

$r_e^{\varepsilon} = \bar{\nu}_e \times R\,(P_{e_s} - P_{e_t}), \quad 1 \le e \le |\varepsilon|$    (3)

$r_s^{S} = (\bar{q}_{s,1} \times \bar{q}_{s,2})^{\top} R\, n, \quad 1 \le s \le |S|$    (4)

where $e_s$ and $e_t$ are the endpoints of edge e, and $n$ is the normal of the reflection symmetry plane in the canonical coordinate system;

next, equation (2) is rewritten in the forms $A_1 x$, $A_2 x$, $A_3 x$; similarly, equation (3) is rewritten as $A_4 x$, $A_5 x$, $A_6 x$, and equation (4) as $A_7 x$; $A_1, \dots, A_7$ are stacked to form A; to describe the relationship between the predicted and ground-truth values, a linear system of the form Ax = 0 is introduced, where A is a matrix of dimension $(3|\kappa| + 3|\varepsilon| + |S|) \times 12$, and x is a vector containing the parameters of the rotation matrix R and the translation vector t in affine space;

next, x is computed using the EPnP algorithm:

$x = \sum_{i=1}^{N} \lambda_i v_i$    (5)

where $v_i$ is the right singular vector corresponding to the i-th smallest singular value of A; ideally, when the prediction elements are noise-free, N = 1 and $x = v_1$ is the optimal solution; the same N = 4 as in EPnP is selected; to compute the optimal x, the hidden variables $\lambda_i$ and the rotation matrix R are optimized in an alternating optimization procedure with the following objective function:

$\min_{\{\lambda_i\}, R} \Big\| \sum_{i=1}^{N} \lambda_i \bar{v}_i - \mathrm{vec}(R) \Big\|^2$    (6)

where $\bar{v}_i$ contains the first 9 elements of $v_i$; after the optimal $\lambda_i$ are obtained, SVD is applied to project $\sum_i \lambda_i \bar{v}_i$ (reshaped to $3\times 3$) onto SO(3), giving the rotation matrix $R = U\,\mathrm{diag}(1, 1, \det(UV^{\top}))\,V^{\top}$; finally, the corresponding translation vector t is obtained from Ax = 0:

$t = -\big(A_2^{\top} A_2\big)^{-1} A_2^{\top} A_1\, \tilde{r}$    (7)

where $A_1 = A_{[:,1:9]}$, $A_2 = A_{[:,10:12]}$, and $\tilde{r} \in \mathbb{R}^9$ is obtained by flattening R.
4. A single-view pose estimation system based on multi-modal input and an attention mechanism, comprising:

a prediction module, for expressing the geometric information in the RGB image with multiple intermediate representations and introducing an attention mechanism to improve network training efficiency; comprising a first prediction network $F_\kappa$, a second prediction network $F_\varepsilon$, and a third prediction network $F_S$, with a channel attention module embedded between the down-sampling modules of each of the three prediction networks; $F_\kappa$ uses PVNet as its backbone and is a keypoint-based pose estimation network that predicts the k keypoints, visible and invisible, with a voting method; the prediction networks $F_\varepsilon$ and $F_S$ are used to refine the object pose;

a pose regression module, for taking the intermediate representations obtained by the prediction module, combining the keypoint, edge-vector, and dense pixel-wise correspondence information, and regressing the object's 6D pose from them through EPnP computation and singular value decomposition; the pose regression module takes the intermediate representations $\{\kappa, \varepsilon, S\}$ predicted by the networks of the prediction module as input and outputs the 6D pose of object $I$: $R_I \in SO(3)$, $t_I \in \mathbb{R}^3$.
5. The single-view pose estimation system based on multi-modal input and an attention mechanism according to claim 4, wherein, in the prediction module:

$F_\kappa$ uses PVNet as its backbone and is a keypoint-based pose estimation network; a fully connected graph is constructed with the keypoints as nodes, and $F_\varepsilon$ is the network that predicts the edge vectors along the edges of this graph, adopting ResNet-18 as its backbone; $\varepsilon$ explicitly expresses the displacement between each pair of keypoints, and $|\varepsilon|$ denotes the number of edges in the predefined graph; since $\varepsilon$ is a fully connected graph, $|\varepsilon| = |\kappa|(|\kappa|-1)/2$;

$F_S$ predicts a third intermediate representation $S$ that records the pixel symmetry correspondences and reflects the latent reflection symmetry of the object; $F_S$ extends the FlowNet architecture, fusing the dense pixel flow predicted by FlowNet with the mask map predicted by PVNet, and predicts the symmetry correspondence of every pixel inside the mask region;

the loss $l_1$ of $F_\kappa$, the loss $l_2$ of $F_\varepsilon$, and the loss $l_3$ of $F_S$ are all trained with the smooth-$L_1$ loss of Fast R-CNN; to reflect how strongly the different intermediate representations affect the pose estimation network, the losses of the three intermediate representations are weighted by parameters x, y, and z, where x + y + z = 1; the total loss is therefore:

L = x·l_1 + y·l_2 + z·l_3    (1)

channel attention module: the output $F \in \mathbb{R}^{C\times H\times W}$ of each residual block of $F_\kappa$ and $F_\varepsilon$ is taken as the input of the attention module; an average pooling operation is performed, and the channel attention module generates channel weights of dimension $\mathbb{R}^{C\times 1\times 1}$ by a one-dimensional convolution with kernel size n = 5; the weights pass through an activation function and are restored to the input dimensions to obtain the result F′, which serves as the input of the next residual block.
6. The single-view pose estimation system based on multi-modal input and an attention mechanism according to claim 4, wherein, in the pose regression module:

the ground-truth three-dimensional keypoint coordinates in the canonical coordinate system are expressed as $P_k \in \mathbb{R}^3$, $1 \le k \le |\kappa|$, and the ground-truth edge vectors as $V_e$, $1 \le e \le |\varepsilon|$; the keypoint coordinates output by the prediction module are expressed as $p_k$, $1 \le k \le |\kappa|$, the edge vectors as $\nu_e$, $1 \le e \le |\varepsilon|$, and the symmetry correspondences as $(q_{s,1}, q_{s,2})$, $1 \le s \le |S|$; for ease of calculation, homogeneous coordinates $\bar{p}_k$, $\bar{\nu}_e$, $\bar{q}_{s,1}$, and $\bar{q}_{s,2}$ corresponding to $p_k$, $\nu_e$, $q_{s,1}$, and $q_{s,2}$ are used; these homogeneous coordinates are normalized by the known camera parameters;

the object's 6D pose is computed with the EPnP algorithm combined with the constraints of the intermediate representations; first, the following difference vectors are introduced for the three prediction elements:

$r_k^{\kappa} = \bar{p}_k \times (R P_k + t), \quad 1 \le k \le |\kappa|$    (2)

$r_e^{\varepsilon} = \bar{\nu}_e \times R\,(P_{e_s} - P_{e_t}), \quad 1 \le e \le |\varepsilon|$    (3)

$r_s^{S} = (\bar{q}_{s,1} \times \bar{q}_{s,2})^{\top} R\, n, \quad 1 \le s \le |S|$    (4)

where $e_s$ and $e_t$ are the endpoints of edge e, and $n$ is the normal of the reflection symmetry plane in the canonical coordinate system;

next, equation (2) is rewritten in the forms $A_1 x$, $A_2 x$, $A_3 x$; similarly, equation (3) is rewritten as $A_4 x$, $A_5 x$, $A_6 x$, and equation (4) as $A_7 x$; $A_1, \dots, A_7$ are stacked to form A; to describe the relationship between the predicted and ground-truth values, a linear system of the form Ax = 0 is introduced, where A is a matrix of dimension $(3|\kappa| + 3|\varepsilon| + |S|) \times 12$, and x is a vector containing the parameters of the rotation matrix R and the translation vector t in affine space;

next, x is computed using the EPnP algorithm:

$x = \sum_{i=1}^{N} \lambda_i v_i$    (5)

where $v_i$ is the right singular vector corresponding to the i-th smallest singular value of A; ideally, when the prediction elements are noise-free, N = 1 and $x = v_1$ is the optimal solution; the same N = 4 as in EPnP is selected; to compute the optimal x, the hidden variables $\lambda_i$ and the rotation matrix R are optimized in an alternating optimization procedure with the following objective function:

$\min_{\{\lambda_i\}, R} \Big\| \sum_{i=1}^{N} \lambda_i \bar{v}_i - \mathrm{vec}(R) \Big\|^2$    (6)

where $\bar{v}_i$ contains the first 9 elements of $v_i$; after the optimal $\lambda_i$ are obtained, SVD is applied to project $\sum_i \lambda_i \bar{v}_i$ (reshaped to $3\times 3$) onto SO(3), giving the rotation matrix $R = U\,\mathrm{diag}(1, 1, \det(UV^{\top}))\,V^{\top}$; finally, the corresponding translation vector t is obtained from Ax = 0:

$t = -\big(A_2^{\top} A_2\big)^{-1} A_2^{\top} A_1\, \tilde{r}$    (7)

where $A_1 = A_{[:,1:9]}$, $A_2 = A_{[:,10:12]}$, and $\tilde{r} \in \mathbb{R}^9$ is obtained by flattening R.
CN202211380719.3A 2022-11-04 2022-11-04 Single-view pose estimation method and system based on multi-modal input and attention mechanism Pending CN115861418A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211380719.3A CN115861418A (en) 2022-11-04 2022-11-04 Single-view attitude estimation method and system based on multi-mode input and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211380719.3A CN115861418A (en) 2022-11-04 2022-11-04 Single-view pose estimation method and system based on multi-modal input and attention mechanism

Publications (1)

Publication Number Publication Date
CN115861418A true CN115861418A (en) 2023-03-28

Family

ID=85662588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211380719.3A Pending CN115861418A (en) 2022-11-04 2022-11-04 Single-view attitude estimation method and system based on multi-mode input and attention mechanism

Country Status (1)

Country Link
CN (1) CN115861418A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117237451A (en) * 2023-09-15 2023-12-15 南京航空航天大学 Industrial part 6D pose estimation method based on contour reconstruction and geometric guidance
CN117237451B (en) * 2023-09-15 2024-04-02 南京航空航天大学 Industrial part 6D pose estimation method based on contour reconstruction and geometric guidance
CN117351157A (en) * 2023-12-05 2024-01-05 北京渲光科技有限公司 Single-view three-dimensional scene pose estimation method, system and equipment
CN117351157B (en) * 2023-12-05 2024-02-13 北京渲光科技有限公司 Single-view three-dimensional scene pose estimation method, system and equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination