CN115861418A - Single-view pose estimation method and system based on multi-modal input and attention mechanism

Single-view pose estimation method and system based on multi-modal input and attention mechanism

Info

Publication number
CN115861418A
Authority
CN
China
Prior art keywords
pose
network
module
prediction
equal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211380719.3A
Other languages
Chinese (zh)
Inventor
史金龙
张文睿
钱强
欧镇
白素琴
钱萍
田朝晖
邓权耀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University of Science and Technology
Original Assignee
Jiangsu University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University of Science and Technology filed Critical Jiangsu University of Science and Technology
Priority to CN202211380719.3A priority Critical patent/CN115861418A/en
Publication of CN115861418A publication Critical patent/CN115861418A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a single-view pose estimation method based on multi-modal input and an attention mechanism. A single-view pose estimation system comprising a prediction module and a pose regression module is constructed; it combines multi-modal input with attention-based feature enhancement to learn multiple intermediate representations of an object from a two-dimensional image and then regresses the object's 6D pose. The prediction module adopts ResNet-18 as its backbone network and adds a channel attention mechanism to estimate several intermediate representations of the object's 6D pose, including keypoints, edge vectors between the keypoints, and symmetry correspondences between pixels. The pose regression module regresses the object's 6D pose from these intermediate representations with the EPnP algorithm and singular value decomposition. The invention provides an accurate and convenient technique for quickly estimating the 6D pose of an object from a single view.

Description

Single-view pose estimation method and system based on multi-modal input and attention mechanism
Technical Field
The invention belongs to the technical field of estimating the 6D pose of objects from a single view, and relates to a single-view pose estimation method and system based on multi-modal input and an attention mechanism.
Background
In applications such as robotic grasping and augmented reality, estimating the 6D pose of an object from an RGB image is an important task. Although introducing depth images can significantly improve this task, depth images are not always readily available: most cell phones, tablet computers, and industrial cameras do not provide depth data. Much research is therefore devoted to estimating the 6D pose of a known object using only RGB images. Traditional methods match RGB image features against a 3D model of the object; they rely on hand-crafted features and lack robustness to illumination changes, background clutter, and low-texture objects. The development of deep learning has accelerated research on estimating an object's 6D pose from RGB images. A currently popular keypoint approach trains a model with keypoints as intermediate supervision signals and then combines the 2D keypoints predicted by a neural network with a PnP algorithm to estimate the object's 6D pose, for example PVNet [Peng S, et al. PVNet: Pixel-wise voting network for 6DoF pose estimation. IEEE/CVF CVPR 2019] and the like.
The performance of the keypoint approach, however, relies on two assumptions: 1) the deep learning model can accurately predict the positions of the two-dimensional keypoints; 2) the predicted two-dimensional keypoints provide sufficient constraints to regress the object's 6D pose. In practice, keypoint predictions become inaccurate under partial occlusion and similar factors, so these two assumptions often fail in real-world environments.
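For illustration, a minimal sketch of the keypoint-plus-PnP baseline described above, using OpenCV's solvePnP; the predictor producing the 2D keypoints, the model keypoints, and the camera matrix are placeholders assumed by this sketch, not part of the patent:

```python
# Sketch of the keypoint + PnP baseline (assumed interfaces; illustrative only).
import cv2
import numpy as np

def pose_from_keypoints(kpts_2d, kpts_3d, K):
    """Recover (R, t) from predicted 2D keypoints and known 3D model keypoints.

    kpts_2d: (N, 2) keypoints predicted by a network such as PVNet.
    kpts_3d: (N, 3) corresponding model keypoints in the object frame.
    K:       (3, 3) camera intrinsic matrix.
    """
    ok, rvec, tvec = cv2.solvePnP(
        kpts_3d.astype(np.float64),
        kpts_2d.astype(np.float64),
        K.astype(np.float64),
        None,                       # no distortion coefficients
        flags=cv2.SOLVEPNP_EPNP,    # EPnP, the solver also used later in this patent
    )
    assert ok, "PnP failed"
    R, _ = cv2.Rodrigues(rvec)      # axis-angle -> rotation matrix
    return R, tvec.reshape(3)
```

When occlusion corrupts even a few of the N keypoints, this single-representation pipeline degrades quickly, which motivates the additional constraints introduced below.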
Disclosure of Invention
The invention aims to overcome the defects of existing single-view 6D pose estimation methods and provides a single-view pose estimation method and system based on multi-modal input and an attention mechanism.
In order to solve the above technical problems, the invention adopts the following technical scheme.
The single-view pose estimation method based on multi-modal input and an attention mechanism constructs a single-view pose estimation system comprising a prediction module and a pose regression module, combines multi-modal input with attention-based feature enhancement to learn multiple intermediate representations of an object from a two-dimensional image, and then regresses the object's 6D pose. It comprises the following steps:
Step 1: the prediction module expresses the geometric information in the RGB image with multiple intermediate representations and introduces an attention mechanism to improve network training efficiency. The intermediate representations comprise the keypoints $\kappa$, the edge vectors $\varepsilon$, and the dense pixel-wise symmetry correspondences $S$ of the RGB image.

The prediction module comprises a first prediction network $F_\kappa$, a second prediction network $F_\varepsilon$, and a third prediction network $F_S$; a channel attention module is embedded between the down-sampling modules of each of the three prediction networks. $F_\kappa$ uses PVNet as its backbone; it is a keypoint-based pose estimation network that predicts the k keypoints, visible and invisible, with a voting method. The prediction networks $F_\varepsilon$ and $F_S$ are used to refine the object pose.

Step 2: the pose regression module takes the intermediate representations obtained by the prediction module, combines the keypoint, edge-vector, and dense pixel-wise correspondence information, and regresses the object's 6D pose from them through EPnP computation and singular value decomposition.

The pose regression module takes the intermediate representations $\{\kappa, \varepsilon, S\}$ predicted by the networks of the prediction module as input and outputs the 6D pose of object $I$: $R_I \in SO(3)$, $t_I \in \mathbb{R}^3$.
Specifically, step 1 comprises:

A fully connected graph $\varepsilon$ is constructed with the keypoints as nodes. $F_\varepsilon$ is the network that predicts the edge vectors along the edges of this graph, and adopts ResNet-18 as its backbone. $\varepsilon$ explicitly expresses the displacement between each pair of keypoints, and $|\varepsilon|$ denotes the number of edges in the predefined graph; since the graph is fully connected, $|\varepsilon| = |\kappa|(|\kappa|-1)/2$.

$F_S$ predicts a third intermediate representation $S$ that records the pixel symmetry correspondences and reflects the latent reflection symmetry of the object. $F_S$ extends the FlowNet architecture, fusing the dense pixel flow predicted by FlowNet with the mask map predicted by PVNet, and predicts the symmetry correspondence of every pixel inside the mask region.

The loss $l_1$ of $F_\kappa$, the loss $l_2$ of $F_\varepsilon$, and the loss $l_3$ of $F_S$ are all trained with the smooth-$L_1$ loss of Fast R-CNN. To reflect how strongly the different intermediate representations affect the pose estimation network, the losses of the three intermediate representations are weighted by parameters x, y, and z respectively, where x + y + z = 1, so the total loss is:

L = x·l_1 + y·l_2 + z·l_3    (1)
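A minimal PyTorch sketch of the weighted total loss in equation (1); the tensor names and the particular weight values are illustrative assumptions, not values fixed by the patent:

```python
# Sketch of the total loss L = x*l1 + y*l2 + z*l3 (equation (1)).
# x, y, z are assumed hyperparameters satisfying x + y + z = 1.
import torch
import torch.nn.functional as F

def total_loss(pred_kpts, gt_kpts, pred_edges, gt_edges, pred_sym, gt_sym,
               x=0.4, y=0.3, z=0.3):
    # Each intermediate representation is trained with the smooth-L1
    # (Huber) loss of Fast R-CNN.
    l1 = F.smooth_l1_loss(pred_kpts, gt_kpts)    # keypoint loss
    l2 = F.smooth_l1_loss(pred_edges, gt_edges)  # edge-vector loss
    l3 = F.smooth_l1_loss(pred_sym, gt_sym)      # symmetry-correspondence loss
    return x * l1 + y * l2 + z * l3
```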
Channel attention module: the output $F \in \mathbb{R}^{C\times H\times W}$ of each residual block of $F_\kappa$ and $F_\varepsilon$ is taken as the input of the attention module. An average pooling operation is performed, and the channel attention module then generates channel weights of dimension $\mathbb{R}^{C\times 1\times 1}$ by a one-dimensional convolution with kernel size n = 5. The weights pass through an activation function and are restored to the input dimensions to obtain the result F′, which serves as the input of the next residual block.
Specifically, step 2 comprises:

The ground-truth three-dimensional keypoint coordinates in the canonical coordinate system are written as $P_k \in \mathbb{R}^3$, $1 \le k \le |\kappa|$, and the ground-truth edge vectors as $V_e$, $1 \le e \le |\varepsilon|$. The keypoint coordinates output by the prediction module are written as $p_k$, $1 \le k \le |\kappa|$, the edge vectors as $\nu_e$, $1 \le e \le |\varepsilon|$, and the symmetry correspondences as $(q_{s,1}, q_{s,2})$, $1 \le s \le |S|$. For ease of calculation, homogeneous coordinates $\bar{p}_k$, $\bar{\nu}_e$, $\bar{q}_{s,1}$, and $\bar{q}_{s,2}$ corresponding to $p_k$, $\nu_e$, $q_{s,1}$, and $q_{s,2}$ are used; these homogeneous coordinates are normalized by the known camera parameters.
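As a concrete reading of the normalization step, a pixel coordinate can be lifted to a homogeneous vector and normalized by the camera intrinsics; a small sketch, assuming the intrinsic matrix K is known:

```python
import numpy as np

def normalize_homogeneous(pts_2d, K):
    """Lift (N, 2) pixel coordinates to normalized homogeneous coordinates K^{-1} [u, v, 1]^T."""
    n = pts_2d.shape[0]
    pts_h = np.concatenate([pts_2d, np.ones((n, 1))], axis=1)  # (N, 3) homogeneous
    return (np.linalg.inv(K) @ pts_h.T).T                      # normalized by intrinsics
```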
The 6D pose of the object is computed with the EPnP algorithm combined with the constraints of the intermediate representations. First, the following difference vectors are introduced for the three prediction elements:

$r_k^{\kappa} = \bar{p}_k \times (R P_k + t), \quad 1 \le k \le |\kappa|$    (2)

$r_e^{\varepsilon} = \bar{\nu}_e \times R\,(P_{e_s} - P_{e_t}), \quad 1 \le e \le |\varepsilon|$    (3)

$r_s^{S} = (\bar{q}_{s,1} \times \bar{q}_{s,2})^{\top} R\, n, \quad 1 \le s \le |S|$    (4)

where $e_s$ and $e_t$ are the endpoints of edge e, and $n$ is the normal of the reflection symmetry plane in the canonical coordinate system.

Next, equation (2) is rewritten in the forms $A_1 x$, $A_2 x$, $A_3 x$; similarly, equation (3) is rewritten as $A_4 x$, $A_5 x$, $A_6 x$, and equation (4) as $A_7 x$. $A_1, \dots, A_7$ are stacked to form A. To describe the relationship between the predicted and ground-truth values, a linear system of the form Ax = 0 is introduced, where A is a matrix of dimension $(3|\kappa| + 3|\varepsilon| + |S|) \times 12$, and x is a vector containing the parameters of the rotation matrix R and the translation vector t in affine space.

Next, x is computed using the EPnP algorithm:

$x = \sum_{i=1}^{N} \lambda_i v_i$    (5)

where $v_i$ is the right singular vector corresponding to the i-th smallest singular value of A. Ideally, when the prediction elements are noise-free, N = 1 and $x = v_1$ is the optimal solution; the same N = 4 as in EPnP is selected. To compute the optimal x, the hidden variables $\lambda_i$ and the rotation matrix R are optimized in an alternating optimization procedure with the following objective function:

$\min_{\{\lambda_i\}, R} \Big\| \sum_{i=1}^{N} \lambda_i \bar{v}_i - \mathrm{vec}(R) \Big\|^2$    (6)

where $\bar{v}_i$ contains the first 9 elements of $v_i$. After the optimal $\lambda_i$ are obtained, SVD is applied to project $\sum_i \lambda_i \bar{v}_i$ (reshaped to $3\times 3$) onto SO(3), giving the rotation matrix $R = U\,\mathrm{diag}(1, 1, \det(UV^{\top}))\,V^{\top}$. Finally, the corresponding translation vector t is obtained from Ax = 0:

$t = -\big(A_2^{\top} A_2\big)^{-1} A_2^{\top} A_1\, \tilde{r}$    (7)

where $A_1 = A_{[:,1:9]}$, $A_2 = A_{[:,10:12]}$, and $\tilde{r} \in \mathbb{R}^9$ is obtained by flattening R.
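To make the construction of the linear system Ax = 0 concrete, the sketch below stacks the keypoint rows of equation (2) into A; the edge and symmetry rows of equations (3)-(4) follow the same pattern. The cross-product-as-matrix trick and the row-major ordering of vec(R) are assumptions of this sketch:

```python
import numpy as np

def skew(v):
    """Cross-product matrix: skew(v) @ u == np.cross(v, u)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def keypoint_rows(p_bar, P):
    """3 rows of A from one keypoint constraint  p_bar x (R P + t) = 0,
    with x = (r, t) and r the row-major flattening of R."""
    S = skew(p_bar)
    block_R = S @ np.kron(np.eye(3), P.reshape(1, 3))  # 3x9 block acting on vec(R)
    return np.hstack([block_R, S])                     # 3x12 row block of A

def build_A(kpts_h, kpts_3d):
    """Stack one 3x12 block per keypoint; edge/symmetry blocks would be appended similarly."""
    return np.vstack([keypoint_rows(p, P) for p, P in zip(kpts_h, kpts_3d)])
```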
The single-view pose estimation system based on multi-modal input and an attention mechanism according to the invention comprises:

a prediction module, for expressing the geometric information in the RGB image with multiple intermediate representations and introducing an attention mechanism to improve network training efficiency; it comprises a first prediction network $F_\kappa$, a second prediction network $F_\varepsilon$, and a third prediction network $F_S$, with a channel attention module embedded between the down-sampling modules of each of the three prediction networks; $F_\kappa$ uses PVNet as its backbone and is a keypoint-based pose estimation network that predicts the k keypoints, visible and invisible, with a voting method; the prediction networks $F_\varepsilon$ and $F_S$ are used to refine the object pose;

a pose regression module, for taking the intermediate representations obtained by the prediction module, combining the keypoint, edge-vector, and dense pixel-wise correspondence information, and regressing the object's 6D pose from them through EPnP computation and singular value decomposition; the pose regression module takes the intermediate representations $\{\kappa, \varepsilon, S\}$ predicted by the networks of the prediction module as input and outputs the 6D pose of object $I$: $R_I \in SO(3)$, $t_I \in \mathbb{R}^3$.
Further, in the prediction module:

$F_\kappa$ uses PVNet as its backbone and is a keypoint-based pose estimation network. A fully connected graph is constructed with the keypoints as nodes, and $F_\varepsilon$ is the network that predicts the edge vectors along the edges of this graph, adopting ResNet-18 as its backbone. $\varepsilon$ explicitly expresses the displacement between each pair of keypoints, and $|\varepsilon|$ denotes the number of edges in the predefined graph; since $\varepsilon$ is a fully connected graph, $|\varepsilon| = |\kappa|(|\kappa|-1)/2$.

$F_S$ predicts a third intermediate representation $S$ that records the pixel symmetry correspondences and reflects the latent reflection symmetry of the object. $F_S$ extends the FlowNet architecture, fusing the dense pixel flow predicted by FlowNet with the mask map predicted by PVNet, and predicts the symmetry correspondence of every pixel inside the mask region.

The loss $l_1$ of $F_\kappa$, the loss $l_2$ of $F_\varepsilon$, and the loss $l_3$ of $F_S$ are all trained with the smooth-$L_1$ loss of Fast R-CNN. To reflect how strongly the different intermediate representations affect the pose estimation network, the losses of the three intermediate representations are weighted by parameters x, y, and z respectively, where x + y + z = 1; the total loss is therefore:

L = x·l_1 + y·l_2 + z·l_3    (1)

Channel attention module: the output $F \in \mathbb{R}^{C\times H\times W}$ of each residual block of $F_\kappa$ and $F_\varepsilon$ is taken as the input of the attention module. An average pooling operation is performed, and the channel attention module then generates channel weights of dimension $\mathbb{R}^{C\times 1\times 1}$ by a one-dimensional convolution with kernel size n = 5. The weights pass through an activation function and are restored to the input dimensions to obtain the result F′, which serves as the input of the next residual block.
Further, in the pose regression module:

The ground-truth three-dimensional keypoint coordinates in the canonical coordinate system are written as $P_k \in \mathbb{R}^3$, $1 \le k \le |\kappa|$, and the ground-truth edge vectors as $V_e$, $1 \le e \le |\varepsilon|$. The keypoint coordinates output by the prediction module are written as $p_k$, $1 \le k \le |\kappa|$, the edge vectors as $\nu_e$, $1 \le e \le |\varepsilon|$, and the symmetry correspondences as $(q_{s,1}, q_{s,2})$, $1 \le s \le |S|$. For ease of calculation, homogeneous coordinates $\bar{p}_k$, $\bar{\nu}_e$, $\bar{q}_{s,1}$, and $\bar{q}_{s,2}$ corresponding to $p_k$, $\nu_e$, $q_{s,1}$, and $q_{s,2}$ are used; these homogeneous coordinates are normalized by the known camera parameters.

The 6D pose of the object is computed with the EPnP algorithm combined with the constraints of the intermediate representations. First, the following difference vectors are introduced for the three prediction elements:

$r_k^{\kappa} = \bar{p}_k \times (R P_k + t), \quad 1 \le k \le |\kappa|$    (2)

$r_e^{\varepsilon} = \bar{\nu}_e \times R\,(P_{e_s} - P_{e_t}), \quad 1 \le e \le |\varepsilon|$    (3)

$r_s^{S} = (\bar{q}_{s,1} \times \bar{q}_{s,2})^{\top} R\, n, \quad 1 \le s \le |S|$    (4)

where $e_s$ and $e_t$ are the endpoints of edge e, and $n$ is the normal of the reflection symmetry plane in the canonical coordinate system.

Next, equation (2) is rewritten in the forms $A_1 x$, $A_2 x$, $A_3 x$; similarly, equation (3) is rewritten as $A_4 x$, $A_5 x$, $A_6 x$, and equation (4) as $A_7 x$. $A_1, \dots, A_7$ are stacked to form A. To describe the relationship between the predicted and ground-truth values, a linear system of the form Ax = 0 is introduced, where A is a matrix of dimension $(3|\kappa| + 3|\varepsilon| + |S|) \times 12$, and x is a vector containing the parameters of the rotation matrix R and the translation vector t in affine space.

Next, x is computed using the EPnP algorithm:

$x = \sum_{i=1}^{N} \lambda_i v_i$    (5)

where $v_i$ is the right singular vector corresponding to the i-th smallest singular value of A. Ideally, when the prediction elements are noise-free, N = 1 and $x = v_1$ is the optimal solution; the same N = 4 as in EPnP is selected. To compute the optimal x, the hidden variables $\lambda_i$ and the rotation matrix R are optimized in an alternating optimization procedure with the following objective function:

$\min_{\{\lambda_i\}, R} \Big\| \sum_{i=1}^{N} \lambda_i \bar{v}_i - \mathrm{vec}(R) \Big\|^2$    (6)

where $\bar{v}_i$ contains the first 9 elements of $v_i$. After the optimal $\lambda_i$ are obtained, SVD is applied to project $\sum_i \lambda_i \bar{v}_i$ (reshaped to $3\times 3$) onto SO(3), giving the rotation matrix $R = U\,\mathrm{diag}(1, 1, \det(UV^{\top}))\,V^{\top}$. Finally, the corresponding translation vector t is obtained from Ax = 0:

$t = -\big(A_2^{\top} A_2\big)^{-1} A_2^{\top} A_1\, \tilde{r}$    (7)

where $A_1 = A_{[:,1:9]}$, $A_2 = A_{[:,10:12]}$, and $\tilde{r} \in \mathbb{R}^9$ is obtained by flattening R.
Compared with the prior art, the invention has the following advantages and beneficial effects:

1. The invention provides a new 6D pose estimation method centered on a channel-attention-based 6D pose estimation network. Convolutional neural network (CNN) models have certain limitations in predicting the 6D pose, owing to the size of their convolution kernels and the pooling of features. When the target region occupies only a small fraction of the network's input image, the background information generates a large amount of noise over multiple iterated convolutions and interferes with feature extraction in the target region. To overcome this limitation of convolutional networks and improve training efficiency and accuracy, the invention introduces a channel attention module, which better extracts detailed features in the image and yields more accurate pose estimates.

2. The method uses multiple intermediate representations — keypoints, edge vectors between the keypoints, and dense pixel-wise symmetry correspondences — to express different geometric information in the input image, maximizing the accuracy of the resulting 6D pose estimates. Adding symmetry-correspondence constraints on top of the keypoints improves the model's performance in estimating the rotation matrix; adding edge vectors to the keypoints and symmetry correspondences provides more constraints on translation and rotation. Edge vectors provide more translational constraints than keypoints, since they represent the displacements between neighboring keypoints and provide gradient information for the regression. Unlike the symmetry correspondences, the edge vectors constrain 3 degrees of freedom of the rotation parameters, which further improves rotation estimation.

3. Even in complex environments and under occlusion, the method predicts the object's 6D pose quite accurately. The 6D pose estimation network of the invention is evaluated on two popular benchmark datasets, Linemod and Occlusion Linemod, and compared with current state-of-the-art 6D pose estimation algorithms (such as Pix2Pose, PVNet, PoseCNN, and CDPN). The average ADD(-S) accuracy of the invention is 92.8 on the Linemod dataset and 48.0 on the Occlusion Linemod dataset. On the occlusion problem, the method improves on Pix2Pose by 16%, clearly demonstrating its advantage in predicting the 6D pose of occluded objects.
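For reference, the ADD and ADD-S metrics quoted above are commonly computed as in the sketch below; the 10%-of-diameter correctness threshold is the convention used by these benchmarks, and the variable names are illustrative:

```python
import numpy as np

def add_metric(R_gt, t_gt, R_pred, t_pred, model_pts):
    """ADD: mean distance between model points under the GT and predicted poses."""
    gt = model_pts @ R_gt.T + t_gt
    pred = model_pts @ R_pred.T + t_pred
    return np.mean(np.linalg.norm(gt - pred, axis=1))

def add_s_metric(R_gt, t_gt, R_pred, t_pred, model_pts):
    """ADD-S: for symmetric objects, mean distance to the closest predicted point."""
    gt = model_pts @ R_gt.T + t_gt
    pred = model_pts @ R_pred.T + t_pred
    d = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)  # pairwise distances
    return np.mean(d.min(axis=1))

def is_correct(add_value, diameter, ratio=0.1):
    """A pose is counted correct if ADD(-S) is below 10% of the object diameter."""
    return add_value < ratio * diameter
```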
Drawings
FIG. 1 is a system block diagram of one embodiment of the present invention.
FIG. 2 is a channel attention module diagram of one embodiment of the present invention.
FIG. 3 is a diagram of the intermediate representation results and pose estimation results of different objects in the experiment of the present invention.
FIG. 4 is a diagram of pose estimation results under occlusion of different proportions in the experiment of the invention.
FIG. 5 is a comparison of network training loss before and after the attention module is added to the model in the experiments of the invention.
FIG. 6 is a comparison of network training loss before and after the attention module is added to the cat model in the experiments of the invention.
Detailed Description
The invention discloses a single-view pose estimation method and system based on multi-modal input and an attention mechanism. The constructed system comprises a prediction module and a pose regression module and combines multi-modal input with attention-based feature enhancement to learn multiple intermediate representations of an object from a two-dimensional image and then regress the object's 6D pose. The prediction module adopts ResNet-18 as its backbone network and adds a channel attention mechanism to estimate several intermediate representations of the object's 6D pose, including keypoints, edge vectors between the keypoints, and symmetry correspondences between pixels; the pose regression module regresses the object's 6D pose from these intermediate representations with the EPnP algorithm and singular value decomposition. The invention provides an accurate and convenient technique for quickly estimating the 6D pose of an object from a single view.
The present invention will be described in further detail with reference to the accompanying drawings.
Fig. 1 is a system block diagram of an embodiment of the invention and also embodies its technical principle: estimating the 6D pose of an object from a single RGB image through the following steps:

Step 1: the prediction module expresses the geometric information in the RGB image with multiple intermediate representations and introduces an attention mechanism to improve network training efficiency;

Step 2: the pose regression module takes the intermediate representations obtained by the prediction module, combines the keypoint, edge-vector, and dense pixel-wise correspondence information, and regresses the object's 6D pose from them through EPnP computation and singular value decomposition.

The prediction module comprises three prediction networks — a first prediction network $F_\kappa$, a second prediction network $F_\varepsilon$, and a third prediction network $F_S$ — together with the fully connected graph built over the keypoints; a channel attention module is embedded between the down-sampling modules of each of the three prediction networks. $F_\kappa$ uses PVNet as its backbone; it is a keypoint-based pose estimation network that predicts the k keypoints, visible and invisible, with a voting method. To address the large pose errors that arise when the keypoint predictions are inaccurate, the other two prediction networks, $F_\varepsilon$ and $F_S$, are introduced to refine the object pose.
The invention constructs a fully connected graph with the keypoints as nodes. $F_\varepsilon$ is the network that predicts the vectors along the edges of this graph, adopting ResNet-18 as its backbone. $\varepsilon$ explicitly expresses the displacement between each pair of keypoints, and $|\varepsilon|$ denotes the number of edges in the predefined graph; since $\varepsilon$ is a fully connected graph in the invention, $|\varepsilon| = |\kappa|(|\kappa|-1)/2$.

$F_S$ predicts a third intermediate representation $S$ that records the pixel symmetry correspondences and reflects the latent reflection symmetry of the object. $F_S$ extends the FlowNet architecture, fusing the dense pixel flow predicted by FlowNet with the mask map predicted by PVNet, and predicts the symmetry correspondence of every pixel inside the mask region. Compared with the first two intermediate representations, the symmetry correspondences carry a larger amount of data and constrain the pose estimate more strongly; in particular, they provide rich constraints for occluded objects and thus handle the occlusion problem better.
In the invention, the loss $l_1$ of $F_\kappa$, the loss $l_2$ of $F_\varepsilon$, and the loss $l_3$ of $F_S$ are all trained with the smooth-$L_1$ loss of Fast R-CNN. To reflect how strongly the different intermediate representations affect the pose estimation network, the losses of the three intermediate representations are weighted by parameters x, y, and z, where x + y + z = 1. The total loss is therefore:

L = x·l_1 + y·l_2 + z·l_3    (1)
As shown in Fig. 2, the channel attention module takes the output $F \in \mathbb{R}^{C\times H\times W}$ of each residual block of $F_\kappa$ and $F_\varepsilon$ as its input, performs an average pooling operation, and generates channel weights of dimension $\mathbb{R}^{C\times 1\times 1}$ by a one-dimensional convolution with kernel size n = 5. The weights pass through an activation function and are restored to the input dimensions to obtain the result F′, which serves as the input of the next residual block.

Pose regression module: it takes the intermediate representations $\{\kappa, \varepsilon, S\}$ predicted by the networks as input and outputs the 6D pose of object $I$ ($R_I \in SO(3)$, $t_I \in \mathbb{R}^3$).
The invention writes the ground-truth three-dimensional keypoint coordinates in the canonical coordinate system as $P_k \in \mathbb{R}^3$, $1 \le k \le |\kappa|$, and the ground-truth edge vectors as $V_e$, $1 \le e \le |\varepsilon|$. The keypoint coordinates output by the prediction module are written as $p_k$, $1 \le k \le |\kappa|$, the edge vectors as $\nu_e$, $1 \le e \le |\varepsilon|$, and the symmetry correspondences as $(q_{s,1}, q_{s,2})$, $1 \le s \le |S|$. For ease of calculation, the invention uses homogeneous coordinates $\bar{p}_k$, $\bar{\nu}_e$, $\bar{q}_{s,1}$, and $\bar{q}_{s,2}$ corresponding to $p_k$, $\nu_e$, $q_{s,1}$, and $q_{s,2}$; these homogeneous coordinates are normalized by the known camera parameters.
The invention computes the object's 6D pose using the EPnP algorithm combined with the intermediate representation constraints. First, the following difference vectors are introduced for the three prediction elements:

$r_k^{\kappa} = \bar{p}_k \times (R P_k + t), \quad 1 \le k \le |\kappa|$    (2)

$r_e^{\varepsilon} = \bar{\nu}_e \times R\,(P_{e_s} - P_{e_t}), \quad 1 \le e \le |\varepsilon|$    (3)

$r_s^{S} = (\bar{q}_{s,1} \times \bar{q}_{s,2})^{\top} R\, n, \quad 1 \le s \le |S|$    (4)

where $e_s$ and $e_t$ are the endpoints of edge e, and $n$ is the normal of the reflection symmetry plane in the canonical coordinate system.

Next, equation (2) is rewritten in the forms $A_1 x$, $A_2 x$, $A_3 x$; similarly, equation (3) is rewritten as $A_4 x$, $A_5 x$, $A_6 x$, and equation (4) as $A_7 x$. $A_1, \dots, A_7$ are stacked to form A. To describe the relationship between the predicted and ground-truth values, the invention introduces a linear system of the form Ax = 0, where A is a matrix with dimensions $(3|\kappa| + 3|\varepsilon| + |S|) \times 12$, and x is a vector containing the rotation matrix R and translation vector t parameters in affine space.
Next, x is computed using the EPnP algorithm:

$x = \sum_{i=1}^{N} \lambda_i v_i$    (5)

where $v_i$ is the right singular vector corresponding to the i-th smallest singular value of A. Ideally, when the prediction elements are noise-free, N = 1 and $x = v_1$ is the optimal solution. In real-world situations, however, that choice performs poorly, so the invention selects the same N = 4 as EPnP. To compute the optimal x, the invention optimizes the hidden variables $\lambda_i$ and the rotation matrix R in an alternating optimization procedure with the following objective function:

$\min_{\{\lambda_i\}, R} \Big\| \sum_{i=1}^{N} \lambda_i \bar{v}_i - \mathrm{vec}(R) \Big\|^2$    (6)

where $\bar{v}_i$ contains the first 9 elements of $v_i$. After the optimal $\lambda_i$ are obtained, SVD is applied to project $\sum_i \lambda_i \bar{v}_i$ (reshaped to $3\times 3$) onto SO(3), giving the rotation matrix $R = U\,\mathrm{diag}(1, 1, \det(UV^{\top}))\,V^{\top}$. Finally, the corresponding translation vector t is obtained from Ax = 0:

$t = -\big(A_2^{\top} A_2\big)^{-1} A_2^{\top} A_1\, \tilde{r}$    (7)

where $A_1 = A_{[:,1:9]}$, $A_2 = A_{[:,10:12]}$, and $\tilde{r} \in \mathbb{R}^9$ is obtained by flattening R.
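A sketch of the regression step in equations (5)-(7): take the N = 4 right singular vectors of A with the smallest singular values, alternately solve for the coefficients λ_i and the rotation R, then recover t from Ax = 0. The alternation schedule, the initialization, the det-sign correction in the SO(3) projection, and the row-major vec(R) ordering are assumptions of this sketch:

```python
import numpy as np

def regress_pose(A, n_bases=4, n_iters=10):
    """Recover (R, t) from the linear system A x = 0 (equations (5)-(7))."""
    _, _, Vt = np.linalg.svd(A)
    V = Vt[-n_bases:][::-1]        # right singular vectors of the N smallest singular values
    Vr = V[:, :9]                  # first 9 elements of each v_i (rotation part), shape (N, 9)

    lam = np.zeros(n_bases)
    lam[0] = 1.0                   # initialize with the smallest singular vector (the noise-free solution)
    for _ in range(n_iters):
        # Fix lambda: project sum_i lambda_i * v_i onto SO(3) via SVD.
        M = (lam @ Vr).reshape(3, 3)
        U, _, Wt = np.linalg.svd(M)
        R = U @ np.diag([1.0, 1.0, np.linalg.det(U @ Wt)]) @ Wt
        # Fix R: solve the least-squares problem of equation (6) for lambda.
        lam, *_ = np.linalg.lstsq(Vr.T, R.reshape(9), rcond=None)

    # Equation (7): t = -(A2^T A2)^{-1} A2^T A1 r, with A1 = A[:, :9], A2 = A[:, 9:].
    A1, A2 = A[:, :9], A[:, 9:]
    t = -np.linalg.solve(A2.T @ A2, A2.T @ A1 @ R.reshape(9))
    return R, t
```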
Experimental verification of the invention shows the following.

Fig. 3 shows the three intermediate representation results of the pose estimation and the 6D pose estimation results for single target objects: (a) shows the input image; (b), (c), and (d) show the predicted keypoints, the edge vectors, and the corresponding symmetry maps respectively; (e) shows the 6D pose computed by regression, where the green box marks the ground-truth 6D pose of the object and the blue box marks the 6D pose predicted by the invention. Fig. 4 shows the 6D pose estimation results under occlusions of different proportions. The method predicts the object's 6D pose accurately even in complex environments and under occlusion. The invention evaluates its 6D pose estimation network on two popular benchmark datasets, Linemod and Occlusion Linemod, and compares it with current state-of-the-art 6D pose estimation algorithms (such as Pix2Pose, PVNet, PoseCNN, and CDPN). As shown in Tables 1 and 2, the average ADD(-S) accuracy of the invention is 92.8 on the Linemod dataset and 48.0 on the Occlusion Linemod dataset.

As can be seen from Figs. 3 and 4 and Tables 1 and 2, the invention achieves accurate pose estimation, and its results on Linemod and Occlusion Linemod surpass most state-of-the-art methods. The method outperforms its keypoint-predicting backbone model PVNet, with improved performance on every object class, showing that using multiple intermediate representations is clearly superior to using keypoints alone. On the occlusion problem, the method improves on Pix2Pose by 16%, clearly demonstrating the advantage of multiple intermediate representations in predicting the 6D pose of occluded objects.
TABLE 1. ADD(-S) performance comparison on the Linemod dataset (table reproduced as an image in the original publication).

TABLE 2. ADD(-S) performance comparison on the Occlusion Linemod dataset (table reproduced as an image in the original publication).
To verify the performance of the attention module, the invention tests the training behavior before and after the attention module is added, on the same dataset; Fig. 5 shows the overall loss curves before and after the improvement.

As can be seen from Fig. 5, adding the attention module to the backbone network improves the final training result of the object pose estimation: during training the loss converges faster and reaches a lower value. The experimental results show that with the improved ResNet-18 as the base neural network, the ADD(-S) score improves by 3.1% on average on the Linemod dataset and by 1.4% on average on the Occlusion Linemod dataset, allowing the object's 6D pose to be estimated more accurately.

In summary, the invention provides a single-view pose estimation method and system based on multi-modal input and an attention mechanism, where the method comprises two modules. The first module expresses the geometric information in the RGB image with multiple intermediate representations and introduces an attention mechanism to improve network training efficiency. Besides predicting keypoints, this module predicts the edge vectors between neighboring keypoints; in addition, the invention predicts dense pixel-wise correspondences to capture the underlying symmetry between pixels. Compared with methods that use only a keypoint representation, a pose estimation method that fuses multiple intermediate representations provides more constraints and achieves accurate pose prediction under occlusion, shadow, and similar conditions. The second module takes the intermediate representations obtained by the first module, combines the keypoint, edge-vector, and dense pixel-wise correspondence information, and regresses the object's 6D pose from them through EPnP computation and singular value decomposition. The technique is applicable to many fields, including integrated ship support, automatic sorting, autonomous driving, medical treatment, virtual reality, augmented reality, and industrial manufacturing; it can accurately acquire an object's 6D pose from an image, provides technical support for the intelligentization and informatization of industrial manufacturing, and has broad market prospects.

Claims (6)

1. A single-view pose estimation method based on multi-modal input and an attention mechanism, characterized in that a single-view pose estimation system comprising a prediction module and a pose regression module is constructed, and multi-modal input is combined with attention-based feature enhancement to learn multiple intermediate representations of an object from a two-dimensional image and then regress the object's 6D pose, comprising the following steps:

step 1, the prediction module expresses the geometric information in the RGB image with multiple intermediate representations and introduces an attention mechanism to improve network training efficiency; the intermediate representations comprise the keypoints $\kappa$, the edge vectors $\varepsilon$, and the dense pixel-wise symmetry correspondences $S$ of the RGB image;

the prediction module comprises a first prediction network $F_\kappa$, a second prediction network $F_\varepsilon$, and a third prediction network $F_S$; a channel attention module is embedded between the down-sampling modules of each of the three prediction networks; $F_\kappa$ uses PVNet as its backbone and is a keypoint-based pose estimation network that predicts the k keypoints, visible and invisible, with a voting method; the prediction networks $F_\varepsilon$ and $F_S$ are used to refine the object pose;

step 2, the pose regression module takes the intermediate representations obtained by the prediction module, combines the keypoint, edge-vector, and dense pixel-wise correspondence information, and regresses the object's 6D pose from them through EPnP computation and singular value decomposition;

the pose regression module takes the intermediate representations $\{\kappa, \varepsilon, S\}$ predicted by the networks of the prediction module as input and outputs the 6D pose of object $I$: $R_I \in SO(3)$, $t_I \in \mathbb{R}^3$.
2. The single-view pose estimation method based on multi-modal input and an attention mechanism according to claim 1, wherein step 1 comprises:

constructing a fully connected graph $\varepsilon$ with the keypoints as nodes; $F_\varepsilon$ is the network that predicts the edge vectors along the edges of this graph, adopting ResNet-18 as its backbone; $\varepsilon$ explicitly expresses the displacement between each pair of keypoints, and $|\varepsilon|$ denotes the number of edges in the predefined graph; since the graph is fully connected, $|\varepsilon| = |\kappa|(|\kappa|-1)/2$;

$F_S$ predicts a third intermediate representation $S$ that records the pixel symmetry correspondences and reflects the latent reflection symmetry of the object; $F_S$ extends the FlowNet architecture, fusing the dense pixel flow predicted by FlowNet with the mask map predicted by PVNet, and predicts the symmetry correspondence of every pixel inside the mask region;

the loss $l_1$ of $F_\kappa$, the loss $l_2$ of $F_\varepsilon$, and the loss $l_3$ of $F_S$ are all trained with the smooth-$L_1$ loss of Fast R-CNN; to reflect how strongly the different intermediate representations affect the pose estimation network, the losses of the three intermediate representations are weighted by parameters x, y, and z, where x + y + z = 1, so the total loss is:

L = x·l_1 + y·l_2 + z·l_3    (1)

channel attention module: the output $F \in \mathbb{R}^{C\times H\times W}$ of each residual block of $F_\kappa$ and $F_\varepsilon$ is taken as the input of the attention module; an average pooling operation is performed, and the channel attention module generates channel weights of dimension $\mathbb{R}^{C\times 1\times 1}$ by a one-dimensional convolution with kernel size n = 5; the weights pass through an activation function and are restored to the input dimensions to obtain the result F′, which serves as the input of the next residual block.
3. The single-view pose estimation method based on multi-modal input and an attention mechanism according to claim 1, wherein step 2 comprises:

expressing the ground-truth three-dimensional keypoint coordinates in the canonical coordinate system as $P_k \in \mathbb{R}^3$, $1 \le k \le |\kappa|$, and the ground-truth edge vectors as $V_e$, $1 \le e \le |\varepsilon|$; expressing the keypoint coordinates output by the prediction module as $p_k$, $1 \le k \le |\kappa|$, the edge vectors as $\nu_e$, $1 \le e \le |\varepsilon|$, and the symmetry correspondences as $(q_{s,1}, q_{s,2})$, $1 \le s \le |S|$; for ease of calculation, homogeneous coordinates $\bar{p}_k$, $\bar{\nu}_e$, $\bar{q}_{s,1}$, and $\bar{q}_{s,2}$ corresponding to $p_k$, $\nu_e$, $q_{s,1}$, and $q_{s,2}$ are used; these homogeneous coordinates are normalized by the known camera parameters;

computing the object's 6D pose with the EPnP algorithm combined with the constraints of the intermediate representations; first, the following difference vectors are introduced for the three prediction elements:

$r_k^{\kappa} = \bar{p}_k \times (R P_k + t), \quad 1 \le k \le |\kappa|$    (2)

$r_e^{\varepsilon} = \bar{\nu}_e \times R\,(P_{e_s} - P_{e_t}), \quad 1 \le e \le |\varepsilon|$    (3)

$r_s^{S} = (\bar{q}_{s,1} \times \bar{q}_{s,2})^{\top} R\, n, \quad 1 \le s \le |S|$    (4)

where $e_s$ and $e_t$ are the endpoints of edge e, and $n$ is the normal of the reflection symmetry plane in the canonical coordinate system;

next, equation (2) is rewritten in the forms $A_1 x$, $A_2 x$, $A_3 x$; similarly, equation (3) is rewritten as $A_4 x$, $A_5 x$, $A_6 x$, and equation (4) as $A_7 x$; $A_1, \dots, A_7$ are stacked to form A; to describe the relationship between the predicted and ground-truth values, a linear system of the form Ax = 0 is introduced, where A is a matrix of dimension $(3|\kappa| + 3|\varepsilon| + |S|) \times 12$, and x is a vector containing the parameters of the rotation matrix R and the translation vector t in affine space;

next, x is computed using the EPnP algorithm:

$x = \sum_{i=1}^{N} \lambda_i v_i$    (5)

where $v_i$ is the right singular vector corresponding to the i-th smallest singular value of A; ideally, when the prediction elements are noise-free, N = 1 and $x = v_1$ is the optimal solution; the same N = 4 as in EPnP is selected; to compute the optimal x, the hidden variables $\lambda_i$ and the rotation matrix R are optimized in an alternating optimization procedure with the following objective function:

$\min_{\{\lambda_i\}, R} \Big\| \sum_{i=1}^{N} \lambda_i \bar{v}_i - \mathrm{vec}(R) \Big\|^2$    (6)

where $\bar{v}_i$ contains the first 9 elements of $v_i$; after the optimal $\lambda_i$ are obtained, SVD is applied to project $\sum_i \lambda_i \bar{v}_i$ (reshaped to $3\times 3$) onto SO(3), giving the rotation matrix $R = U\,\mathrm{diag}(1, 1, \det(UV^{\top}))\,V^{\top}$; finally, the corresponding translation vector t is obtained from Ax = 0:

$t = -\big(A_2^{\top} A_2\big)^{-1} A_2^{\top} A_1\, \tilde{r}$    (7)

where $A_1 = A_{[:,1:9]}$, $A_2 = A_{[:,10:12]}$, and $\tilde{r} \in \mathbb{R}^9$ is obtained by flattening R.
4. A single-view pose estimation system based on multi-modal input and an attention mechanism, comprising:

a prediction module, for expressing the geometric information in the RGB image with multiple intermediate representations and introducing an attention mechanism to improve network training efficiency; comprising a first prediction network $F_\kappa$, a second prediction network $F_\varepsilon$, and a third prediction network $F_S$, with a channel attention module embedded between the down-sampling modules of each of the three prediction networks; $F_\kappa$ uses PVNet as its backbone and is a keypoint-based pose estimation network that predicts the k keypoints, visible and invisible, with a voting method; the prediction networks $F_\varepsilon$ and $F_S$ are used to refine the object pose;

a pose regression module, for taking the intermediate representations obtained by the prediction module, combining the keypoint, edge-vector, and dense pixel-wise correspondence information, and regressing the object's 6D pose from them through EPnP computation and singular value decomposition; the pose regression module takes the intermediate representations $\{\kappa, \varepsilon, S\}$ predicted by the networks of the prediction module as input and outputs the 6D pose of object $I$: $R_I \in SO(3)$, $t_I \in \mathbb{R}^3$.
5. The single-view pose estimation system based on multi-modal input and an attention mechanism according to claim 4, wherein, in the prediction module:

$F_\kappa$ uses PVNet as its backbone and is a keypoint-based pose estimation network; a fully connected graph is constructed with the keypoints as nodes, and $F_\varepsilon$ is the network that predicts the edge vectors along the edges of this graph, adopting ResNet-18 as its backbone; $\varepsilon$ explicitly expresses the displacement between each pair of keypoints, and $|\varepsilon|$ denotes the number of edges in the predefined graph; since $\varepsilon$ is a fully connected graph, $|\varepsilon| = |\kappa|(|\kappa|-1)/2$;

$F_S$ predicts a third intermediate representation $S$ that records the pixel symmetry correspondences and reflects the latent reflection symmetry of the object; $F_S$ extends the FlowNet architecture, fusing the dense pixel flow predicted by FlowNet with the mask map predicted by PVNet, and predicts the symmetry correspondence of every pixel inside the mask region;

the loss $l_1$ of $F_\kappa$, the loss $l_2$ of $F_\varepsilon$, and the loss $l_3$ of $F_S$ are all trained with the smooth-$L_1$ loss of Fast R-CNN; to reflect how strongly the different intermediate representations affect the pose estimation network, the losses of the three intermediate representations are weighted by parameters x, y, and z, where x + y + z = 1; the total loss is therefore:

L = x·l_1 + y·l_2 + z·l_3    (1)

channel attention module: the output $F \in \mathbb{R}^{C\times H\times W}$ of each residual block of $F_\kappa$ and $F_\varepsilon$ is taken as the input of the attention module; an average pooling operation is performed, and the channel attention module generates channel weights of dimension $\mathbb{R}^{C\times 1\times 1}$ by a one-dimensional convolution with kernel size n = 5; the weights pass through an activation function and are restored to the input dimensions to obtain the result F′, which serves as the input of the next residual block.
6. The single-view pose estimation system based on multi-modal input and an attention mechanism according to claim 4, wherein, in the pose regression module:

the ground-truth three-dimensional keypoint coordinates in the canonical coordinate system are expressed as $P_k \in \mathbb{R}^3$, $1 \le k \le |\kappa|$, and the ground-truth edge vectors as $V_e$, $1 \le e \le |\varepsilon|$; the keypoint coordinates output by the prediction module are expressed as $p_k$, $1 \le k \le |\kappa|$, the edge vectors as $\nu_e$, $1 \le e \le |\varepsilon|$, and the symmetry correspondences as $(q_{s,1}, q_{s,2})$, $1 \le s \le |S|$; for ease of calculation, homogeneous coordinates $\bar{p}_k$, $\bar{\nu}_e$, $\bar{q}_{s,1}$, and $\bar{q}_{s,2}$ corresponding to $p_k$, $\nu_e$, $q_{s,1}$, and $q_{s,2}$ are used; these homogeneous coordinates are normalized by the known camera parameters;

the object's 6D pose is computed with the EPnP algorithm combined with the constraints of the intermediate representations; first, the following difference vectors are introduced for the three prediction elements:

$r_k^{\kappa} = \bar{p}_k \times (R P_k + t), \quad 1 \le k \le |\kappa|$    (2)

$r_e^{\varepsilon} = \bar{\nu}_e \times R\,(P_{e_s} - P_{e_t}), \quad 1 \le e \le |\varepsilon|$    (3)

$r_s^{S} = (\bar{q}_{s,1} \times \bar{q}_{s,2})^{\top} R\, n, \quad 1 \le s \le |S|$    (4)

where $e_s$ and $e_t$ are the endpoints of edge e, and $n$ is the normal of the reflection symmetry plane in the canonical coordinate system;

next, equation (2) is rewritten in the forms $A_1 x$, $A_2 x$, $A_3 x$; similarly, equation (3) is rewritten as $A_4 x$, $A_5 x$, $A_6 x$, and equation (4) as $A_7 x$; $A_1, \dots, A_7$ are stacked to form A; to describe the relationship between the predicted and ground-truth values, a linear system of the form Ax = 0 is introduced, where A is a matrix of dimension $(3|\kappa| + 3|\varepsilon| + |S|) \times 12$, and x is a vector containing the parameters of the rotation matrix R and the translation vector t in affine space;

next, x is computed using the EPnP algorithm:

$x = \sum_{i=1}^{N} \lambda_i v_i$    (5)

where $v_i$ is the right singular vector corresponding to the i-th smallest singular value of A; ideally, when the prediction elements are noise-free, N = 1 and $x = v_1$ is the optimal solution; the same N = 4 as in EPnP is selected; to compute the optimal x, the hidden variables $\lambda_i$ and the rotation matrix R are optimized in an alternating optimization procedure with the following objective function:

$\min_{\{\lambda_i\}, R} \Big\| \sum_{i=1}^{N} \lambda_i \bar{v}_i - \mathrm{vec}(R) \Big\|^2$    (6)

where $\bar{v}_i$ contains the first 9 elements of $v_i$; after the optimal $\lambda_i$ are obtained, SVD is applied to project $\sum_i \lambda_i \bar{v}_i$ (reshaped to $3\times 3$) onto SO(3), giving the rotation matrix $R = U\,\mathrm{diag}(1, 1, \det(UV^{\top}))\,V^{\top}$; finally, the corresponding translation vector t is obtained from Ax = 0:

$t = -\big(A_2^{\top} A_2\big)^{-1} A_2^{\top} A_1\, \tilde{r}$    (7)

where $A_1 = A_{[:,1:9]}$, $A_2 = A_{[:,10:12]}$, and $\tilde{r} \in \mathbb{R}^9$ is obtained by flattening R.
CN202211380719.3A 2022-11-04 2022-11-04 Single-view pose estimation method and system based on multi-modal input and attention mechanism Pending CN115861418A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211380719.3A CN115861418A (en) 2022-11-04 2022-11-04 Single-view attitude estimation method and system based on multi-mode input and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211380719.3A CN115861418A (en) 2022-11-04 2022-11-04 Single-view pose estimation method and system based on multi-modal input and attention mechanism

Publications (1)

Publication Number Publication Date
CN115861418A true CN115861418A (en) 2023-03-28

Family

ID=85662588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211380719.3A Pending CN115861418A (en) 2022-11-04 2022-11-04 Single-view attitude estimation method and system based on multi-mode input and attention mechanism

Country Status (1)

Country Link
CN (1) CN115861418A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117237451A (en) * 2023-09-15 2023-12-15 南京航空航天大学 Industrial part 6D pose estimation method based on contour reconstruction and geometric guidance
CN117237451B (en) * 2023-09-15 2024-04-02 南京航空航天大学 Industrial part 6D pose estimation method based on contour reconstruction and geometric guidance
CN117351157A (en) * 2023-12-05 2024-01-05 北京渲光科技有限公司 Single-view three-dimensional scene pose estimation method, system and equipment
CN117351157B (en) * 2023-12-05 2024-02-13 北京渲光科技有限公司 Single-view three-dimensional scene pose estimation method, system and equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination