CN112561995A - Real-time efficient 6D pose estimation network, construction method and estimation method - Google Patents

Real-time efficient 6D pose estimation network, construction method and estimation method

Info

Publication number
CN112561995A
Authority
CN
China
Prior art keywords
network
real-time efficient
pose estimation
LINEMOD
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011430902.0A
Other languages
Chinese (zh)
Other versions
CN112561995B (en)
Inventor
Penglei Liu (刘鹏磊)
Qieshi Zhang (张锲石)
Jun Cheng (程俊)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202011430902.0A
Publication of CN112561995A
Application granted
Publication of CN112561995B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging

Abstract

The invention discloses a real-time, efficient 6D pose estimation network, together with a construction method and an estimation method, belonging to the technical field of computer vision and relating to the field of 6D pose estimation. A multidirectional feature fusion pyramid network MFPN is used to fuse and express features, so that multi-scale features can be expressed and processed effectively and cases of occlusion and complex backgrounds can be handled. With the cross-stage partial network CSPNet as a basic module and the YOLO framework integrated, a backbone network capable of effectively extracting features is constructed and then combined with the MFPN, yielding a new network, MFPN-6D, for 6D pose estimation. The network effectively handles objects with insufficient texture and occlusion, improves the prediction accuracy and computation speed of the model, and enhances robustness.

Description

Real-time efficient 6D pose estimation network, construction method and estimation method
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a real-time, efficient 6D pose estimation network, a construction method and an estimation method.
Background
6D pose estimation refers to estimating the 6D pose of an object, i.e. its 3D position and 3D orientation, in the camera coordinate system. The object's own coordinate system can be regarded as the world coordinate system, so the task amounts to recovering the rotation-translation transformation [R|t] from the world system in which the object sits to the camera system. "Rigid" means that the object does not deform; the significance of 6D pose estimation of a rigid body is that the accurate pose of the object can be obtained to support fine manipulation, with the main applications being robotic grasping and augmented reality. The latest research trend in 6D pose estimation is to train a deep neural network to directly predict the 2D projection positions of 3D keypoints from an image, establish the 2D-3D correspondences, and finally estimate the pose with a Perspective-n-Point (PnP) algorithm. The current challenges are that detection accuracy drops when the object has little texture or when occlusion and scene clutter are present, and that most existing models are computationally heavy and cannot meet real-time requirements.
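To make that final step concrete, the following is a minimal sketch of recovering the pose from such 2D-3D correspondences with OpenCV's solvePnP; the 3D keypoints, predicted 2D projections, and camera intrinsics are hypothetical placeholders standing in for a network's output, not values from the invention.

```python
import numpy as np
import cv2

# Hypothetical 3D keypoints on the object model (e.g., bounding-box corners),
# expressed in the object's own (world) coordinate system, in meters.
object_points = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0],
                          [0.0, 0.1, 0.0], [0.0, 0.0, 0.1],
                          [0.1, 0.1, 0.0], [0.1, 0.0, 0.1]], dtype=np.float64)

# Hypothetical 2D projections of those keypoints, as a network might predict.
image_points = np.array([[320.0, 240.0], [400.0, 238.0], [322.0, 160.0],
                         [318.0, 242.0], [402.0, 158.0], [398.0, 236.0]],
                        dtype=np.float64)

# Hypothetical pinhole camera intrinsic matrix; no lens distortion assumed.
K = np.array([[572.4, 0.0, 325.3],
              [0.0, 573.6, 242.0],
              [0.0, 0.0, 1.0]], dtype=np.float64)

# PnP recovers the rigid transform from the world system to the camera system.
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None)
R, _ = cv2.Rodrigues(rvec)  # axis-angle rotation -> 3x3 rotation matrix
print("R =\n", R, "\nt =\n", tvec)
```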
The 6D pose estimation methods of the related art mainly fall into two types: those based on depth information (RGB-D) and those based on image information (RGB). Although current methods that use RGB-D cameras are reliable, depth cameras are only suitable for indoor scenes and consume considerable power. In contrast, RGB cameras suit a wider range of scenes and save power. In the image-based field, 6D object pose estimation algorithms include keypoint matching and edge matching; although these effectively handle richly textured objects, they fail on objects with no or little texture. To solve this problem, deep-learning-based approaches have recently been applied to pose estimation, for example BB8 and PVNet, which predict 2D-3D correspondences by training a deep neural network and then solve the pose with the PnP algorithm. Although they achieve good accuracy, these methods require a post-processing stage and therefore struggle to meet real-time requirements. Some algorithms, such as YOLO-6D, achieve good results in terms of speed, but they work poorly on occluded objects and small objects.
Therefore, the related art has two shortcomings with respect to 6D pose estimation: when the target object has little texture, or when occlusion and complex scenes are present, detection accuracy drops and detection may even fail entirely; and most existing methods require a large number of parameters, so the models are large and mostly cannot meet real-time requirements.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a real-time, efficient 6D pose estimation network, a construction method and an estimation method, which can effectively handle insufficient texture on the object surface or occlusion of the target object by other objects, improve detection accuracy and speed, and offer high robustness.
In order to achieve the above purpose, the invention provides a real-time, efficient 6D pose estimation network, which includes a multidirectional feature fusion pyramid network and a backbone network; the two are combined to form the 6D pose estimation network, with the multidirectional feature fusion pyramid network used for fusing and expressing features and the backbone network used for feature extraction.
Further, the multidirectional feature fusion pyramid network comprises a residual structure, and the residual structure is fused into forward propagation and vertical propagation of the multidirectional feature fusion pyramid network.
Further, the backbone network takes the CSPNet network as a basic module and integrates the YOLO framework.
Further, the total data set of the 6D pose estimation network comprises the LINEMOD standard data set and the Occluded-LINEMOD standard data set, and the 6D pose estimation network is trained and validated on the LINEMOD standard data set and the Occluded-LINEMOD standard data set.
Further, the LINEMOD standard data set includes 13 sequences, each sequence containing the true pose of a single object in a cluttered environment and providing CAD models of all objects; the Occluded-LINEMOD standard data set is a data set containing a plurality of target objects with occlusion.
Further, the total data set of the 6D pose estimation network includes a training set and a test set, wherein the training set accounts for 20% of the total data set and the test set accounts for 80% of the total data set.
Further, the 6D pose estimation network operates at a speed of 56 FPS.
The invention also provides a method for constructing the real-time, efficient 6D pose estimation network, which comprises the following steps: first, fuse a residual structure into forward propagation and vertical propagation to establish the multidirectional feature fusion pyramid network; then, take the CSPNet network as a basic module and integrate the YOLO framework to establish the backbone network; finally, combine the multidirectional feature fusion pyramid network and the backbone network to form the 6D pose estimation network.
Further, in the construction method, the 6D pose estimation network is trained and validated on the LINEMOD standard data set and the Occluded-LINEMOD standard data set.
The invention also provides a 6D pose estimation method, which adopts the above real-time, efficient 6D pose estimation network.
Compared with the prior art, the method can solve the 6D pose estimation problem for rigid bodies. The multidirectional feature fusion pyramid network MFPN is used to fuse and express features; it can effectively express and process multi-scale features and handle occlusion and complex backgrounds. The cross-stage partial network CSPNet is used as a basic module and integrated with the YOLO framework to construct a backbone network capable of effectively extracting features, which is then combined with the MFPN to yield a new network, MFPN-6D, for 6D pose estimation. The network effectively handles objects with insufficient texture and occlusion, improves the prediction accuracy and computation speed of the model, and enhances robustness.
Drawings
FIG. 1 is a schematic diagram of the 6D pose estimation neural network MFPN-6D of the present invention;
FIG. 2a is a schematic diagram of the feature pyramid network FPN; FIG. 2b is a schematic diagram of the PANet network; FIG. 2c is a schematic diagram of the BiFPN network; FIG. 2d is a schematic diagram of the multidirectional feature fusion pyramid network MFPN of the present invention.
Detailed Description
The present invention will be further explained below with reference to the drawings and specific embodiments. It should be understood that the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the given embodiments without creative effort shall fall within the protection scope of the present application.
Referring to FIG. 1, an embodiment of the present invention provides a real-time, efficient 6D pose estimation network MFPN-6D, which includes a multidirectional feature fusion pyramid network MFPN and a backbone network; the two are combined to form the 6D pose estimation network MFPN-6D, with the MFPN used for fusing and expressing features and the backbone network used for feature extraction. The MFPN includes a residual structure, which is fused into its forward propagation and vertical propagation. The backbone network takes the cross-stage partial network CSPNet as a basic module and integrates the YOLO framework.
The total data set of the 6D pose estimation network MFPN-6D comprises the LINEMOD standard data set and the Occluded-LINEMOD standard data set, on which the network is trained and validated. The LINEMOD standard data set consists of 13 sequences, each containing the true pose of a single target in a cluttered environment and providing a CAD model of the target; the Occluded-LINEMOD standard data set contains multiple target objects with occlusion. The total data set is divided into a training set and a test set, with the training set accounting for 20% of the total and the test set 80%. The 6D pose estimation network MFPN-6D runs at 56 FPS and is currently the fastest method in the field of 6D pose estimation.
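As an illustration of the split just described, the following sketch partitions one LINEMOD sequence's image list into a 20% training set and an 80% test set; the directory layout, file pattern, and function name are assumptions for the example, not the patent's tooling.

```python
import random
from pathlib import Path

def split_sequence(image_dir, train_ratio=0.2, seed=0):
    """Split one LINEMOD object sequence into ~20% train / ~80% test images."""
    images = sorted(Path(image_dir).glob("*.png"))
    random.Random(seed).shuffle(images)
    cut = int(len(images) * train_ratio)
    return images[:cut], images[cut:]

# Hypothetical path to one of the 13 LINEMOD object sequences.
train_imgs, test_imgs = split_sequence("LINEMOD/ape/rgb")
```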
One of the main difficulties in 6D pose estimation is the efficient representation and processing of multi-scale features. As shown in FIG. 2a, the feature pyramid network FPN introduces a top-down path to combine multi-scale features, but FPN is inherently limited by its unidirectional information flow. To address this, PANet adds a bottom-up path aggregation network on top of FPN, as shown in FIG. 2b. PANet is accurate but requires more parameters and computation. To improve model efficiency, Google researchers proposed the BiFPN network, shown in FIG. 2c, an effective bidirectional cross-scale connection and weighted feature fusion network; BiFPN is more accurate and less costly than PANet. BiFPN is among the most advanced feature networks, but it only considers forward feature propagation and ignores vertical propagation, so features are lost when propagating in the vertical direction and not all feature information can be used effectively.
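To ground this comparison, here is a minimal PyTorch sketch of the weighted feature fusion used in BiFPN-style nodes (the "fast normalized fusion" of EfficientDet, cited in the non-patent literature below); the shapes and names are illustrative, and this is the baseline that the MFPN extends, not the patent's own code.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """BiFPN-style fast normalized fusion: same-resolution feature maps are
    combined with learnable, non-negative, normalized scalar weights."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, features):
        w = torch.relu(self.weights)   # keep fusion weights non-negative
        w = w / (w.sum() + self.eps)   # normalize so the weights sum to ~1
        return sum(wi * fi for wi, fi in zip(w, features))

# Fuse a top-down feature with a lateral feature of the same shape.
fusion = WeightedFusion(num_inputs=2)
p4 = fusion([torch.randn(1, 64, 40, 40), torch.randn(1, 64, 40, 40)])
```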
To process and represent multi-scale features more efficiently, the idea of a residual network is applied in the 6D pose estimation network MFPN-6D of the present invention. A residual structure is fused into both the forward propagation and the vertical propagation, yielding the multidirectional feature fusion pyramid network MFPN, shown in FIG. 2d. On the basis of BiFPN, forward residual structures and residual structures in the vertical direction are added; the resulting MFPN improves feature utilization in both forward and vertical propagation and represents and processes multi-scale features more effectively.
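The patent does not give the exact MFPN fusion equations, so the following is only a rough sketch, under stated assumptions, of the idea just described: residual shortcuts in both the forward (same-level) and vertical (cross-level) directions on top of a BiFPN-like fusion node. Every module name, the channel width, and the fusion order are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFPNNode(nn.Module):
    """Illustrative MFPN-style node: fuse a same-level (forward) input with a
    resized neighboring pyramid level (vertical), with residual shortcuts on
    both paths so information survives propagation in both directions."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, lateral, neighbor):
        # Vertical residual: resize the neighboring level to this level's
        # resolution and add it, rather than letting it be consumed and lost.
        neighbor = F.interpolate(neighbor, size=lateral.shape[-2:], mode="nearest")
        fused = self.bn(self.conv(lateral + neighbor))
        # Forward residual: a same-level shortcut around the fusion conv.
        return F.relu(fused + lateral)

node = MFPNNode(64)
out = node(torch.randn(1, 64, 40, 40), torch.randn(1, 64, 20, 20))
```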
In designing the backbone network, the invention adopts the advanced cross-stage partial network CSPNet as the basic module and integrates the idea of the YOLO network framework to design the final feature extraction backbone. Combined with the multidirectional feature fusion pyramid network MFPN, this forms the neural network MFPN-6D for 6D pose estimation, as shown in FIG. 1: the backbone network built on the CSPNet structure efficiently extracts features from images, the MFPN combined with the backbone serves as the neck network, and a YOLO network serves as the final detection network. The resulting network performs 6D pose estimation on objects efficiently and accurately, and runs at 56 FPS, currently the fastest in the field of 6D pose estimation.
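For context on the backbone's basic module, here is a minimal sketch of a CSPNet-style block of the kind such a backbone stacks: the input channels are split, one part passes through a small convolutional stack while the other crosses the stage unchanged, and the two parts are concatenated and merged. The widths, depths, and activation choice are illustrative assumptions, not the patent's configuration.

```python
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    """Cross-stage partial block: half the channels are processed, half are
    carried across the stage, and both halves are concatenated and merged."""
    def __init__(self, channels, num_blocks=2):
        super().__init__()
        half = channels // 2
        self.split_a = nn.Conv2d(channels, half, 1)   # processed branch
        self.split_b = nn.Conv2d(channels, half, 1)   # cross-stage branch
        self.blocks = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(half, half, 3, padding=1),
                          nn.BatchNorm2d(half),
                          nn.SiLU())
            for _ in range(num_blocks)])
        self.merge = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        a = self.blocks(self.split_a(x))
        b = self.split_b(x)
        return self.merge(torch.cat([a, b], dim=1))

block = CSPBlock(64)
y = block(torch.randn(1, 64, 80, 80))  # output shape preserved: (1, 64, 80, 80)
```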
The aim of the invention is 6D pose estimation that is efficient, fast, and able to handle the occlusion problem effectively. First, the multidirectional feature fusion pyramid network MFPN is designed so that features can be fused and expressed effectively; then a backbone network for feature extraction is designed with CSPNet as the basic module and the YOLO framework integrated; finally, the backbone network and the MFPN are combined to form the 6D pose estimation network MFPN-6D.
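Putting the three steps together, the overall composition can be sketched as follows; the three submodules stand in for the structures described above, and the interface is an assumption made for illustration.

```python
import torch.nn as nn

class MFPN6D(nn.Module):
    """Illustrative composition of the described pipeline:
    CSPNet-based backbone -> MFPN neck -> YOLO-style detection head."""
    def __init__(self, backbone, mfpn, head):
        super().__init__()
        self.backbone = backbone  # multi-scale feature extraction
        self.mfpn = mfpn          # multidirectional feature fusion (neck)
        self.head = head          # keypoint/pose predictions

    def forward(self, image):
        features = self.backbone(image)
        fused = self.mfpn(features)
        return self.head(fused)

# Placeholder submodules just to show the wiring; real modules would follow
# the CSP backbone, MFPN, and YOLO head structures described above.
model = MFPN6D(nn.Identity(), nn.Identity(), nn.Identity())
```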
The multidirectional feature fusion pyramid network MFPN can effectively represent and process multi-scale features and handle occlusion and complex backgrounds. The 6D pose estimation network MFPN-6D built on it can estimate the pose of a target object quickly and accurately. Compared with other methods, it is superior in efficiency, speed, and robustness; it effectively handles insufficient surface texture on an object and occlusion of the target object by other objects, and improves detection accuracy while maintaining speed.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A real-time efficient 6D pose estimation network, characterized by comprising a multidirectional feature fusion pyramid network and a backbone network, wherein the multidirectional feature fusion pyramid network and the backbone network are combined to form the 6D pose estimation network, the multidirectional feature fusion pyramid network is used for fusing and expressing features, and the backbone network is used for feature extraction.
2. The real-time efficient 6D pose estimation network of claim 1, wherein the multidirectional feature fusion pyramid network comprises a residual structure that is fused into a forward propagation and a vertical propagation of the multidirectional feature fusion pyramid network.
3. The real-time efficient 6D pose estimation network of claim 2, wherein the backbone network takes the CSPNet network as a basic module and integrates the YOLO framework.
4. The real-time efficient 6D pose estimation network of claim 1, wherein the total dataset of the 6D pose estimation network comprises a LINEMOD standard dataset and an Occluded-LINEMOD standard dataset, and wherein the 6D pose estimation network is trained and validated on the LINEMOD standard dataset and the Occluded-LINEMOD standard dataset.
5. The real-time efficient 6D pose estimation network of claim 4, wherein the LINEMOD standard dataset comprises 13 sequences, each sequence containing the true pose of a single object in a cluttered environment and providing CAD models of all objects; the Occluded-LINEMOD standard dataset is a dataset containing a plurality of target objects with occlusion.
6. The real-time efficient 6D pose estimation network according to claim 5, wherein a total data set of the 6D pose estimation network comprises a training set and a test set, wherein the training set accounts for 20% of the total data set, and wherein the test set accounts for 80% of the total data set.
7. The real-time efficient 6D pose estimation network of claim 1, wherein the 6D pose estimation network operates at 56 FPS.
8. A method for constructing the real-time efficient 6D pose estimation network according to any of claims 1 to 7, comprising: first, fusing a residual structure into forward propagation and vertical propagation to establish the multidirectional feature fusion pyramid network; then, taking the CSPNet network as a basic module and integrating the YOLO framework to establish the backbone network; and finally, combining the multidirectional feature fusion pyramid network and the backbone network to form the 6D pose estimation network.
9. The method of claim 8, wherein the 6D pose estimation network is trained and validated on a LINEMOD standard dataset and an Occluded-LINEMOD standard dataset.
10. A 6D pose estimation method, employing the real-time efficient 6D pose estimation network according to any of claims 1 to 7.
CN202011430902.0A 2020-12-09 2020-12-09 Real-time and efficient 6D pose estimation network, construction method and estimation method Active CN112561995B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011430902.0A CN112561995B (en) Real-time and efficient 6D pose estimation network, construction method and estimation method


Publications (2)

Publication Number Publication Date
CN112561995A (en) 2021-03-26
CN112561995B CN112561995B (en) 2024-04-23

Family

ID=75060013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011430902.0A Active CN112561995B (en) Real-time and efficient 6D pose estimation network, construction method and estimation method

Country Status (1)

Country Link
CN (1) CN112561995B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101473439A (en) * 2006-04-17 2009-07-01 全视Cdm光学有限公司 Arrayed imaging systems and associated methods
US20150146029A1 (en) * 2013-11-26 2015-05-28 Pelican Imaging Corporation Array Camera Configurations Incorporating Multiple Constituent Array Cameras
CN110533721A (en) * 2019-08-27 2019-12-03 杭州师范大学 A kind of indoor objects object 6D Attitude estimation method based on enhancing self-encoding encoder
CN111145253A (en) * 2019-12-12 2020-05-12 深圳先进技术研究院 Efficient object 6D attitude estimation algorithm
CN111968235A (en) * 2020-07-08 2020-11-20 杭州易现先进科技有限公司 Object attitude estimation method, device and system and computer equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
""基于3D多视图的物体识别及姿态估计方法"", 《中国知网 硕士电子期刊》, no. 08, pages 5 *
MINGXING TAN 等: ""EfficientDet: Scalable and Efficient Object Detection"", 《PROCEEDINGS OF THE IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》, pages 2 - 5 *
PENGLEI LIU 等: "MFPN-6D : Real-time One-stage Pose Estimation of Objects on RGB Images", 《2021 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA 2021)》, pages 12939 - 12945 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113436251A (en) * 2021-06-24 2021-09-24 东北大学 Pose estimation system and method based on improved YOLO6D algorithm
CN113436251B (en) * 2021-06-24 2024-01-09 东北大学 Pose estimation system and method based on improved YOLO6D algorithm

Also Published As

Publication number Publication date
CN112561995B (en) 2024-04-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant