CN112561995B - Real-time and efficient 6D pose estimation network, construction method and estimation method - Google Patents

Real-time and efficient 6D pose estimation network, construction method and estimation method

Info

Publication number
CN112561995B
CN112561995B (Application CN202011430902.0A)
Authority
CN
China
Prior art keywords
network
pose estimation
real
feature fusion
linemod
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011430902.0A
Other languages
Chinese (zh)
Other versions
CN112561995A (en)
Inventor
刘鹏磊 (Penglei Liu)
张锲石 (Qieshi Zhang)
程俊 (Jun Cheng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN202011430902.0A
Publication of CN112561995A
Application granted
Publication of CN112561995B
Legal status: Active

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/70: Determining position or orientation of objects or cameras
    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20081: Training; Learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20212: Image combination
    • G06T 2207/20221: Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a real-time and efficient 6D pose estimation network, together with a construction method and an estimation method, belonging to the technical field of computer vision and relating to 6D pose estimation. A multi-directional feature fusion pyramid network MFPN is used to fuse and express features; it effectively represents and processes multi-scale features and handles complex occlusion and background conditions. Taking the cross-stage partial network CSPNet as the basic module and fusing the YOLO framework, a backbone network capable of effectively extracting features is built and combined with the MFPN, and a new network, MFPN-6D, is finally designed for 6D pose estimation. The network effectively solves the problems of insufficient object texture and occlusion, improves the prediction accuracy and computation speed of the model, and enhances robustness.

Description

Real-time and efficient 6D pose estimation network, construction method and estimation method
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a real-time and efficient 6D pose estimation network, a construction method and an estimation method.
Background
6D pose estimation refers to estimating the 6D pose of an object in the camera coordinate system, i.e. its 3D position and 3D orientation. The object's own coordinate system can be regarded as the world coordinate system, so the task amounts to recovering the rotation and translation (R, t) that transform the object's world frame into the camera frame. A rigid body is an object that does not deform. Rigid-body 6D pose estimation yields the accurate pose of an object, supports fine-grained manipulation of it, and is mainly applied in robotic grasping and augmented reality. The latest research trend in 6D pose estimation is to train a deep neural network to directly predict the 2D projection positions of 3D key points from images, establish 2D-3D correspondences, and finally recover the pose with the Perspective-n-Point (PnP) algorithm. Pose estimation currently faces two challenges: when an object has little texture, or occlusion and scene clutter are present, detection accuracy drops; and most existing computational models are large and cannot meet real-time requirements.
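To make the keypoint-plus-PnP pipeline just described concrete, the following Python sketch recovers a pose from predicted 2D keypoints with OpenCV's solvePnP. It is a minimal illustration, not the implementation of the invention: the keypoint coordinates are placeholders, and the camera matrix only approximates the commonly published LINEMOD intrinsics.

```python
import numpy as np
import cv2

# Hypothetical inputs: 3D keypoints defined on the object's CAD model (object frame)
# and their 2D image projections as a keypoint network might predict them.
object_points = np.array([
    [-0.05, -0.05, 0.0], [0.05, -0.05, 0.0],
    [0.05,  0.05, 0.0],  [-0.05, 0.05, 0.0],
    [0.0,   0.0,  0.08], [0.0,   0.0, -0.08],
], dtype=np.float64)
image_points = np.array([
    [310.2, 242.1], [402.7, 240.9], [405.3, 330.4],
    [308.8, 333.0], [355.1, 285.6], [357.9, 289.3],
], dtype=np.float64)

# Placeholder pinhole intrinsics (fx, fy, cx, cy) for a 640x480 LINEMOD-style camera.
K = np.array([[572.4, 0.0, 325.3],
              [0.0, 573.6, 242.0],
              [0.0,   0.0,   1.0]])

# PnP: solve for the rotation/translation taking object-frame points into the camera frame.
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None)
R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
print("R =\n", R, "\nt =", tvec.ravel())
```

The deep network replaces the hand-crafted keypoint matching of earlier methods; the PnP step itself is unchanged.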
6D pose estimation methods in the related art fall mainly into two types: those based on depth information (RGB-D) and those based on image information (RGB). Although current pose estimation with RGB-D cameras is very reliable, depth cameras are only suitable for indoor scenes and consume more power. In contrast, RGB cameras suit a wider range of scenes and save power. Among image-based 6D pose estimation algorithms there are keypoint-matching and edge-matching methods; although they handle richly textured objects effectively, they fail on objects with little or no texture. To address this, deep-learning-based methods have recently been applied to pose estimation, for example BB8 and PVNet, which train a deep neural network to predict 2D-3D correspondences and then solve for the pose with the PnP algorithm. Although these methods achieve good accuracy, they either require post-processing stages or struggle to meet real-time requirements. Some algorithms achieve good speed, for example YOLO-6D, but that method performs poorly on occluded objects and small objects.
Accordingly, related-art 6D pose estimation has two drawbacks: when the target object has little texture, or occlusion and complex scenes are present, detection accuracy drops or detection fails altogether; and the large number of parameters required by most existing methods leads to large models, most of which cannot meet real-time requirements.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a real-time and efficient 6D pose estimation network, a construction method and an estimation method, which effectively handle insufficient surface texture and occlusion of the target object by other objects, improve detection accuracy while maintaining speed, and offer high robustness.
In order to achieve the above purpose, the invention provides a real-time and efficient 6D pose estimation network comprising a multi-directional feature fusion pyramid network and a backbone network, which are combined to form the 6D pose estimation network; the multi-directional feature fusion pyramid network is used to fuse and express features, and the backbone network is used to extract features.
Further, the multi-directional feature fusion pyramid network includes a residual structure that fuses into forward propagation and vertical propagation of the multi-directional feature fusion pyramid network.
Further, the backbone network takes the CSPNet network as its basic module and fuses the YOLO framework.
Further, the total dataset of the 6D pose estimation network includes the LINEMOD standard dataset and the Occluded-LINEMOD standard dataset, and the 6D pose estimation network is trained and validated on both.
Further, the LINEMOD standard dataset includes 13 sequences, each sequence containing the true pose of a single object in a cluttered environment and providing CAD models of all objects; the Occluded-LINEMOD standard dataset is a dataset containing a plurality of target objects and having occlusions.
Further, the total data set of the 6D pose estimation network includes a training set and a test set, the training set accounting for 20% of the total data set, and the test set accounting for 80% of the total data set.
Further, the 6D pose estimation network operates at a speed of 56 FPS.
The invention also provides a method for constructing the real-time and efficient 6D pose estimation network, which comprises the following steps: firstly, fusing a residual structure into forward propagation and vertical propagation to establish the multi-directional feature fusion pyramid network; then taking the CSPNet network as a basic module and fusing the YOLO framework to establish a backbone network; and finally, combining the multi-directional feature fusion pyramid network and the backbone network to form the 6D pose estimation network.
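The three construction steps map naturally onto a modular network definition. The PyTorch sketch below is a structural illustration only: the layer widths, the number of scales, and the YOLO-style head predicting 2D projections of 3D control points are assumptions made for the example, not the patent's exact architecture.

```python
import torch
import torch.nn as nn

class CSPBackbone(nn.Module):
    """Stand-in CSPNet-style backbone returning three feature scales."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 64, 3, 2, 1), nn.SiLU())
        self.stage2 = nn.Sequential(nn.Conv2d(64, 128, 3, 2, 1), nn.SiLU())
        self.stage3 = nn.Sequential(nn.Conv2d(128, 256, 3, 2, 1), nn.SiLU())

    def forward(self, x):
        c1 = self.stage1(x)
        c2 = self.stage2(c1)
        c3 = self.stage3(c2)
        return [c1, c2, c3]  # multi-scale features handed to the neck

class MFPN(nn.Module):
    """Placeholder neck standing in for the multi-directional feature fusion
    pyramid network (its fusion node is sketched further below)."""
    def forward(self, feats):
        return feats

class YOLOHead(nn.Module):
    """Illustrative YOLO-style head: per scale, predict the 2D projections of
    n_points 3D control points plus one confidence channel."""
    def __init__(self, channels, n_points=9):
        super().__init__()
        self.pred = nn.ModuleList(nn.Conv2d(c, 2 * n_points + 1, 1) for c in channels)

    def forward(self, feats):
        return [p(f) for p, f in zip(self.pred, feats)]

class MFPN6D(nn.Module):
    """Step 3: combine the backbone, the MFPN neck and the detection head."""
    def __init__(self):
        super().__init__()
        self.backbone = CSPBackbone()
        self.neck = MFPN()
        self.head = YOLOHead([64, 128, 256])

    def forward(self, x):
        return self.head(self.neck(self.backbone(x)))

outputs = MFPN6D()(torch.randn(1, 3, 416, 416))  # one forward pass on a dummy image
```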
Further, the 6D pose estimation network in the construction method is trained and validated on the LINEMOD standard dataset and the Occluded-LINEMOD standard dataset.
The invention also provides a 6D pose estimation method, which adopts the above real-time and efficient 6D pose estimation network.
Compared with the prior art, the invention solves the problem of rigid-body 6D pose estimation. The multi-directional feature fusion pyramid network MFPN is used to fuse and express features; it effectively represents and processes multi-scale features and handles occlusion and complicated backgrounds. Taking the cross-stage partial network CSPNet as the basic module and fusing the YOLO framework, a backbone network that effectively extracts features is constructed and combined with the MFPN, and a new network, MFPN-6D, is finally designed for 6D pose estimation. The network effectively solves the problems of insufficient object texture and occlusion, improves the prediction accuracy and computation speed of the model, and enhances robustness.
Drawings
FIG. 1 is a schematic diagram of a 6D pose estimation neural network MFPN-6D of the present invention;
FIG. 2a is a schematic diagram of a feature pyramid network FPN; FIG. 2b is a schematic diagram of a PANet network; FIG. 2c is a schematic diagram of a BiFPN network; fig. 2d is a schematic diagram of the multi-directional feature fusion pyramid network MFPN of the present invention.
Detailed Description
The present application will be further illustrated by the following description of the drawings and specific embodiments, wherein it is apparent that the embodiments described are some, but not all, of the embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to fig. 1, an embodiment of the present invention provides a real-time and efficient 6D pose estimation network MFPN-6D, which includes a multi-directional feature fusion pyramid network MFPN and a backbone network; the two are combined to form the 6D pose estimation network MFPN-6D, with the MFPN used to fuse and express features and the backbone network used for feature extraction. The MFPN includes residual structures that are fused into its forward propagation and vertical propagation. The backbone network takes the cross-stage partial network CSPNet as its basic module and fuses the YOLO framework.
The total dataset of the 6D pose estimation network MFPN-6D includes the LINEMOD standard dataset and the Occluded-LINEMOD standard dataset, and the network is trained and validated on both. The LINEMOD standard dataset includes 13 sequences, each containing the true pose of a single object in a cluttered environment, and provides a CAD model of each object; the Occluded-LINEMOD standard dataset contains multiple target objects with occlusions. The total dataset is divided into a training set and a test set, the training set comprising 20% of the total dataset and the test set 80%. The 6D pose estimation network MFPN-6D runs at 56 FPS, currently the fastest speed in the field of 6D pose estimation.
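The 20%/80% split described above can be realized as in the sketch below. This is a minimal assumption-level example operating on per-sequence frame indices, not the patent's actual data pipeline; the frame count is only illustrative.

```python
import random

def split_linemod_sequence(frame_ids, train_frac=0.2, seed=0):
    """Split one LINEMOD sequence into a 20% training set and an 80% test set,
    matching the proportions described for the MFPN-6D experiments."""
    ids = list(frame_ids)
    random.Random(seed).shuffle(ids)  # fixed seed for a reproducible split
    n_train = int(len(ids) * train_frac)
    return ids[:n_train], ids[n_train:]

# A LINEMOD sequence has roughly 1.2k frames; 1236 here is illustrative.
train_ids, test_ids = split_linemod_sequence(range(1236))
print(len(train_ids), len(test_ids))  # 247 989
```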
One of the main difficulties in 6D pose estimation is the effective representation and processing of multi-scale features. As shown in fig. 2a, the feature pyramid network FPN introduces a top-down path to combine multi-scale features, but FPN is inherently limited by its one-way information flow. To address this, PANet adds an extra bottom-up path aggregation network on top of FPN, as shown in fig. 2b. PANet is highly accurate but requires more parameters and computation. To improve model efficiency, Google researchers proposed the BiFPN network, shown in fig. 2c, with efficient bidirectional cross-scale connections and weighted feature fusion. BiFPN is more accurate than PANet at lower cost and is one of the most advanced feature networks, but it considers only forward feature propagation, not vertical propagation of features; features are therefore lost when propagated in the vertical direction, so not all feature information is used effectively.
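The weighted feature fusion just mentioned can be illustrated with EfficientDet's fast normalized fusion, where learnable non-negative weights blend the inputs as O = sum_i (w_i / (eps + sum_j w_j)) * I_i. The sketch below illustrates that published idea; it is not code from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fast normalized fusion (BiFPN style): blend n same-shaped feature maps
    with learned non-negative weights that approximately sum to one."""
    def __init__(self, n_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, feats):
        w = F.relu(self.w)             # keep the weights non-negative
        w = w / (w.sum() + self.eps)   # normalize without softmax (cheaper)
        return sum(wi * f for wi, f in zip(w, feats))

fuse = WeightedFusion(2)
a, b = torch.randn(1, 64, 52, 52), torch.randn(1, 64, 52, 52)
out = fuse([a, b])  # fused map, same shape as the inputs
```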
To process and represent multi-scale features more effectively, the concept of a residual network is applied to the 6D pose estimation network MFPN-6D of the present invention. The residual structure is fused into the forward propagation and vertical propagation of the network, producing the proposed multi-directional feature fusion pyramid network MFPN. As shown in fig. 2d, forward residual connections and residual connections in the vertical direction are added on the basis of BiFPN; the proposed MFPN improves the utilization of features in both forward and vertical propagation and represents and processes multi-scale features more effectively.
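A minimal sketch of one MFPN fusion node follows, under stated assumptions: the new element relative to BiFPN is the pair of residual (identity) additions along the lateral (forward) and cross-level (vertical) directions, while the resampling and exact node topology are illustrative guesses, since fig. 2d is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFPNNode(nn.Module):
    """One fusion node: combines the same-level lateral input with a resampled
    neighbor level, adding residual skips in both propagation directions."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.SiLU()

    def forward(self, lateral, vertical):
        # Resample the vertically propagated feature to the lateral resolution.
        vertical = F.interpolate(vertical, size=lateral.shape[-2:], mode="nearest")
        fused = self.act(self.conv(lateral + vertical))
        # Residual skips: keep the forward (lateral) and vertical identities so
        # information is not lost along either propagation direction.
        return fused + lateral + vertical

node = MFPNNode(64)
p4 = torch.randn(1, 64, 26, 26)  # same-level feature (forward path)
p5 = torch.randn(1, 64, 13, 13)  # coarser level (vertical path)
out = node(p4, p5)
```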
For backbone network design, the invention adopts the state-of-the-art cross-stage partial network CSPNet as the basic module and fuses the idea of the YOLO network framework to design the final feature-extraction backbone network; combining this backbone with the multi-directional feature fusion pyramid network MFPN yields the neural network MFPN-6D for 6D pose estimation. As shown in fig. 1, the backbone network for feature extraction is designed on the CSPNet structure and efficiently extracts features from pictures; the MFPN serves as the neck network and is combined with the backbone; and the final detection network adopts the YOLO network. The resulting network performs 6D pose estimation of objects efficiently and accurately and runs at 56 FPS, the fastest speed in the current 6D pose estimation field.
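The CSPNet idea underlying the backbone is to split the feature map along the channel axis, route only one part through the stage's convolutions, and merge the two parts afterwards, which saves computation while keeping gradients diverse. The block below is a hedged sketch of that published pattern; the channel counts and inner convolutions are illustrative, not the patent's specification.

```python
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    """Cross Stage Partial block: half the channels bypass the inner stage,
    cutting computation while enriching gradient paths at the merge."""
    def __init__(self, channels, n_convs=2):
        super().__init__()
        half = channels // 2
        self.stage = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(half, half, 3, padding=1),
                          nn.BatchNorm2d(half), nn.SiLU())
            for _ in range(n_convs)
        ])
        self.merge = nn.Conv2d(channels, channels, 1)  # transition after concat

    def forward(self, x):
        part1, part2 = x.chunk(2, dim=1)  # split channels into two parts
        part2 = self.stage(part2)         # only one part is processed
        return self.merge(torch.cat([part1, part2], dim=1))

block = CSPBlock(128)
y = block(torch.randn(1, 128, 52, 52))  # output has the same shape as the input
```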
The invention aims to provide 6D pose estimation that is efficient, fast, and able to handle occlusion. First, the multi-directional feature fusion pyramid network MFPN is designed to fuse and express features effectively; then, with CSPNet as the basic module and the YOLO framework fused in, a backbone network is designed for feature extraction; finally, the backbone network and the MFPN are combined to form the 6D pose estimation network MFPN-6D.
The multi-directional feature fusion pyramid network MFPN effectively represents and processes multi-scale features and handles occlusion and complex backgrounds. The 6D pose estimation network MFPN-6D built on it rapidly and accurately estimates the pose of the target object. Compared with other methods, the invention is far superior in efficiency, speed and robustness: it effectively handles insufficient object surface texture and occlusion of the target object by other objects, improves detection accuracy, and retains high speed and robustness.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the technical solutions according to the embodiments of the present invention.

Claims (8)

1. A real-time efficient 6D pose estimation network, comprising a multi-directional feature fusion pyramid network and a backbone network, wherein the multi-directional feature fusion pyramid network and the backbone network are combined to form the 6D pose estimation network, the multi-directional feature fusion pyramid network is used for fusing and expressing features, the backbone network is used for feature extraction, the multi-directional feature fusion pyramid network comprises a residual structure, the residual structure is fused into forward propagation and vertical propagation of the multi-directional feature fusion pyramid network, the backbone network takes a CSPNet network as a basic module to perform feature extraction on pictures and fuses a YOLO framework, a total data set of the 6D pose estimation network comprises a LINEMOD standard data set and an Occluded-LINEMOD standard data set, the LINEMOD standard data set comprises 13 sequences, each sequence comprises the real pose of a single target in a cluttered environment, and a CAD model of all targets is provided.
2. A real-time efficient 6D pose estimation network according to claim 1 wherein said 6D pose estimation network is trained and validated on said LINEMOD standard dataset and said Occluded-LINEMOD standard dataset.
3. A real-time efficient 6D pose estimation network according to claim 2, wherein said Occluded-LINEMOD standard dataset is a dataset containing multiple target objects in which occlusion is present.
4. A real-time efficient 6D pose estimation network according to claim 3, wherein the total dataset of said 6D pose estimation network comprises a training set and a test set, said training set comprising 20% of said total dataset and said test set comprising 80% of said total dataset.
5. A real-time efficient 6D pose estimation network according to claim 1 wherein said 6D pose estimation network operates at a speed of 56 FPS.
6. A method of constructing a real-time efficient 6D pose estimation network according to any of claims 1 to 5, comprising: firstly, fusing a residual structure into forward propagation and vertical propagation, and establishing a multi-directional feature fusion pyramid network; then taking the CSPNet network as a basic module, fusing the YOLO framework, and establishing a backbone network; and finally, combining the multi-directional feature fusion pyramid network and the backbone network to form the 6D pose estimation network.
7. The method of claim 6, wherein the 6D pose estimation network is trained and validated on LINEMOD standard data sets and Occluded-LINEMOD standard data sets.
8. A 6D pose estimation method, characterized in that a real-time efficient 6D pose estimation network according to any of claims 1 to 5 is employed.
CN202011430902.0A 2020-12-09 2020-12-09 Real-time and efficient 6D pose estimation network, construction method and estimation method Active CN112561995B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202011430902.0A (CN112561995B) | 2020-12-09 | 2020-12-09 | Real-time and efficient 6D pose estimation network, construction method and estimation method

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202011430902.0A (CN112561995B) | 2020-12-09 | 2020-12-09 | Real-time and efficient 6D pose estimation network, construction method and estimation method

Publications (2)

Publication Number | Publication Date
CN112561995A (en) | 2021-03-26
CN112561995B (en) | 2024-04-23

Family

ID=75060013

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202011430902.0A (Active, CN112561995B (en)) | Real-time and efficient 6D pose estimation network, construction method and estimation method | 2020-12-09 | 2020-12-09

Country Status (1)

Country Link
CN (1) CN112561995B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113436251B * | 2021-06-24 | 2024-01-09 | Northeastern University (东北大学) | Pose estimation system and method based on improved YOLO6D algorithm


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
EP3075140B1 * | 2013-11-26 | 2018-06-13 | FotoNation Cayman Limited | Array camera configurations incorporating multiple constituent array cameras

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN101473439A * | 2006-04-17 | 2009-07-01 | OmniVision CDM Optics Co., Ltd. (全视Cdm光学有限公司) | Arrayed imaging systems and associated methods
CN110533721A * | 2019-08-27 | 2019-12-03 | Hangzhou Normal University | An indoor object 6D pose estimation method based on an enhanced autoencoder
CN111145253A * | 2019-12-12 | 2020-05-12 | Shenzhen Institute of Advanced Technology | Efficient object 6D pose estimation algorithm
CN111968235A * | 2020-07-08 | 2020-11-20 | Hangzhou Yixian Advanced Technology Co., Ltd. | Object pose estimation method, device and system, and computer equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"EfficientDet: Scalable and Efficient Object Detection";Mingxing Tan 等;《Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)》;摘要,1. Introduction第2-5段,第3.2,4.1节 *
"基于3D多视图的物体识别及姿态估计方法";《中国知网 硕士电子期刊》(第08期);第5.1节 *
MFPN-6D : Real-time One-stage Pose Estimation of Objects on RGB Images;Penglei Liu 等;《2021 IEEE International Conference on Robotics and Automation (ICRA 2021)》;第12939-12945页 *
Mingxing Tan 等."EfficientDet: Scalable and Efficient Object Detection".《Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)》.2020,摘要,1. Introduction第2-5段,第3.2,4.1节. *

Also Published As

Publication number | Publication date
CN112561995A (en) | 2021-03-26

Similar Documents

Publication Publication Date Title
CN109003325B (en) Three-dimensional reconstruction method, medium, device and computing equipment
WO2016110239A1 (en) Image processing method and device
CN109683699B (en) Method and device for realizing augmented reality based on deep learning and mobile terminal
JP2020507850A (en) Method, apparatus, equipment, and storage medium for determining the shape of an object in an image
US8610712B2 (en) Object selection in stereo image pairs
Luo et al. Real-time dense monocular SLAM with online adapted depth prediction network
CN110631554A (en) Robot posture determining method and device, robot and readable storage medium
WO2021218123A1 (en) Method and device for detecting vehicle pose
WO2023016271A1 (en) Attitude determining method, electronic device, and readable storage medium
JP7273129B2 (en) Lane detection method, device, electronic device, storage medium and vehicle
US11367195B2 (en) Image segmentation method, image segmentation apparatus, image segmentation device
WO2019157922A1 (en) Image processing method and device and ar apparatus
Xu et al. GraspCNN: Real-time grasp detection using a new oriented diameter circle representation
CN111753739A (en) Object detection method, device, equipment and storage medium
CN112561995B (en) Real-time and efficient 6D pose estimation network, construction method and estimation method
CN114037087B (en) Model training method and device, depth prediction method and device, equipment and medium
CN110348351B (en) Image semantic segmentation method, terminal and readable storage medium
CN112634366B (en) Method for generating position information, related device and computer program product
WO2024051591A1 (en) Method and apparatus for estimating rotation of video, and electronic device and storage medium
Xu et al. Video-object segmentation and 3D-trajectory estimation for monocular video sequences
CN107730543A (en) A fast iterative computation method for semi-dense stereo matching
Lin et al. High-resolution multi-view stereo with dynamic depth edge flow
CN111192312A (en) Depth image acquisition method, device, equipment and medium based on deep learning
CN114429631B (en) Three-dimensional object detection method, device, equipment and storage medium
CN112085842A (en) Depth value determination method and device, electronic equipment and storage medium

Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant