CN112561995B - Real-time and efficient 6D pose estimation network, construction method and estimation method - Google Patents

Real-time and efficient 6D pose estimation network, construction method and estimation method

Info

Publication number
CN112561995B
CN112561995B (Application CN202011430902.0A)
Authority
CN
China
Prior art keywords
network
pose estimation
real
feature fusion
linemod
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011430902.0A
Other languages
Chinese (zh)
Other versions
CN112561995A (en)
Inventor
刘鹏磊 (Penglei Liu)
张锲石 (Qieshi Zhang)
程俊 (Jun Cheng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN202011430902.0A
Publication of CN112561995A
Application granted
Publication of CN112561995B
Legal status: Active

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/70: Determining position or orientation of objects or cameras
    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20081: Training; Learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20212: Image combination
    • G06T 2207/20221: Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a real-time and efficient 6D pose estimation network, together with a construction method and an estimation method, belonging to the technical field of computer vision and relating to 6D pose estimation. A multi-directional feature fusion pyramid network MFPN is used to fuse and express features; it effectively represents and processes multi-scale features and handles complex occlusion and background conditions. Taking the cross-stage partial network CSPNet as the basic module and fusing the YOLO framework, a backbone network capable of effectively extracting features is built and combined with the MFPN, and a new network, MFPN-6D, is finally designed for 6D pose estimation. The network effectively solves the problems of insufficient object texture and occlusion, improves the prediction accuracy and computation speed of the model, and enhances robustness.

Description

Real-time and efficient 6D pose estimation network, construction method and estimation method
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a real-time and efficient 6D pose estimation network, a construction method and an estimation method.
Background
6D pose estimation refers to estimating the 6D pose of an object in the camera coordinate system, i.e. its 3D position and 3D orientation. The object's own coordinate system can be regarded as the world coordinate system, so the task amounts to recovering the rotation and translation (R, t) that transform the object's world frame into the camera frame. A rigid body is an object that does not deform. Rigid-body 6D pose estimation yields the accurate pose of an object, supports fine-grained manipulation of it, and is mainly applied in robotic grasping and augmented reality. The latest research trend in 6D pose estimation is to train a deep neural network to directly predict the 2D projection positions of 3D key points from images, establish 2D-3D correspondences, and finally recover the pose with the Perspective-n-Point (PnP) algorithm. Pose estimation currently faces two challenges: when an object has little texture, or occlusion and scene clutter are present, detection accuracy drops; and most existing computational models are large and cannot meet real-time requirements.
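To make the keypoint-plus-PnP pipeline just described concrete, the following Python sketch recovers a pose from predicted 2D keypoints with OpenCV's solvePnP. It is a minimal illustration, not the implementation of the invention: the keypoint coordinates are placeholders, and the camera matrix only approximates the commonly published LINEMOD intrinsics.

```python
import numpy as np
import cv2

# Hypothetical inputs: 3D keypoints defined on the object's CAD model (object frame)
# and their 2D image projections as a keypoint network might predict them.
object_points = np.array([
    [-0.05, -0.05, 0.0], [0.05, -0.05, 0.0],
    [0.05,  0.05, 0.0],  [-0.05, 0.05, 0.0],
    [0.0,   0.0,  0.08], [0.0,   0.0, -0.08],
], dtype=np.float64)
image_points = np.array([
    [310.2, 242.1], [402.7, 240.9], [405.3, 330.4],
    [308.8, 333.0], [355.1, 285.6], [357.9, 289.3],
], dtype=np.float64)

# Placeholder pinhole intrinsics (fx, fy, cx, cy) for a 640x480 LINEMOD-style camera.
K = np.array([[572.4, 0.0, 325.3],
              [0.0, 573.6, 242.0],
              [0.0,   0.0,   1.0]])

# PnP: solve for the rotation/translation taking object-frame points into the camera frame.
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None)
R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
print("R =\n", R, "\nt =", tvec.ravel())
```

The deep network replaces the hand-crafted keypoint matching of earlier methods; the PnP step itself is unchanged.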
6D pose estimation methods in the related art fall mainly into two types: those based on depth information (RGB-D) and those based on image information (RGB). Although current pose estimation with RGB-D cameras is very reliable, depth cameras are only suitable for indoor scenes and consume more power. In contrast, RGB cameras suit a wider range of scenes and save power. Among image-based 6D pose estimation algorithms there are keypoint-matching and edge-matching methods; although they handle richly textured objects effectively, they fail on objects with little or no texture. To address this, deep-learning-based methods have recently been applied to pose estimation, for example BB8 and PVNet, which train a deep neural network to predict 2D-3D correspondences and then solve for the pose with the PnP algorithm. Although these methods achieve good accuracy, they either require post-processing stages or struggle to meet real-time requirements. Some algorithms achieve good speed, for example YOLO-6D, but that method performs poorly on occluded objects and small objects.
Accordingly, related-art 6D pose estimation has two drawbacks: when the target object has little texture, or occlusion and complex scenes are present, detection accuracy drops or detection fails altogether; and the large number of parameters required by most existing methods leads to large models, most of which cannot meet real-time requirements.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a real-time and efficient 6D pose estimation network, a construction method and an estimation method, which effectively handle insufficient surface texture and occlusion of the target object by other objects, improve detection accuracy while maintaining speed, and offer high robustness.
In order to achieve the above purpose, the invention provides a real-time and efficient 6D pose estimation network comprising a multi-directional feature fusion pyramid network and a backbone network, which are combined to form the 6D pose estimation network; the multi-directional feature fusion pyramid network is used to fuse and express features, and the backbone network is used to extract features.
Further, the multi-directional feature fusion pyramid network includes a residual structure that fuses into forward propagation and vertical propagation of the multi-directional feature fusion pyramid network.
Further, the backbone network takes the CSPNet network as its basic module and fuses the YOLO framework.
Further, the total dataset of the 6D pose estimation network includes the LINEMOD standard dataset and the Occluded-LINEMOD standard dataset, and the 6D pose estimation network is trained and validated on both.
Further, the LINEMOD standard dataset includes 13 sequences, each sequence containing the true pose of a single object in a cluttered environment and providing CAD models of all objects; the Occluded-LINEMOD standard dataset is a dataset containing a plurality of target objects and having occlusions.
Further, the total data set of the 6D pose estimation network includes a training set and a test set, the training set accounting for 20% of the total data set, and the test set accounting for 80% of the total data set.
Further, the 6D pose estimation network operates at a speed of 56 FPS.
The invention also provides a method for constructing the real-time and efficient 6D pose estimation network, which comprises the following steps: firstly, fusing a residual structure into forward propagation and vertical propagation to establish the multi-directional feature fusion pyramid network; then taking the CSPNet network as a basic module and fusing the YOLO framework to establish a backbone network; and finally, combining the multi-directional feature fusion pyramid network and the backbone network to form the 6D pose estimation network.
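The three construction steps map naturally onto a modular network definition. The PyTorch sketch below is a structural illustration only: the layer widths, the number of scales, and the YOLO-style head predicting 2D projections of 3D control points are assumptions made for the example, not the patent's exact architecture.

```python
import torch
import torch.nn as nn

class CSPBackbone(nn.Module):
    """Stand-in CSPNet-style backbone returning three feature scales."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 64, 3, 2, 1), nn.SiLU())
        self.stage2 = nn.Sequential(nn.Conv2d(64, 128, 3, 2, 1), nn.SiLU())
        self.stage3 = nn.Sequential(nn.Conv2d(128, 256, 3, 2, 1), nn.SiLU())

    def forward(self, x):
        c1 = self.stage1(x)
        c2 = self.stage2(c1)
        c3 = self.stage3(c2)
        return [c1, c2, c3]  # multi-scale features handed to the neck

class MFPN(nn.Module):
    """Placeholder neck standing in for the multi-directional feature fusion
    pyramid network (its fusion node is sketched further below)."""
    def forward(self, feats):
        return feats

class YOLOHead(nn.Module):
    """Illustrative YOLO-style head: per scale, predict the 2D projections of
    n_points 3D control points plus one confidence channel."""
    def __init__(self, channels, n_points=9):
        super().__init__()
        self.pred = nn.ModuleList(nn.Conv2d(c, 2 * n_points + 1, 1) for c in channels)

    def forward(self, feats):
        return [p(f) for p, f in zip(self.pred, feats)]

class MFPN6D(nn.Module):
    """Step 3: combine the backbone, the MFPN neck and the detection head."""
    def __init__(self):
        super().__init__()
        self.backbone = CSPBackbone()
        self.neck = MFPN()
        self.head = YOLOHead([64, 128, 256])

    def forward(self, x):
        return self.head(self.neck(self.backbone(x)))

outputs = MFPN6D()(torch.randn(1, 3, 416, 416))  # one forward pass on a dummy image
```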
Further, the 6D pose estimation network in the construction method is trained and validated on the LINEMOD standard dataset and the Occluded-LINEMOD standard dataset.
The invention also provides a 6D pose estimation method, which adopts the above real-time and efficient 6D pose estimation network.
Compared with the prior art, the invention solves the problem of rigid-body 6D pose estimation. The multi-directional feature fusion pyramid network MFPN is used to fuse and express features; it effectively represents and processes multi-scale features and handles occlusion and complicated backgrounds. Taking the cross-stage partial network CSPNet as the basic module and fusing the YOLO framework, a backbone network that effectively extracts features is constructed and combined with the MFPN, and a new network, MFPN-6D, is finally designed for 6D pose estimation. The network effectively solves the problems of insufficient object texture and occlusion, improves the prediction accuracy and computation speed of the model, and enhances robustness.
Drawings
FIG. 1 is a schematic diagram of a 6D pose estimation neural network MFPN-6D of the present invention;
FIG. 2a is a schematic diagram of a feature pyramid network FPN; FIG. 2b is a schematic diagram of a PANet network; FIG. 2c is a schematic diagram of a BiFPN network; fig. 2d is a schematic diagram of the multi-directional feature fusion pyramid network MFPN of the present invention.
Detailed Description
The present application will be further illustrated by the following description of the drawings and specific embodiments, wherein it is apparent that the embodiments described are some, but not all, of the embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to fig. 1, an embodiment of the present invention provides a real-time and efficient 6D pose estimation network MFPN-6D, which includes a multi-directional feature fusion pyramid network MFPN and a backbone network; the two are combined to form the 6D pose estimation network MFPN-6D, with the MFPN used to fuse and express features and the backbone network used for feature extraction. The MFPN includes residual structures that are fused into its forward propagation and vertical propagation. The backbone network takes the cross-stage partial network CSPNet as its basic module and fuses the YOLO framework.
The total dataset of the 6D pose estimation network MFPN-6D includes the LINEMOD standard dataset and the Occluded-LINEMOD standard dataset, and the network is trained and validated on both. The LINEMOD standard dataset includes 13 sequences, each containing the true pose of a single object in a cluttered environment, and provides a CAD model of each object; the Occluded-LINEMOD standard dataset contains multiple target objects with occlusions. The total dataset is divided into a training set and a test set, the training set comprising 20% of the total dataset and the test set 80%. The 6D pose estimation network MFPN-6D runs at 56 FPS, currently the fastest speed in the field of 6D pose estimation.
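The 20%/80% split described above can be realized as in the sketch below. This is a minimal assumption-level example operating on per-sequence frame indices, not the patent's actual data pipeline; the frame count is only illustrative.

```python
import random

def split_linemod_sequence(frame_ids, train_frac=0.2, seed=0):
    """Split one LINEMOD sequence into a 20% training set and an 80% test set,
    matching the proportions described for the MFPN-6D experiments."""
    ids = list(frame_ids)
    random.Random(seed).shuffle(ids)  # fixed seed for a reproducible split
    n_train = int(len(ids) * train_frac)
    return ids[:n_train], ids[n_train:]

# A LINEMOD sequence has roughly 1.2k frames; 1236 here is illustrative.
train_ids, test_ids = split_linemod_sequence(range(1236))
print(len(train_ids), len(test_ids))  # 247 989
```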
One of the main difficulties in 6D pose estimation is the effective representation and processing of multi-scale features. As shown in fig. 2a, the feature pyramid network FPN introduces a top-down path to combine multi-scale features, but FPN is inherently limited by its one-way information flow. To address this, PANet adds an extra bottom-up path aggregation network on top of FPN, as shown in fig. 2b. PANet is highly accurate but requires more parameters and computation. To improve model efficiency, Google researchers proposed the BiFPN network, shown in fig. 2c, with efficient bidirectional cross-scale connections and weighted feature fusion. BiFPN is more accurate than PANet at lower cost and is one of the most advanced feature networks, but it considers only forward feature propagation, not vertical propagation of features; features are therefore lost when propagated in the vertical direction, so not all feature information is used effectively.
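The weighted feature fusion just mentioned can be illustrated with EfficientDet's fast normalized fusion, where learnable non-negative weights blend the inputs as O = sum_i (w_i / (eps + sum_j w_j)) * I_i. The sketch below illustrates that published idea; it is not code from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fast normalized fusion (BiFPN style): blend n same-shaped feature maps
    with learned non-negative weights that approximately sum to one."""
    def __init__(self, n_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, feats):
        w = F.relu(self.w)             # keep the weights non-negative
        w = w / (w.sum() + self.eps)   # normalize without softmax (cheaper)
        return sum(wi * f for wi, f in zip(w, feats))

fuse = WeightedFusion(2)
a, b = torch.randn(1, 64, 52, 52), torch.randn(1, 64, 52, 52)
out = fuse([a, b])  # fused map, same shape as the inputs
```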
To process and represent multi-scale features more effectively, the concept of a residual network is applied to the 6D pose estimation network MFPN-6D of the present invention. The residual structure is fused into the forward propagation and vertical propagation of the network, producing the proposed multi-directional feature fusion pyramid network MFPN. As shown in fig. 2d, forward residual connections and residual connections in the vertical direction are added on the basis of BiFPN; the proposed MFPN improves the utilization of features in both forward and vertical propagation and represents and processes multi-scale features more effectively.
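A minimal sketch of one MFPN fusion node follows, under stated assumptions: the new element relative to BiFPN is the pair of residual (identity) additions along the lateral (forward) and cross-level (vertical) directions, while the resampling and exact node topology are illustrative guesses, since fig. 2d is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFPNNode(nn.Module):
    """One fusion node: combines the same-level lateral input with a resampled
    neighbor level, adding residual skips in both propagation directions."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.SiLU()

    def forward(self, lateral, vertical):
        # Resample the vertically propagated feature to the lateral resolution.
        vertical = F.interpolate(vertical, size=lateral.shape[-2:], mode="nearest")
        fused = self.act(self.conv(lateral + vertical))
        # Residual skips: keep the forward (lateral) and vertical identities so
        # information is not lost along either propagation direction.
        return fused + lateral + vertical

node = MFPNNode(64)
p4 = torch.randn(1, 64, 26, 26)  # same-level feature (forward path)
p5 = torch.randn(1, 64, 13, 13)  # coarser level (vertical path)
out = node(p4, p5)
```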
For backbone network design, the invention adopts the state-of-the-art cross-stage partial network CSPNet as the basic module and fuses the idea of the YOLO network framework to design the final feature-extraction backbone network; combining this backbone with the multi-directional feature fusion pyramid network MFPN yields the neural network MFPN-6D for 6D pose estimation. As shown in fig. 1, the backbone network for feature extraction is designed on the CSPNet structure and efficiently extracts features from pictures; the MFPN serves as the neck network and is combined with the backbone; and the final detection network adopts the YOLO network. The resulting network performs 6D pose estimation of objects efficiently and accurately and runs at 56 FPS, the fastest speed in the current 6D pose estimation field.
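The CSPNet idea underlying the backbone is to split the feature map along the channel axis, route only one part through the stage's convolutions, and merge the two parts afterwards, which saves computation while keeping gradients diverse. The block below is a hedged sketch of that published pattern; the channel counts and inner convolutions are illustrative, not the patent's specification.

```python
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    """Cross Stage Partial block: half the channels bypass the inner stage,
    cutting computation while enriching gradient paths at the merge."""
    def __init__(self, channels, n_convs=2):
        super().__init__()
        half = channels // 2
        self.stage = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(half, half, 3, padding=1),
                          nn.BatchNorm2d(half), nn.SiLU())
            for _ in range(n_convs)
        ])
        self.merge = nn.Conv2d(channels, channels, 1)  # transition after concat

    def forward(self, x):
        part1, part2 = x.chunk(2, dim=1)  # split channels into two parts
        part2 = self.stage(part2)         # only one part is processed
        return self.merge(torch.cat([part1, part2], dim=1))

block = CSPBlock(128)
y = block(torch.randn(1, 128, 52, 52))  # output has the same shape as the input
```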
The invention aims to provide 6D pose estimation that is efficient, fast, and able to handle occlusion. First, the multi-directional feature fusion pyramid network MFPN is designed to fuse and express features effectively; then, with CSPNet as the basic module and the YOLO framework fused in, a backbone network is designed for feature extraction; finally, the backbone network and the MFPN are combined to form the 6D pose estimation network MFPN-6D.
The multi-directional feature fusion pyramid network MFPN effectively represents and processes multi-scale features and handles occlusion and complex backgrounds. The 6D pose estimation network MFPN-6D built on it rapidly and accurately estimates the pose of the target object. Compared with other methods, the invention is far superior in efficiency, speed and robustness: it effectively handles insufficient object surface texture and occlusion of the target object by other objects, improves detection accuracy, and retains high speed and robustness.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the technical solutions according to the embodiments of the present invention.

Claims (8)

1. A real-time efficient 6D pose estimation network, comprising a multi-directional feature fusion pyramid network and a backbone network, wherein the multi-directional feature fusion pyramid network and the backbone network are combined to form the 6D pose estimation network, the multi-directional feature fusion pyramid network is used for fusing and expressing features, the backbone network is used for feature extraction, the multi-directional feature fusion pyramid network comprises a residual structure, the residual structure is fused into forward propagation and vertical propagation of the multi-directional feature fusion pyramid network, the backbone network takes a CSPNet network as a basic module to perform feature extraction on pictures and fuses a YOLO framework, a total data set of the 6D pose estimation network comprises a LINEMOD standard data set and an Occluded-LINEMOD standard data set, the LINEMOD standard data set comprises 13 sequences, each sequence comprises the real pose of a single target in a cluttered environment, and a CAD model of all targets is provided.
2. A real-time efficient 6D pose estimation network according to claim 1 wherein said 6D pose estimation network is trained and validated on said LINEMOD standard dataset and said Occluded-LINEMOD standard dataset.
3. A real-time efficient 6D pose estimation network according to claim 2, wherein said Occluded-LINEMOD standard dataset is a dataset containing multiple target objects in which occlusion is present.
4. A real-time efficient 6D pose estimation network according to claim 3, wherein the total dataset of said 6D pose estimation network comprises a training set and a test set, said training set comprising 20% of said total dataset and said test set comprising 80% of said total dataset.
5. A real-time efficient 6D pose estimation network according to claim 1 wherein said 6D pose estimation network operates at a speed of 56 FPS.
6. A method of constructing a real-time efficient 6D pose estimation network according to any of claims 1 to 5, comprising: firstly, fusing a residual structure into forward propagation and vertical propagation, and establishing a multi-directional feature fusion pyramid network; then taking the CSPNet network as a basic module, fusing the YOLO framework, and establishing a backbone network; and finally, combining the multi-directional feature fusion pyramid network and the backbone network to form the 6D pose estimation network.
7. The method of claim 6, wherein the 6D pose estimation network is trained and validated on LINEMOD standard data sets and Occluded-LINEMOD standard data sets.
8. A 6D pose estimation method, characterized in that a real-time efficient 6D pose estimation network according to any of claims 1 to 5 is employed.
CN202011430902.0A 2020-12-09 2020-12-09 Real-time and efficient 6D pose estimation network, construction method and estimation method Active CN112561995B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202011430902.0A (CN112561995B) | 2020-12-09 | 2020-12-09 | Real-time and efficient 6D pose estimation network, construction method and estimation method

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202011430902.0A (CN112561995B) | 2020-12-09 | 2020-12-09 | Real-time and efficient 6D pose estimation network, construction method and estimation method

Publications (2)

Publication Number | Publication Date
CN112561995A (en) | 2021-03-26
CN112561995B (en) | 2024-04-23

Family

ID=75060013

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202011430902.0A (Active, CN112561995B (en)) | Real-time and efficient 6D pose estimation network, construction method and estimation method | 2020-12-09 | 2020-12-09

Country Status (1)

Country Link
CN (1) CN112561995B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113436251B * | 2021-06-24 | 2024-01-09 | Northeastern University (东北大学) | Pose estimation system and method based on improved YOLO6D algorithm


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
EP3075140B1 * | 2013-11-26 | 2018-06-13 | FotoNation Cayman Limited | Array camera configurations incorporating multiple constituent array cameras

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN101473439A * | 2006-04-17 | 2009-07-01 | OmniVision CDM Optics Co., Ltd. (全视Cdm光学有限公司) | Arrayed imaging systems and associated methods
CN110533721A * | 2019-08-27 | 2019-12-03 | Hangzhou Normal University | An indoor object 6D pose estimation method based on an enhanced autoencoder
CN111145253A * | 2019-12-12 | 2020-05-12 | Shenzhen Institute of Advanced Technology | Efficient object 6D pose estimation algorithm
CN111968235A * | 2020-07-08 | 2020-11-20 | Hangzhou Yixian Advanced Technology Co., Ltd. | Object pose estimation method, device and system, and computer equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"EfficientDet: Scalable and Efficient Object Detection";Mingxing Tan 等;《Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)》;摘要,1. Introduction第2-5段,第3.2,4.1节 *
"基于3D多视图的物体识别及姿态估计方法";《中国知网 硕士电子期刊》(第08期);第5.1节 *
MFPN-6D : Real-time One-stage Pose Estimation of Objects on RGB Images;Penglei Liu 等;《2021 IEEE International Conference on Robotics and Automation (ICRA 2021)》;第12939-12945页 *
Mingxing Tan 等."EfficientDet: Scalable and Efficient Object Detection".《Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)》.2020,摘要,1. Introduction第2-5段,第3.2,4.1节. *

Also Published As

Publication number | Publication date
CN112561995A (en) | 2021-03-26

Similar Documents

Publication Publication Date Title
CN109003325B (en) Three-dimensional reconstruction method, medium, device and computing equipment
WO2016110239A1 (en) Image processing method and device
CN109683699B (en) Method and device for realizing augmented reality based on deep learning and mobile terminal
JP2020507850A (en) Method, apparatus, equipment, and storage medium for determining the shape of an object in an image
US8610712B2 (en) Object selection in stereo image pairs
Luo et al. Real-time dense monocular SLAM with online adapted depth prediction network
CN110631554A (en) Robot posture determining method and device, robot and readable storage medium
WO2021218123A1 (en) Method and device for detecting vehicle pose
WO2023016271A1 (en) Attitude determining method, electronic device, and readable storage medium
JP7273129B2 (en) Lane detection method, device, electronic device, storage medium and vehicle
US11367195B2 (en) Image segmentation method, image segmentation apparatus, image segmentation device
WO2019157922A1 (en) Image processing method and device and ar apparatus
Xu et al. GraspCNN: Real-time grasp detection using a new oriented diameter circle representation
CN111753739A (en) Object detection method, device, equipment and storage medium
CN112561995B (en) Real-time and efficient 6D pose estimation network, construction method and estimation method
CN114037087B (en) Model training method and device, depth prediction method and device, equipment and medium
CN110348351B (en) Image semantic segmentation method, terminal and readable storage medium
CN112634366B (en) Method for generating position information, related device and computer program product
WO2024051591A1 (en) Method and apparatus for estimating rotation of video, and electronic device and storage medium
Xu et al. Video-object segmentation and 3D-trajectory estimation for monocular video sequences
CN107730543A (en) A fast iterative computation method for semi-dense stereo matching
Lin et al. High-resolution multi-view stereo with dynamic depth edge flow
CN111192312A (en) Depth image acquisition method, device, equipment and medium based on deep learning
CN114429631B (en) Three-dimensional object detection method, device, equipment and storage medium
CN112085842A (en) Depth value determination method and device, electronic equipment and storage medium

Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant