CN116245940A - Category-level six-degree-of-freedom object pose estimation method based on structure difference perception
- Publication number: CN116245940A (application CN202310052012.8A)
- Authority: CN (China)
- Prior art keywords: category, instance, geometric, features, geometric features
- Legal status: Granted
Classifications

- G06T7/70 - Image analysis: determining position or orientation of objects or cameras
- G06T7/11 - Image analysis: region-based segmentation
- G06V10/75 - Image or video recognition using machine learning: organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; coarse-fine approaches, e.g. multi-scale approaches; context analysis; selection of dictionaries
- G06V10/764 - Image or video recognition using machine learning: classification, e.g. of video objects
- G06V10/806 - Image or video recognition using machine learning: fusion of extracted features at sensor, preprocessing, feature-extraction or classification level
- G06V10/82 - Image or video recognition using machine learning: using neural networks
Abstract
The invention relates to a category-level six-degree-of-freedom object pose estimation method based on structural difference perception, comprising the following steps: inputting the depth map into a target detection and segmentation network, obtaining the observation point cloud of the object instance from the detection result, and selecting the category prior corresponding to the target object based on the observation point cloud; extracting features from the observation point cloud and the category prior to obtain instance geometric features and category geometric features; inputting the instance geometric features and category geometric features into an information interaction enhancement module to obtain enhanced instance geometric features and enhanced category geometric features; fusing semantic and geometric information through a semantic dynamic fusion module to obtain instance fusion features and category fusion features; obtaining an instance NOCS model from the category fusion features; and matching the instance NOCS model with the observation point cloud through a matching network, computing the 6D pose and size of the target object from the similarity. The method can improve the accuracy of 6D pose estimation.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a category-level six-degree-of-freedom object pose estimation method based on structural difference perception.
Background
Estimating the six-degree-of-freedom (6D) pose of a real object from an image, i.e., the position and orientation of the object in the camera coordinate system, is a critical task; the pose consists of a three-dimensional rotation matrix and a three-dimensional translation vector. 6D object pose estimation is widely used in many real-world scenarios, such as 3D scene understanding, robotic grasping, virtual reality and augmented reality. According to the level at which objects are estimated, the task falls into two categories: 1. instance-level 6D pose estimation for a specific object; 2. category-level 6D pose estimation for objects of the same category. Instance-level 6D pose estimation requires knowing the object's position in a world coordinate system in advance, with the origin of that coordinate system usually placed at the object's center, i.e., it requires the object's CAD model. For a new object in a real scene without a defined CAD model, instance-level algorithms have no way to estimate the pose, which severely limits their application in real scenes. Therefore, to break this limitation, the category-level 6D pose estimation task was proposed, which can estimate the 6D pose of different object instances of the same category even when some instances have no CAD model.
Wang et al. first proposed the category-level object 6D pose estimation task. To address the lack of CAD models when estimating an object's 6D pose, they introduced the Normalized Object Coordinate Space (NOCS), a canonical representation shared by all possible object instances of a category: the object instance is first reconstructed in NOCS, and the pose transformation of the instance from NOCS to the camera coordinate system, i.e., the 6D pose of the object, is then computed. Because different object instances of the same category can differ greatly in structure, reconstructing their NOCS models is very difficult, and this is the core challenge of category-level 6D pose estimation. To address this, SPD proposed learning a category prior for each category and deforming the prior according to the particular object instance to reconstruct the instance's NOCS model, which improves pose accuracy; however, the ambiguity of the category prior still leads to inaccurate reconstructed NOCS models.
Disclosure of Invention
The invention aims to provide a category-level six-degree-of-freedom object pose estimation method based on structural difference perception, which can improve the accuracy of 6D pose estimation.
The technical solution adopted by the invention to solve the above technical problem is as follows: a category-level six-degree-of-freedom object pose estimation method based on structural difference perception is provided, comprising the following steps:
inputting the depth map into a target detection segmentation network to obtain an image block of a target object and a segmentation mask of the target object;
obtaining an observation point cloud of an object instance according to the segmentation mask of the target object and the depth map, and selecting a category prior corresponding to the target object based on the observation point cloud of the object instance;
extracting features from the observation point cloud and the category prior to obtain instance geometric features and category geometric features;
inputting the instance geometric features and the category geometric features into an information interaction enhancement module, implicitly modeling the geometric differences between the instance geometric features and the category geometric features through the information interaction enhancement module, and supplementing the instance geometric features and the category geometric features with these differences to obtain enhanced instance geometric features and enhanced category geometric features;
inputting the geometric differences between the instance geometric features and the category geometric features, together with the enhanced instance geometric features and the enhanced category geometric features, into a semantic dynamic fusion module, and fusing semantic and geometric information through the semantic dynamic fusion module to obtain instance fusion features and category fusion features;
the category fusion features are sent to a deformation network to obtain a deformation field, and the category prior is deformed by using the deformation field to obtain an instance NOCS model;
and matching the instance NOCS model with the observation point cloud through a matching network, and computing the 6D pose and size of the target object from the similarity.
The target detection segmentation network adopts a Mask-RCNN network.
Features of the observation point cloud and the category prior are extracted using a convolutional neural network and a PointNet++ network.
The information interaction enhancement module comprises: a fully connected layer for mapping the instance geometric features and the category geometric features into the same feature subspace; a matrix multiplication unit for multiplying the mapped instance geometric features and category geometric features to obtain a structural relation matrix between them; a normalization unit for normalizing the structural relation matrix into weight coefficients; a weighted summation unit for weighting the geometric projection features with these coefficients and summing them to obtain structural difference features; and a multi-layer perceptron for fusing the structural difference features with the instance geometric features and the category geometric features respectively, obtaining the enhanced instance geometric features and enhanced category geometric features.
The semantic dynamic fusion module adopts a pixel-level fusion strategy, implemented as a corresponding-point fusion module, for the enhanced instance geometric features, exploring the intrinsic mapping between data sources to obtain the instance fusion features; for the enhanced category geometric features and the instance semantic features of different individuals, it uses the geometric differences between the instance geometric features and the category geometric features to dynamically adjust the instance semantic features, and fuses the adjusted features with the enhanced category geometric features to obtain the category fusion features.
Advantageous effects
Owing to the adoption of the above technical solution, the invention has the following advantages and positive effects compared with the prior art: the invention exploits the structural differences between the object instance and the category prior to enhance the learning of intra-category shape information; the semantic dynamic fusion module further dynamically adjusts the semantic information according to the geometric relationship between the object instance and the category prior, and fuses it with the enhanced category prior to dynamically compensate for missing geometric information, thereby improving robustness to noise.
Drawings
FIG. 1 is a flow chart of a category-level six-degree-of-freedom object pose estimation method based on structural difference perception in an embodiment of the invention;
FIG. 2 is a schematic diagram of an information interaction enhancement module in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a semantic dynamic fusion module in an embodiment of the present invention;
FIG. 4 is a schematic view of observation point clouds of different object instances;
FIG. 5 is a comparison of the results of an embodiment of the present invention and the SPD method.
Detailed Description
The invention will be further illustrated with reference to specific examples. It is to be understood that these examples are illustrative of the present invention and are not intended to limit the scope of the present invention. Further, it is understood that various changes and modifications may be made by those skilled in the art after reading the teachings of the present invention, and such equivalents are intended to fall within the scope of the claims appended hereto.
The embodiment of the invention relates to a category-level six-degree-of-freedom object pose estimation method based on structural difference perception. As shown in fig. 1, the method comprises the following steps:
and step 1, inputting the depth map into a target detection segmentation network to obtain an image block of a target object and a segmentation mask of the target object. In this step, an existing object detection segmentation network may be used to obtain the image block of the object and its segmentation Mask, for example, a Mask-RCNN network may be used.
Step 2, obtaining the observation point cloud of the object instance from the segmentation mask of the target object and the depth map, and selecting the category prior corresponding to the target object based on the observation point cloud of the object instance.
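For this step, the observation point cloud can be recovered by back-projecting the masked depth pixels through the pinhole camera model. A minimal sketch follows; the intrinsics (fx, fy, cx, cy) and the millimetre depth encoding are assumptions about the sensor.

```python
import numpy as np

def depth_to_point_cloud(depth, mask, fx, fy, cx, cy, depth_scale=1000.0):
    """Back-project masked depth pixels into camera-frame 3D points [N, 3]."""
    v, u = np.nonzero(mask)             # pixel coordinates inside the mask
    z = depth[v, u] / depth_scale       # metres, assuming depth stored in mm
    u, v, z = u[z > 0], v[z > 0], z[z > 0]
    x = (u - cx) * z / fx               # pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)  # observation point cloud of the instance
```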
Step 3, extracting features from the observation point cloud and the category prior to obtain instance geometric features and category geometric features. In this step, a convolutional neural network and a PointNet++ network can be used to extract picture semantic features and point cloud geometric features, respectively, yielding the instance geometric features and category geometric features.
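The embodiment names PointNet++ as the geometric extractor; as a stand-in, the sketch below uses a simplified PointNet-style encoder (shared per-point MLP plus a pooled global vector), purely to show the shape of the per-point features the later modules consume. All dimensions are assumptions.

```python
import torch
import torch.nn as nn

class PointEncoder(nn.Module):
    """Simplified PointNet-style stand-in for the PointNet++ extractor."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, out_dim, 1), nn.ReLU())

    def forward(self, pts):                    # pts: [B, N, 3]
        f = self.mlp(pts.transpose(1, 2))      # [B, C, N] per-point features
        g = f.max(dim=2, keepdim=True).values  # [B, C, 1] global shape vector
        return torch.cat([f, g.expand_as(f)], dim=1).transpose(1, 2)  # [B, N, 2C]

enc = PointEncoder()
inst_geo = enc(torch.rand(1, 1024, 3))   # instance geometric features
cat_geo = enc(torch.rand(1, 1024, 3))    # category geometric features
```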
Step 4, inputting the instance geometric features and the category geometric features into an information interaction enhancement module, implicitly modeling the geometric differences between them through the module, and supplementing the instance geometric features and category geometric features with these differences to obtain enhanced instance geometric features and enhanced category geometric features.
The information interaction enhancement module in this step aims to learn the structural relationship between the instance point cloud and the category prior, helping to build their structural difference information at the feature level. It uses the structural difference features to supplement the original geometric features, so that the enhanced geometric features contain both the unique individuality of the instance structure and the general commonality of the category prior. On the one hand, complemented by the instance's structural characteristics, the enhanced category geometric features can reconstruct a more accurate instance NOCS model. On the other hand, the instance geometric features absorb the commonality of the category shape, so that the reconstructed correspondence matrix better associates the observed point cloud with the NOCS model. In addition, since the geometric differences between the category prior and different instances of the same category vary, the information interaction enhancement module can adapt to previously unseen instances of various shapes, which greatly improves the generalization of this embodiment.
The structure of the information interaction enhancement module is shown in fig. 2 and includes: a fully connected layer for mapping the instance geometric features and the category geometric features into the same feature subspace; a matrix multiplication unit for multiplying the mapped instance geometric features and category geometric features to obtain a structural relation matrix between them; a normalization unit for normalizing the structural relation matrix into weight coefficients; a weighted summation unit for weighting the geometric projection features with these coefficients and summing them to obtain structural difference features; and a multi-layer perceptron for fusing the structural difference features with the instance geometric features and the category geometric features respectively, obtaining the enhanced instance geometric features and enhanced category geometric features.
Specifically, the instance geometric features and the category geometric features are mapped into the same feature subspace using fully connected layers, and their structural relation matrix is obtained by matrix multiplication. The structural relation matrix is normalized into weight coefficients, and the geometric projection features are weighted and summed to obtain the structural difference features. Finally, a multi-layer perceptron fuses the original geometric features with the structural difference features to obtain the enhanced geometric features.
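A minimal PyTorch sketch of the module as just described: fully connected projections into a shared subspace, a structural relation matrix from matrix multiplication, normalization into weight coefficients, weighted summation into structural difference features, and MLP fusion. The feature dimensions and the choice of softmax as the normalization are assumptions.

```python
import torch
import torch.nn as nn

class InteractionEnhance(nn.Module):
    def __init__(self, dim=128, sub=64):
        super().__init__()
        self.proj_i = nn.Linear(dim, sub)   # instance features -> shared subspace
        self.proj_c = nn.Linear(dim, sub)   # category features -> shared subspace
        self.mlp_i = nn.Sequential(nn.Linear(dim + sub, dim), nn.ReLU())
        self.mlp_c = nn.Sequential(nn.Linear(dim + sub, dim), nn.ReLU())

    def forward(self, f_inst, f_cat):       # [B, No, dim], [B, Nc, dim]
        pi, pc = self.proj_i(f_inst), self.proj_c(f_cat)
        rel = torch.bmm(pi, pc.transpose(1, 2))  # structural relation matrix [B, No, Nc]
        diff_i = torch.bmm(rel.softmax(dim=2), pc)                  # difference feat. per instance point
        diff_c = torch.bmm(rel.softmax(dim=1).transpose(1, 2), pi)  # difference feat. per prior point
        f_inst_en = self.mlp_i(torch.cat([f_inst, diff_i], dim=2))  # enhanced instance features
        f_cat_en = self.mlp_c(torch.cat([f_cat, diff_c], dim=2))    # enhanced category features
        return f_inst_en, f_cat_en, rel
```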
Step 5, inputting the geometric differences between the instance geometric features and the category geometric features, together with the enhanced instance geometric features and the enhanced category geometric features, into a semantic dynamic fusion module, and fusing semantic and geometric information through the semantic dynamic fusion module to obtain instance fusion features and category fusion features.
As shown in fig. 4, the object instance point cloud obtained from the object detection and segmentation model may contain noise points. When the influence of these noise points propagates to the category prior, it degrades the reconstruction accuracy of the NOCS model and causes deviations in the correspondence between the object instance point cloud and its NOCS model. To solve this problem, this embodiment designs a semantic dynamic fusion module that improves the network's robustness to noise points by fully fusing geometric and semantic information.
FIG. 3 illustrates the semantic dynamic fusion module. For the enhanced instance geometric features, it adopts a pixel-level fusion strategy, implemented as a corresponding-point fusion module, to explore the intrinsic mapping between data sources and obtain the instance fusion features. For the enhanced category geometric features and the instance semantic features, which come from different individuals, it uses the geometric differences between the instance geometric features and the category geometric features to dynamically adjust the instance semantic features, and fuses the adjusted features with the enhanced category geometric features to obtain the category fusion features. Concretely, this embodiment implements the corresponding-point fusion module with a pixel-level fusion strategy following the method in DenseFusion, exploring the intrinsic mapping between data sources. For category geometric features and instance semantic features, which come from different individuals, the pixel-level fusion strategy cannot be used directly because they lack pixel-level correspondence, so this embodiment considers two fusion strategies. The first is the general idea of feature fusion, called direct fusion: the two are concatenated and then fused by an MLP. Although the direct fusion strategy can improve performance by absorbing semantic information, it does not adequately handle the cross-individual problem. This embodiment therefore designs a semantic fusion strategy that dynamically adjusts the instance semantic features according to the structural relation matrix between the instance and the category, and then fuses them with the category geometric features.
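A sketch of the two branches under the above description, reusing the relation matrix from the interaction module: per-point concatenation for the pixel-aligned instance branch, and relation-weighted redistribution of instance semantics onto the prior points for the category branch. Feature widths and MLP shapes are assumptions.

```python
import torch
import torch.nn as nn

class SemanticDynamicFusion(nn.Module):
    def __init__(self, geo_dim=128, sem_dim=64):
        super().__init__()
        self.fuse_inst = nn.Sequential(nn.Linear(geo_dim + sem_dim, geo_dim), nn.ReLU())
        self.fuse_cat = nn.Sequential(nn.Linear(geo_dim + sem_dim, geo_dim), nn.ReLU())

    def forward(self, f_inst_en, f_cat_en, f_sem, rel):
        # Instance branch: semantic and geometric features are pixel-aligned,
        # so corresponding-point (per-point) concatenation suffices.
        inst_fused = self.fuse_inst(torch.cat([f_inst_en, f_sem], dim=2))
        # Category branch: no pixel-level correspondence with instance semantics,
        # so semantics are redistributed onto the prior points via the
        # instance-category structural relation matrix before fusing.
        sem_adj = torch.bmm(rel.softmax(dim=1).transpose(1, 2), f_sem)
        cat_fused = self.fuse_cat(torch.cat([f_cat_en, sem_adj], dim=2))
        return inst_fused, cat_fused
```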
Step 6, sending the category fusion features to a deformation network to obtain a deformation field, and deforming the category prior with the deformation field to obtain the instance NOCS model.
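This step can be sketched as a small per-point decoder that regresses the deformation field from the category fusion features and adds it to the prior point cloud; the decoder width is an assumption.

```python
import torch.nn as nn

deform_head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 3))

def reconstruct_nocs(prior_pts, cat_fused):
    """prior_pts: [B, Nc, 3] category prior; cat_fused: [B, Nc, 128] fusion features."""
    deform_field = deform_head(cat_fused)   # per-point 3D offsets
    return prior_pts + deform_field         # instance NOCS model
```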
Step 7, matching the instance NOCS model with the observation point cloud through a matching network, and computing the 6D pose and size of the target object from the similarity.
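Once the matching network has assigned each observed point a NOCS coordinate (e.g., as correspondence-matrix-weighted combinations of the instance NOCS model), recovering the 6D pose and size reduces to solving a similarity transform between the two point sets. A sketch using the standard Umeyama alignment; that per-point correspondences are already available is the assumption here.

```python
import numpy as np

def umeyama_pose(nocs, obs):
    """Solve obs ≈ s * R @ nocs + t for scale s, rotation R, translation t.
    nocs, obs: [N, 3] corresponding points (NOCS model vs. observed cloud)."""
    mu_n, mu_o = nocs.mean(0), obs.mean(0)
    n_c, o_c = nocs - mu_n, obs - mu_o
    cov = o_c.T @ n_c / len(nocs)             # cross-covariance of the two sets
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                        # avoid reflections
    R = U @ S @ Vt                            # 3D rotation
    s = np.trace(np.diag(D) @ S) * len(nocs) / (n_c ** 2).sum()  # isotropic scale
    t = mu_o - s * R @ mu_n                   # 3D translation
    return s, R, t
```

The object size then follows from the scale s applied to the extent of the instance NOCS model.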
As shown in FIG. 5, of the two sets of bounding boxes around each target object, one is the ground truth and the other is the prediction. Compared with the SPD method, the pose estimated by the method of this embodiment is more accurate; in particular for the camera category (the object indicated by the arrow in the figure), whose intra-category shape variation is relatively large, the estimation result of this embodiment is much better than that of SPD, demonstrating that the method handles intra-category shape variation well.
In summary, the invention exploits the structural differences between the object instance and the category prior to enhance the learning of intra-category shape information; the semantic dynamic fusion module further dynamically adjusts the semantic information according to the geometric relationship between the object instance and the category prior, and then fuses it with the enhanced category prior to dynamically compensate for missing geometric information, improving robustness to noise.
Claims (5)
1. A category-level six-degree-of-freedom object pose estimation method based on structural difference perception, characterized by comprising the following steps:
inputting the depth map into a target detection segmentation network to obtain an image block of a target object and a segmentation mask of the target object; obtaining an observation point cloud of an object instance according to the segmentation mask of the target object and the depth map, and selecting a category prior corresponding to the target object based on the observation point cloud of the object instance;
extracting features from the observation point cloud and the category prior to obtain instance geometric features and category geometric features;
inputting the instance geometric features and the category geometric features into an information interaction enhancement module, implicitly modeling the geometric differences between the instance geometric features and the category geometric features through the information interaction enhancement module, and supplementing the instance geometric features and the category geometric features with these differences to obtain enhanced instance geometric features and enhanced category geometric features;
inputting the geometric differences between the instance geometric features and the category geometric features, together with the enhanced instance geometric features and the enhanced category geometric features, into a semantic dynamic fusion module, and fusing semantic and geometric information through the semantic dynamic fusion module to obtain instance fusion features and category fusion features;
the category fusion features are sent to a deformation network to obtain a deformation field, and the category prior is deformed by using the deformation field to obtain an instance NOCS model;
and matching the instance NOCS model with the observation point cloud through a matching network, and computing the 6D pose and size of the target object from the similarity.
2. The category-level six-degree-of-freedom object pose estimation method based on structural difference perception according to claim 1, wherein the target detection segmentation network adopts a Mask-RCNN network.
3. The category-level six-degree-of-freedom object pose estimation method based on structural difference perception according to claim 1, wherein the features of the observation point cloud and the category prior are extracted using a convolutional neural network and a PointNet++ network.
4. The category-level six-degree-of-freedom object pose estimation method based on structural difference perception according to claim 1, wherein the information interaction enhancement module comprises: a fully connected layer for mapping the instance geometric features and the category geometric features into the same feature subspace; a matrix multiplication unit for multiplying the mapped instance geometric features and category geometric features to obtain a structural relation matrix between them; a normalization unit for normalizing the structural relation matrix into weight coefficients; a weighted summation unit for weighting the geometric projection features with these coefficients and summing them to obtain structural difference features; and a multi-layer perceptron for fusing the structural difference features with the instance geometric features and the category geometric features respectively, obtaining the enhanced instance geometric features and enhanced category geometric features.
5. The category-level six-degree-of-freedom object pose estimation method based on structural difference perception according to claim 1, wherein the semantic dynamic fusion module adopts a pixel-level fusion strategy, implemented as a corresponding-point fusion module, for the enhanced instance geometric features, exploring the intrinsic mapping between data sources to obtain the instance fusion features; and, for the enhanced category geometric features and the instance semantic features of different individuals, uses the geometric differences between the instance geometric features and the category geometric features to dynamically adjust the instance semantic features, and fuses the adjusted features with the enhanced category geometric features to obtain the category fusion features.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310052012.8A (granted as CN116245940B) | 2023-02-02 | 2023-02-02 | Category-level six-degree-of-freedom object pose estimation method based on structure difference perception |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN116245940A (application) | 2023-06-09 |
| CN116245940B (grant) | 2024-04-05 |

Family ID: 86634232

Family Applications (1)

| Application Number | Priority Date | Filing Date | Country | Status |
|---|---|---|---|---|
| CN202310052012.8A | 2023-02-02 | 2023-02-02 | CN | Active (granted as CN116245940B) |
Citations (11)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5850469A | 1996-07-09 | 1998-12-15 | General Electric Company | Real time tracking of camera pose |
| CN110119148A | 2019-05-14 | 2019-08-13 | Shenzhen University | Six-degree-of-freedom pose estimation method, device and computer-readable storage medium |
| CN112767478A | 2021-01-08 | 2021-05-07 | Beihang University | Appearance-guidance-based six-degree-of-freedom pose estimation method |
| CN113393503A | 2021-05-24 | 2021-09-14 | Hunan University | Classification-driven shape-prior-deformation category-level object 6D pose estimation method |
| CN114299150A | 2021-12-31 | 2022-04-08 | Hebei University of Technology | Deep 6D pose estimation network model and workpiece pose estimation method |
| KR20220065234A | 2020-11-13 | 2022-05-20 | Plaif Co., Ltd. | Apparatus and method for estimating 6D pose |
| KR20220088289A | 2020-12-18 | 2022-06-27 | Samsung Electronics Co., Ltd. | Apparatus and method for estimating object pose |
| CN114863573A | 2022-07-08 | 2022-08-05 | Southeast University | Category-level 6D pose estimation method based on monocular RGB-D images |
| US20220292698A1 | 2021-03-11 | 2022-09-15 | Fudan University | Network and System for Pose and Size Estimation |
| CN115187748A | 2022-07-14 | 2022-10-14 | Xiangtan University | Category-level object centroid and pose estimation based on point clouds |
| US20220362945A1 | 2021-05-14 | 2022-11-17 | Industrial Technology Research Institute | Object pose estimation system, execution method thereof and graphic user interface |
Non-Patent Citations (3)

- LU ZOU et al., "6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-Based Instance Representation Learning", IEEE
- MENG TIAN et al., "Shape Prior Deformation for Categorical 6D Object Pose and Size Estimation", arXiv:2007.08454v1
- SANG Hanbo et al., "Category-level 6D pose estimation based on deep 3D model representation", Journal of Communication University of China (Natural Science Edition)
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |