CN115761734A - Object pose estimation method based on template matching and probability distribution - Google Patents

Object pose estimation method based on template matching and probability distribution

Info

Publication number
CN115761734A
Authority
CN
China
Prior art keywords
template
target
network
pixel
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211343422.XA
Other languages
Chinese (zh)
Inventor
柯逍
黄森敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University
Priority to CN202211343422.XA
Publication of CN115761734A
Legal status: Pending (Current)

Abstract

The invention relates to an object pose estimation method based on template matching and probability distribution, which comprises the following steps. Step S1: perform semantic segmentation on the pose estimation training set. Step S2: match the targets detected by the semantic segmentation of step S1 against rendered templates to generate an initial viewpoint estimate. Step S3: learn dense 2D-2D correspondences between the input image pixels and the matched template with a deep learning network, and from them derive 2D-3D correspondences between image pixels and the 3D model. Step S4: generate the six-degree-of-freedom pose of the target object with a differentiable PnP layer, using the probability distribution over poses to guide the final pose solution. The method can effectively estimate the pose of a target in an RGB image.

Description

Object pose estimation method based on template matching and probability distribution
Technical Field
The invention relates to the technical field of pattern recognition and computer vision, in particular to an object pose estimation method based on template matching and probability distribution.
Background
Rapid developments in the field of computer vision have contributed significantly to many tasks in robot operation, including object grasping and detection. Object pose estimation refers to estimating the three-dimensional rotation matrix and three-dimensional translation vector of an object with the camera coordinate frame as the origin. With the accurate pose of an object, fine manipulation of the object can be supported; pose estimation is therefore an important technology in the field of robotic grasping and has become a continuing research hotspot in recent years. In the metaverse, truly and accurately representing the spatial state of an object, including acquiring the position and orientation of a target object, is essential for improving the immersive experience. Because object pose estimation has wide application fields and great commercial value, new pose estimation technologies are continuously being explored in both academia and industry.
Although object pose estimation technology has made great progress, existing pose estimation methods that use only RGB images still face many challenges in real environments. Without the guidance of depth information, such algorithms are easily affected by factors such as the camera viewing angle and changes in illumination, and they must cope with image blurring, objects truncated by the camera field of view, and mutual occlusion between objects, all of which make it very difficult to extract reliable features from images.
Disclosure of Invention
In view of this, the present invention provides an object pose estimation method based on template matching and probability distribution, which can effectively estimate the pose of a target in an RGB image.
In order to achieve the above purpose, the invention adopts the following technical scheme: an object pose estimation method based on template matching and probability distribution, comprising the following steps:
Step S1: performing semantic segmentation on the pose estimation training set;
Step S2: matching the targets detected by the semantic segmentation of step S1 against rendered templates to generate an initial viewpoint estimate;
Step S3: learning dense 2D-2D correspondences between the input image pixels and the matched template with a deep learning network, and from them deriving 2D-3D correspondences between image pixels and the 3D model;
Step S4: generating the six-degree-of-freedom pose of the target object with a differentiable PnP layer, and using the probability distribution over poses to guide the final pose solution.
In a preferred embodiment, step S1 specifically comprises the following steps:
Step S11: acquiring a public pose estimation data set from the Internet to obtain the RGB images and 3D models used for model training;
Step S12: extracting features from an input image I with a pre-trained semantic segmentation network S_FE. First, features are computed at the N depth levels of the network, indexed by k with k ∈ ℕ and 0 ≤ k < N, where ℕ denotes the set of non-negative integers and N denotes the number of depth levels of the semantic segmentation network. The sub-network of S_FE at the k-th depth is denoted S_FE^k and operates on inputs in ℝ^(H_k×W_k×D_k), where ℝ denotes the set of real numbers, H_k and W_k denote the height and width of the image input to the k-th depth network, and D_k denotes the number of channels input to the k-th depth network. The feature extractor spatially average-pools each depth-k feature map along its height and width, producing a single vector of length D_k;
Step S13: describing the textured 3D model with a set of viewpoint-based templates generated by rendering the object, so as to compute a per-pixel correlation between the feature map of the object image and the features of the object descriptor.
First, the RGB image of the input target is converted at the k-th depth into a 3D feature tensor f_k ∈ ℝ^(H_k×W_k×D_k).
The target templates sampled along the x, y and z axes are then converted into the corresponding dense model descriptor o_k ∈ ℝ^(X_k×Y_k×Z_k×D_k), where the model descriptor o_k represents all templates rendered from virtual viewpoints on a sphere around the object at the k-th depth, X_k and Y_k index the camera positions in the target coordinate system at the k-th depth, Z_k indexes the in-plane rotations at the k-th depth, D_k denotes the dimensionality of the features extracted from each template by the segmentation network at the k-th depth, and the templates are rendered from the set of points of the object's 3D model.
Finally, the feature map f_k is matched against the entire object descriptor o_k to produce a correlation tensor of size H_k × W_k × (X_k·Y_k·Z_k). Each pixel of the target image feature map thus obtains the list of correlations between its feature vector and all feature vectors of the descriptor. The correlations are used to compute per-pixel attention over the image, the original image features are combined into the feature tensor, and the obtained attention is used to perform a more accurate segmentation;
Step S14: the decoder uses a UNet-like structure, first upsampling the feature map and then applying convolutional layers, repeating until the original image size is reached, and concatenating (stacking) features at each stage;
Step S15: the network is trained to predict, for each pixel, the probability that it belongs to the visible part of the target, and a binary segmentation mask of the target object is generated and output.
In a preferred embodiment, step S2 specifically comprises the following steps:
Step S21: the initial viewpoint estimate is obtained by template matching, which relies on the same semantic segmentation feature extraction network S_FE as step S1 but uses only the features of its last layer and adds a 1×1 convolution layer to reduce the dimension from H×W×D to H×W×D′, where H and W denote the height and width of the input image, D denotes the number of channels input to the network, and D′ denotes the number of channels after D passes through the 1×1 convolution layer.
Template features t ∈ ℝ^(H×W×D′) are pre-computed on the foreground of the query template, and image features f ∈ ℝ^(H×W×D′) are computed on the target foreground of the target image, represented by the segmentation mask predicted in step S1.
The per-pixel correlation between f and t is then calculated as
sim(f, t) = (1/(H·W)) Σ_{h,w} corr(f_{h,w}, t_{h,w}),
where sim(f, t) denotes the per-pixel correlation between f and t, corr denotes the Pearson correlation, h and w denote pixel coordinates, f_{h,w} denotes the image feature at pixel (h, w), and t_{h,w} denotes the template feature at pixel (h, w);
Step S22: the network is trained to give priority to features of surface regions whose object rotations are very close, while penalizing features with poor rotational similarity, using a modified triplet loss with margin L_triplet to optimize the pose, where f_anchor is the descriptor of a randomly selected target surface region, f_+ denotes a template whose pose is very similar to that of f_anchor, f_- denotes a template whose pose differs from that of f_anchor, and the margin u is set to the angle between the target rotations of f_anchor and f_+.
Finally, at test time, the query template with the highest similarity to the target detected in the target image is selected as the match.
In a preferred embodiment, the specific method of step S3 is as follows:
Step S31: during training, an object crop I_obj and its ground-truth pose G_obj ∈ SE(3) are randomly sampled from the data set, where SE(3) denotes the special Euclidean group, i.e. the three-dimensional transformations of the object, including rotation and translation. Then a random template I_tmp with ground-truth template pose G_tmp ∈ SE(3) is selected such that G_obj and G_tmp are relatively close, and the 2D-3D correspondence map of every pixel in the two patches is further computed.
Let C denote the dense 2D-3D correspondence map of an object rendered in a given pose, which maps each object pixel to the normalized 3D coordinates of the corresponding model point. Its inverse C⁻¹ recomputes the correspondences in non-normalized target coordinates, i.e. the actual 3D target coordinates, and defines the distance of a 2D correspondence pair in the 3D coordinate space of the target model as
d(p, p′) = ||C⁻¹(I_obj)_p − C⁻¹(I_tmp)_{p′}||_2,
where p and p′ are the pixel coordinates in the image crop and the template crop, respectively. Ground-truth dense 2D-2D correspondences are established by matching pixel pairs that correspond to the closest points in the 3D coordinate system of the model; for a point p ∈ I_obj, the corresponding template point is
p* = argmin_{p′ ∈ I_tmp} d(p, p′),
where argmin(·) denotes the argmin function, which returns the value of the variable that minimizes the objective function d(p, p′);
Step S32: 2D-2D correspondences with a large 3D spatial difference are rejected by outlier-aware rejection. The segmentation loss is defined as a per-pixel-block loss, and the Dice loss
L_Dice = 1 − 2|A ∩ B| / (|A| + |B|)
is used to handle the unbalanced class data, where A denotes the segmentation ground truth, B denotes the binary mask predicted by the segmentation network, and |A| and |B| denote the numbers of pixels of the segmentation ground truth and the predicted mask, respectively.
Finally, the network predicts discrete 2D coordinates using a standard per-pixel cross-entropy classification loss.
In a preferred embodiment, step S4 specifically comprises the following steps:
Step S41: 2D-2D correspondences are estimated independently from each of the top S matching templates, pose hypotheses are estimated from them, and matching templates that are too close to each other are removed with a greedy strategy;
Step S42: the non-differentiable deterministic PnP operation is converted into a differentiable probabilistic layer, which gives unprecedented flexibility to end-to-end 2D-3D correspondence learning. The output of PnP is interpreted as a probability distribution parameterized by the learnable 2D-3D correspondences. During training, the Kullback-Leibler divergence between the predicted and target pose distributions is used as the loss function, and a high-quality pose distribution is obtained by efficient Monte Carlo pose sampling; in the inference stage, the derivative of the cost function is regularized, and the optimal pose solution is finally obtained with a PnP optimization solver.
Compared with the prior art, the invention has the following beneficial effects:
1. Using a semantic segmentation network to extract the features of the target object allows the pose estimation problem to be converted into a classification problem, which makes the pose estimation task simpler.
2. Computing pixel-level correlations between image features and descriptor features significantly improves the performance of the network and makes the network aware of the object's three-dimensional feature information at the pixel level.
3. The dense 2D-2D matching network provides robustness against large angular differences between the pose of the target region and the pose in the template, and increases the benefit of template matching.
4. Screening out matching templates with larger errors promotes the convergence of the network, and the differentiable probabilistic layer removes the need for additional explicit depth estimation or pose refinement, which simplifies the network architecture and effectively improves object pose estimation in unconstrained scenes.
Drawings
Fig. 1 is a flow chart of the implementation of the preferred embodiment of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application; as used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
An object pose estimation method based on template matching and probability distribution, referring to Fig. 1, comprises the following steps:
Step S1: performing semantic segmentation on the pose estimation training set;
Step S2: matching the targets detected by the semantic segmentation of step S1 against rendered templates to generate an initial viewpoint estimate;
Step S3: learning dense 2D-2D correspondences between the input image pixels and the matched template with a deep learning network, and from them deriving 2D-3D correspondences between image pixels and the 3D model;
Step S4: generating the six-degree-of-freedom pose of the target object with a differentiable PnP layer, and using the probability distribution over poses to guide the final pose solution.
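The six-degree-of-freedom information referred to in step S4 is an element of SE(3), i.e. a 3D rotation together with a 3D translation. Purely as an illustration of this representation (and not of the patented method itself), the following Python sketch builds the conventional 4×4 homogeneous pose matrix from an axis-angle rotation and a translation vector and applies it to one model point; all numeric values are arbitrary examples.

```python
import numpy as np

def axis_angle_to_matrix(rotvec: np.ndarray) -> np.ndarray:
    """Rodrigues' formula: axis-angle vector -> 3x3 rotation matrix."""
    theta = np.linalg.norm(rotvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rotvec / theta                      # unit rotation axis
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])        # cross-product matrix of k
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def se3_matrix(rotvec: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Pack rotation and translation into a 4x4 homogeneous SE(3) matrix."""
    T = np.eye(4)
    T[:3, :3] = axis_angle_to_matrix(rotvec)
    T[:3, 3] = t
    return T

if __name__ == "__main__":
    # Example 6-DoF pose: 30 degrees about z, translated along x and z.
    pose = se3_matrix(np.array([0.0, 0.0, np.deg2rad(30)]),
                      np.array([0.5, 0.0, 1.0]))
    model_point = np.array([0.1, 0.0, 0.0, 1.0])   # homogeneous 3D model point
    print(pose @ model_point)                      # point in camera coordinates
```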
Step S1 specifically comprises the following steps:
Step S11: acquiring a public pose estimation data set from the Internet to obtain the RGB images and 3D models used for model training;
Step S12: extracting features from an input image I with a pre-trained semantic segmentation network S_FE. First, features are computed at the N depth levels of the network, indexed by k with k ∈ ℕ and 0 ≤ k < N, where ℕ denotes the set of non-negative integers and N denotes the number of depth levels of the semantic segmentation network. The sub-network of S_FE at the k-th depth is denoted S_FE^k and operates on inputs in ℝ^(H_k×W_k×D_k), where ℝ denotes the set of real numbers, H_k and W_k denote the height and width of the image input to the k-th depth network, and D_k denotes the number of channels input to the k-th depth network. The feature extractor spatially average-pools each depth-k feature map along its height and width, producing a single vector of length D_k;
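A minimal sketch of the multi-depth feature extraction of step S12 is given below, assuming a toy convolutional backbone: the number of depth levels, the channel widths and the pooling operator are illustrative assumptions rather than the network S_FE actually used. Each level halves the spatial resolution, and averaging each feature map over its height and width yields the single vector of length D_k mentioned above.

```python
import torch
import torch.nn as nn

class MultiDepthExtractor(nn.Module):
    """Toy stand-in for a multi-depth encoder: N levels, each halving H and W."""
    def __init__(self, in_channels: int = 3, widths=(32, 64, 128, 256)):
        super().__init__()
        blocks, prev = [], in_channels
        for d in widths:                               # one block per depth level k
            blocks.append(nn.Sequential(
                nn.Conv2d(prev, d, 3, stride=2, padding=1),
                nn.BatchNorm2d(d),
                nn.ReLU(inplace=True)))
            prev = d
        self.blocks = nn.ModuleList(blocks)

    def forward(self, image: torch.Tensor):
        feats, pooled, x = [], [], image
        for block in self.blocks:
            x = block(x)                               # f_k: (B, D_k, H_k, W_k)
            feats.append(x)
            pooled.append(x.mean(dim=(2, 3)))          # average over H_k, W_k -> vector of length D_k
        return feats, pooled

if __name__ == "__main__":
    net = MultiDepthExtractor()
    feats, pooled = net(torch.randn(1, 3, 128, 128))
    print([f.shape for f in feats])    # per-depth feature maps f_k
    print([p.shape for p in pooled])   # per-depth pooled vectors of length D_k
```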
Step S13: describing the textured 3D model with a set of viewpoint-based templates generated by rendering the object, so as to compute a per-pixel correlation between the feature map of the object image and the features of the object descriptor.
First, the RGB image of the input target is converted at the k-th depth into a 3D feature tensor f_k ∈ ℝ^(H_k×W_k×D_k).
The target templates sampled along the x, y and z axes are then converted into the corresponding dense model descriptor o_k ∈ ℝ^(X_k×Y_k×Z_k×D_k), where the model descriptor o_k represents all templates rendered from virtual viewpoints on a sphere around the object at the k-th depth, X_k and Y_k index the camera positions in the target coordinate system at the k-th depth, Z_k indexes the in-plane rotations at the k-th depth, D_k denotes the dimensionality of the features extracted from each template by the segmentation network at the k-th depth, and the templates are rendered from the set of points of the object's 3D model.
Finally, the feature map f_k is matched against the entire object descriptor o_k to produce a correlation tensor of size H_k × W_k × (X_k·Y_k·Z_k). Each pixel of the target image feature map thus obtains the list of correlations between its feature vector and all feature vectors of the descriptor. The correlations are used to compute per-pixel attention over the image, the original image features are combined into the feature tensor, and the obtained attention is used to perform a more accurate segmentation;
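The correlation tensor of step S13 can be sketched as follows: each pixel feature of f_k is compared with every viewpoint descriptor vector of o_k, producing a tensor of shape H_k × W_k × (X_k·Y_k·Z_k). The cosine-style similarity and the tensor shapes below are assumptions for illustration, not the exact formulation of the invention.

```python
import torch
import torch.nn.functional as F

def correlation_tensor(f_k: torch.Tensor, o_k: torch.Tensor) -> torch.Tensor:
    """
    f_k : (H, W, D)      per-pixel image features at depth k
    o_k : (X, Y, Z, D)   descriptor vectors for all rendered viewpoints
    returns (H, W, X*Y*Z) correlations of each pixel with each viewpoint.
    """
    H, W, D = f_k.shape
    views = o_k.reshape(-1, D)                         # (V, D), V = X*Y*Z
    f = F.normalize(f_k.reshape(-1, D), dim=1)         # unit-length pixel features
    v = F.normalize(views, dim=1)                      # unit-length viewpoint features
    corr = f @ v.t()                                   # (H*W, V) cosine similarities
    return corr.reshape(H, W, -1)

if __name__ == "__main__":
    f_k = torch.randn(16, 16, 64)
    o_k = torch.randn(8, 8, 4, 64)                     # 8x8 viewpoints, 4 in-plane rotations
    print(correlation_tensor(f_k, o_k).shape)          # torch.Size([16, 16, 256])
```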
Step S14: the decoder uses a UNet-like structure, first upsampling the feature map and then applying convolutional layers, repeating until the original image size is reached, and concatenating (stacking) features at each stage;
Step S15: the network is trained to predict, for each pixel, the probability that it belongs to the visible part of the target, and a binary segmentation mask of the target object is generated and output.
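Steps S14 and S15 describe a UNet-like decoder that upsamples, concatenates encoder features and convolves until the input resolution is reached, then predicts a per-pixel foreground probability. The sketch below follows this generic pattern; the channel widths, the bilinear upsampling and the sigmoid mask head are assumptions, not the specific architecture of the invention.

```python
import torch
import torch.nn as nn

class DecoderStage(nn.Module):
    """One decoder stage: upsample, concatenate the skip feature, convolve."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True))

    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)      # "stacking" of decoder and encoder features
        return self.conv(x)

class MaskHead(nn.Module):
    """Final 1x1 convolution + sigmoid giving per-pixel foreground probability."""
    def __init__(self, ch: int):
        super().__init__()
        self.out = nn.Conv2d(ch, 1, 1)

    def forward(self, x):
        return torch.sigmoid(self.out(x))    # probability that the pixel belongs to the visible target

if __name__ == "__main__":
    stage = DecoderStage(in_ch=128, skip_ch=64, out_ch=64)
    head = MaskHead(64)
    deep, skip = torch.randn(1, 128, 16, 16), torch.randn(1, 64, 32, 32)
    mask = head(stage(deep, skip))
    print(mask.shape)                        # (1, 1, 32, 32) mask probabilities
```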
Step S2 specifically comprises the following steps:
Step S21: the initial viewpoint estimate is obtained by template matching, which relies on the same semantic segmentation feature extraction network S_FE as step S1 but uses only the features of its last layer and adds a 1×1 convolution layer to reduce the dimension from H×W×D to H×W×D′, where H and W denote the height and width of the input image, D denotes the number of channels input to the network, and D′ denotes the number of channels after D passes through the 1×1 convolution layer.
Template features t ∈ ℝ^(H×W×D′) are pre-computed on the foreground of the query template, and image features f ∈ ℝ^(H×W×D′) are computed on the target foreground of the target image, represented by the segmentation mask predicted in step S1.
The per-pixel correlation between f and t is then calculated as
sim(f, t) = (1/(H·W)) Σ_{h,w} corr(f_{h,w}, t_{h,w}),
where sim(f, t) denotes the per-pixel correlation between f and t, corr denotes the Pearson correlation, h and w denote pixel coordinates, f_{h,w} denotes the image feature at pixel (h, w), and t_{h,w} denotes the template feature at pixel (h, w);
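The per-pixel Pearson correlation of step S21 can be sketched as below, assuming the scores corr(f_{h,w}, t_{h,w}) are averaged over the foreground pixels given by the step-S1 mask; the averaging and the foreground restriction are assumptions for illustration.

```python
import torch

def pearson(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pearson correlation between two 1-D feature vectors."""
    a = a - a.mean()
    b = b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm() + 1e-8)

def sim(f: torch.Tensor, t: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """
    f, t : (H, W, D') image and template features
    mask : (H, W) boolean foreground mask from step S1
    Returns the mean per-pixel Pearson correlation over foreground pixels.
    """
    scores = []
    for h, w in mask.nonzero(as_tuple=False):
        scores.append(pearson(f[h, w], t[h, w]))
    return torch.stack(scores).mean()

if __name__ == "__main__":
    f, t = torch.randn(8, 8, 32), torch.randn(8, 8, 32)
    mask = torch.zeros(8, 8, dtype=torch.bool)
    mask[2:6, 2:6] = True
    print(sim(f, t, mask))     # higher values = better template match
```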
Step S22: the network is trained to give priority to features of surface regions whose object rotations are very close, while penalizing features with poor rotational similarity, using a modified triplet loss with margin L_triplet to optimize the pose, where f_anchor is the descriptor of a randomly selected target surface region, f_+ denotes a template whose pose is very similar to that of f_anchor, f_- denotes a template whose pose differs from that of f_anchor, and the margin u is set to the angle between the target rotations of f_anchor and f_+.
Finally, at test time, the query template with the highest similarity to the target detected in the target image is selected as the match.
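The exact form of the modified triplet loss with margin in step S22 is not reproduced in this text; the sketch below implements one common dynamic-margin variant in which the margin grows with the rotation angle u between the anchor pose and the positive-template pose. Both the functional form and the margin scale are assumptions.

```python
import torch

def dynamic_margin_triplet(f_anchor, f_pos, f_neg, angle_u, scale=0.01):
    """
    f_anchor, f_pos, f_neg : (D,) descriptors of the anchor region, the
                             similar-pose template and the different-pose template.
    angle_u                : rotation angle (radians) between the anchor pose
                             and the positive-template pose, used as margin.
    """
    d_pos = (f_anchor - f_pos).pow(2).sum()
    d_neg = (f_anchor - f_neg).pow(2).sum()
    margin = scale * angle_u          # assumed margin schedule
    return torch.clamp(d_pos - d_neg + margin, min=0.0)

if __name__ == "__main__":
    f_a, f_p, f_n = torch.randn(3, 128).unbind(0)
    print(dynamic_margin_triplet(f_a, f_p, f_n, angle_u=torch.tensor(0.2)))
```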
The specific method of step S3 is as follows:
Step S31: during training, an object crop I_obj and its ground-truth pose G_obj ∈ SE(3) are randomly sampled from the data set, where SE(3) denotes the special Euclidean group, i.e. the three-dimensional transformations of the object, including rotation and translation. Then a random template I_tmp with ground-truth template pose G_tmp ∈ SE(3) is selected such that G_obj and G_tmp are relatively close, and the 2D-3D correspondence map of every pixel in the two patches is further computed.
Let C denote the dense 2D-3D correspondence map of an object rendered in a given pose, which maps each object pixel to the normalized 3D coordinates of the corresponding model point. Its inverse C⁻¹ recomputes the correspondences in non-normalized target coordinates, i.e. the actual 3D target coordinates, and defines the distance of a 2D correspondence pair in the 3D coordinate space of the target model as
d(p, p′) = ||C⁻¹(I_obj)_p − C⁻¹(I_tmp)_{p′}||_2,
where p and p′ are the pixel coordinates in the image crop and the template crop, respectively. Ground-truth dense 2D-2D correspondences are established by matching pixel pairs that correspond to the closest points in the 3D coordinate system of the model; for a point p ∈ I_obj, the corresponding template point is
p* = argmin_{p′ ∈ I_tmp} d(p, p′),
where argmin(·) denotes the argmin function, which returns the value of the variable that minimizes the objective function d(p, p′);
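The ground-truth 2D-2D correspondences of step S31 pair each object pixel with the template pixel whose 3D model coordinate (given by the C⁻¹ map) is closest, and pairs with a large 3D difference are rejected as in step S32. The sketch below assumes the per-pixel 3D coordinate maps are already available as arrays and uses a brute-force nearest-neighbour search and an arbitrary rejection threshold for illustration.

```python
import numpy as np

def dense_2d2d_matches(xyz_obj: np.ndarray, mask_obj: np.ndarray,
                       xyz_tmp: np.ndarray, mask_tmp: np.ndarray,
                       max_dist: float = 0.01):
    """
    xyz_obj, xyz_tmp   : (H, W, 3) per-pixel 3D model coordinates (C^-1 maps)
    mask_obj, mask_tmp : (H, W) boolean foreground masks
    Returns a list of ((h, w), (h', w')) pixel pairs whose 3D points are
    closest and within max_dist (outlier-aware rejection).
    """
    obj_px = np.argwhere(mask_obj)                  # (N, 2) pixel coords in the crop
    tmp_px = np.argwhere(mask_tmp)                  # (M, 2) pixel coords in the template
    obj_xyz = xyz_obj[mask_obj]                     # (N, 3)
    tmp_xyz = xyz_tmp[mask_tmp]                     # (M, 3)
    matches = []
    for i, p3d in enumerate(obj_xyz):
        d = np.linalg.norm(tmp_xyz - p3d, axis=1)   # distances d(p, p') in model space
        j = int(np.argmin(d))
        if d[j] <= max_dist:                        # reject pairs with large 3D difference
            matches.append((tuple(obj_px[i]), tuple(tmp_px[j])))
    return matches

if __name__ == "__main__":
    xyz = np.random.rand(16, 16, 3)
    mask = np.zeros((16, 16), dtype=bool)
    mask[4:12, 4:12] = True
    print(len(dense_2d2d_matches(xyz, mask, xyz + 0.001, mask)))
```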
Step S32: 2D-2D correspondences with a large 3D spatial difference are rejected by outlier-aware rejection. The segmentation loss is defined as a per-pixel-block loss, and the Dice loss
L_Dice = 1 − 2|A ∩ B| / (|A| + |B|)
is used to handle the unbalanced class data, where A denotes the segmentation ground truth, B denotes the binary mask predicted by the segmentation network, and |A| and |B| denote the numbers of pixels of the segmentation ground truth and the predicted mask, respectively.
Finally, the network predicts discrete 2D coordinates using a standard per-pixel cross-entropy classification loss.
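The Dice loss of step S32 measures the overlap between the predicted mask B and the ground-truth mask A independently of the class balance. A soft, differentiable variant is sketched below; the smoothing constant eps is an implementation choice, not part of the invention.

```python
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """
    pred   : (B, H, W) predicted foreground probabilities (mask B)
    target : (B, H, W) binary ground-truth mask (mask A)
    Returns 1 - 2|A∩B| / (|A| + |B|), averaged over the batch.
    """
    inter = (pred * target).sum(dim=(1, 2))          # soft |A ∩ B|
    total = pred.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    dice = (2 * inter + eps) / (total + eps)
    return (1 - dice).mean()

if __name__ == "__main__":
    pred = torch.rand(2, 32, 32)
    target = (torch.rand(2, 32, 32) > 0.5).float()
    print(dice_loss(pred, target))
```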
Step S4 specifically comprises the following steps:
Step S41: 2D-2D correspondences are estimated independently from each of the top S matching templates, pose hypotheses are generated from them, and matching templates that are too close to each other are removed with a greedy strategy;
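Step S41 keeps only matching templates whose poses are sufficiently distinct. One possible greedy strategy (an assumption about the exact selection rule) is to visit the hypotheses in order of matching score and discard any whose rotation lies within a threshold angle of an already accepted hypothesis, as sketched below.

```python
import numpy as np

def rotation_angle(R1: np.ndarray, R2: np.ndarray) -> float:
    """Geodesic angle (radians) between two 3x3 rotation matrices."""
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def greedy_select(rotations, scores, min_angle=np.deg2rad(15), top_s=5):
    """Keep at most top_s hypotheses, best score first, pairwise >= min_angle apart."""
    order = np.argsort(scores)[::-1]                 # best matching score first
    kept = []
    for idx in order:
        if all(rotation_angle(rotations[idx], rotations[k]) >= min_angle for k in kept):
            kept.append(idx)
        if len(kept) == top_s:
            break
    return kept

if __name__ == "__main__":
    rots = [np.eye(3) for _ in range(4)]             # toy input: identical rotations
    print(greedy_select(rots, scores=[0.9, 0.8, 0.7, 0.95]))   # keeps only one hypothesis
```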
Step S42: the non-differentiable deterministic PnP operation is converted into a differentiable probabilistic layer, which gives unprecedented flexibility to end-to-end 2D-3D correspondence learning. The output of PnP is interpreted as a probability distribution parameterized by the learnable 2D-3D correspondences. During training, the Kullback-Leibler divergence between the predicted and target pose distributions is used as the loss function, and a high-quality pose distribution is obtained by efficient Monte Carlo pose sampling; in the inference stage, the derivative of the cost function is regularized, and the optimal pose solution is finally obtained with a PnP optimization solver.
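Step S42 treats PnP as a probabilistic layer: reprojection errors define an unnormalized density over poses, the partition function is approximated with Monte Carlo pose samples, and the Kullback-Leibler divergence to the target pose distribution is minimized, which for a single ground-truth pose reduces to a negative log-likelihood. The sketch below is a strongly simplified illustration of that loss; the pinhole projection model, the sampling scheme and the learnable per-correspondence weights are all assumptions.

```python
import torch

def project(points, pose_R, pose_t, K):
    """Pinhole projection of (N,3) model points under rotation R, translation t."""
    cam = points @ pose_R.T + pose_t            # (N, 3) points in the camera frame
    uv = cam[:, :2] / cam[:, 2:3]               # perspective division
    return uv @ K[:2, :2].T + K[:2, 2]          # apply focal lengths and principal point

def pose_cost(x3d, x2d, w, R, t, K):
    """Weighted sum of squared reprojection errors for one pose hypothesis."""
    err = project(x3d, R, t, K) - x2d
    return (w * err.pow(2).sum(dim=1)).sum()

def monte_carlo_kl_loss(x3d, x2d, w, K, R_gt, t_gt, pose_samples):
    """
    KL-style loss for a single ground-truth pose: negative log of the
    Monte-Carlo-normalized density that the probabilistic PnP layer
    assigns to the true pose.  pose_samples: list of (R, t) hypotheses
    used to approximate the partition function.
    """
    cost_gt = pose_cost(x3d, x2d, w, R_gt, t_gt, K)
    sample_costs = torch.stack([pose_cost(x3d, x2d, w, R, t, K) for R, t in pose_samples])
    log_partition = torch.logsumexp(-sample_costs, dim=0)
    return cost_gt + log_partition              # -log q(gt) + log Z_hat

if __name__ == "__main__":
    N = 50
    x3d = torch.randn(N, 3) * 0.05
    K = torch.tensor([[500.0, 0.0, 64.0], [0.0, 500.0, 64.0], [0.0, 0.0, 1.0]])
    R_gt, t_gt = torch.eye(3), torch.tensor([0.0, 0.0, 0.5])
    x2d = project(x3d, R_gt, t_gt, K)
    w = torch.full((N,), 1.0, requires_grad=True)   # learnable correspondence weights
    samples = [(torch.eye(3), t_gt + 0.01 * torch.randn(3)) for _ in range(32)]
    loss = monte_carlo_kl_loss(x3d, x2d, w, K, R_gt, t_gt, samples)
    loss.backward()                                  # gradients flow back to the 2D-3D weights
    print(loss.item(), w.grad is not None)
```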
The above are preferred embodiments of the present invention. All changes made according to the technical scheme of the present invention that produce equivalent functional effects and do not exceed the scope of the technical scheme of the present invention belong to the protection scope of the present invention.

Claims (5)

1. An object pose estimation method based on template matching and probability distribution, characterized by comprising the following steps:
step S1: performing semantic segmentation on the pose estimation training set;
step S2: matching the targets detected by the semantic segmentation of step S1 against rendered templates to generate an initial viewpoint estimate;
step S3: learning dense 2D-2D correspondences between the input image pixels and the matched template with a deep learning network, and from them deriving 2D-3D correspondences between image pixels and the 3D model;
step S4: generating the six-degree-of-freedom pose of the target object with a differentiable PnP layer, and using the probability distribution over poses to guide the final pose solution.
2. The object pose estimation method based on template matching and probability distribution according to claim 1, wherein step S1 specifically comprises the following steps:
step S11: acquiring a public pose estimation data set from the Internet to obtain the RGB images and 3D models used for model training;
step S12: extracting features from an input image I with a pre-trained semantic segmentation network S_FE; first, features are computed at the N depth levels of the network, indexed by k with k ∈ ℕ and 0 ≤ k < N, wherein ℕ denotes the set of non-negative integers and N denotes the number of depth levels of the semantic segmentation network; the sub-network of S_FE at the k-th depth is denoted S_FE^k and operates on inputs in ℝ^(H_k×W_k×D_k), wherein ℝ denotes the set of real numbers, H_k and W_k denote the height and width of the image input to the k-th depth network, and D_k denotes the number of channels input to the k-th depth network; the feature extractor spatially average-pools each depth-k feature map along its height and width to produce a single vector of length D_k;
step S13: describing the textured 3D model with a set of viewpoint-based templates generated by rendering the object, so as to compute a per-pixel correlation between the feature map of the object image and the features of the object descriptor;
first, the RGB image of the input target is converted at the k-th depth into a 3D feature tensor f_k ∈ ℝ^(H_k×W_k×D_k);
the target templates sampled along the x, y and z axes are then converted into the corresponding dense model descriptor o_k ∈ ℝ^(X_k×Y_k×Z_k×D_k), wherein the model descriptor o_k represents all templates rendered from virtual viewpoints on a sphere around the object at the k-th depth, X_k and Y_k index the camera positions in the target coordinate system at the k-th depth, Z_k indexes the in-plane rotations at the k-th depth, D_k denotes the dimensionality of the features extracted from each template by the segmentation network at the k-th depth, and the templates are rendered from the set of points of the object's 3D model;
finally, the feature map f_k is matched against the entire object descriptor o_k to produce a correlation tensor of size H_k × W_k × (X_k·Y_k·Z_k); each pixel of the target image feature map thus obtains the list of correlations between its feature vector and all feature vectors of the descriptor; the correlations are used to compute per-pixel attention over the image, the original image features are combined into the feature tensor, and the obtained attention is used to perform a more accurate segmentation;
step S14: the decoder uses a UNet-like structure, first upsampling the feature map and then applying convolutional layers, repeating until the original image size is reached, and concatenating (stacking) features at each stage;
step S15: the network is trained to predict, for each pixel, the probability that it belongs to the visible part of the target, and a binary segmentation mask of the target object is generated and output.
3. The object pose estimation method based on template matching and probability distribution according to claim 1, wherein step S2 specifically comprises the following steps:
step S21: the initial viewpoint estimate is obtained by template matching, which relies on the same semantic segmentation feature extraction network S_FE as step S1 but uses only the features of its last layer and adds a 1×1 convolution layer to reduce the dimension from H×W×D to H×W×D′, wherein H and W denote the height and width of the input image, D denotes the number of channels input to the network, and D′ denotes the number of channels after D passes through the 1×1 convolution layer;
template features t ∈ ℝ^(H×W×D′) are pre-computed on the foreground of the query template, and image features f ∈ ℝ^(H×W×D′) are computed on the target foreground of the target image represented by the segmentation mask predicted in step S1;
the per-pixel correlation between f and t is then calculated as sim(f, t) = (1/(H·W)) Σ_{h,w} corr(f_{h,w}, t_{h,w}), wherein sim(f, t) denotes the per-pixel correlation between f and t, corr denotes the Pearson correlation, h and w denote pixel coordinates, f_{h,w} denotes the image feature at pixel (h, w), and t_{h,w} denotes the template feature at pixel (h, w);
step S22: the network is trained to give priority to features of surface regions whose object rotations are very close while penalizing features with poor rotational similarity, using a modified triplet loss with margin L_triplet to optimize the pose, wherein f_anchor is the descriptor of a randomly selected target surface region, f_+ denotes a template whose pose is very similar to that of f_anchor, f_- denotes a template whose pose differs from that of f_anchor, and the margin u is set to the angle between the target rotations of f_anchor and f_+;
finally, at test time, the query template with the highest similarity to the target detected in the target image is selected as the match.
4. The object pose estimation method based on template matching and probability distribution according to claim 1, wherein the specific method of step S3 is as follows:
step S31: during training, an object crop I_obj and its ground-truth pose G_obj ∈ SE(3) are randomly sampled from the data set, wherein SE(3) denotes the special Euclidean group, i.e. the three-dimensional transformations of the object, including rotation and translation; then a random template I_tmp with ground-truth template pose G_tmp ∈ SE(3) is selected such that G_obj and G_tmp are relatively close, and the 2D-3D correspondence map of every pixel in the two patches is further computed;
let C denote the dense 2D-3D correspondence map of an object rendered in a given pose, which maps each object pixel to the normalized 3D coordinates of the corresponding model point; its inverse C⁻¹ recomputes the correspondences in non-normalized target coordinates, i.e. the actual 3D target coordinates, and defines the distance of a 2D correspondence pair in the 3D coordinate space of the target model as d(p, p′) = ||C⁻¹(I_obj)_p − C⁻¹(I_tmp)_{p′}||_2, wherein p and p′ are the pixel coordinates in the image crop and the template crop, respectively; ground-truth dense 2D-2D correspondences are established by matching pixel pairs that correspond to the closest points in the 3D coordinate system of the model; for a point p ∈ I_obj, the corresponding template point is p* = argmin_{p′ ∈ I_tmp} d(p, p′), wherein argmin(·) denotes the argmin function, which returns the value of the variable that minimizes the objective function d(p, p′);
step S32: 2D-2D correspondences with a large 3D spatial difference are rejected by outlier-aware rejection; the segmentation loss is defined as a per-pixel-block loss, and the Dice loss L_Dice = 1 − 2|A ∩ B| / (|A| + |B|) is used to handle the unbalanced class data, wherein A denotes the segmentation ground truth, B denotes the binary mask predicted by the segmentation network, and |A| and |B| denote the numbers of pixels of the segmentation ground truth and the predicted mask, respectively;
finally, the network predicts discrete 2D coordinates using a standard per-pixel cross-entropy classification loss.
5. The object pose estimation method based on template matching and probability distribution according to claim 1, wherein step S4 specifically comprises the following steps:
step S41: 2D-2D correspondences are estimated independently from each of the top S matching templates, pose hypotheses are generated from them, and matching templates that are too close to each other are removed with a greedy strategy;
step S42: the non-differentiable deterministic PnP operation is converted into a differentiable probabilistic layer, which gives unprecedented flexibility to end-to-end 2D-3D correspondence learning; the output of PnP is interpreted as a probability distribution parameterized by the learnable 2D-3D correspondences; during training, the Kullback-Leibler divergence between the predicted and target pose distributions is used as the loss function, and a high-quality pose distribution is obtained by efficient Monte Carlo pose sampling; in the inference stage, the derivative of the cost function is regularized, and the optimal pose solution is finally obtained with a PnP optimization solver.
CN202211343422.XA 2022-10-29 2022-10-29 Object pose estimation method based on template matching and probability distribution Pending CN115761734A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211343422.XA CN115761734A (en) 2022-10-29 2022-10-29 Object pose estimation method based on template matching and probability distribution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211343422.XA CN115761734A (en) 2022-10-29 2022-10-29 Object pose estimation method based on template matching and probability distribution

Publications (1)

Publication Number Publication Date
CN115761734A true CN115761734A (en) 2023-03-07

Family

ID=85354355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211343422.XA Pending CN115761734A (en) 2022-10-29 2022-10-29 Object pose estimation method based on template matching and probability distribution

Country Status (1)

Country Link
CN (1) CN115761734A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116188959A (en) * 2023-03-14 2023-05-30 北京未来链技术有限公司 Electronic commerce shopping scene intelligent identification and storage system based on meta universe
CN116386074A (en) * 2023-06-07 2023-07-04 青岛雅筑景观设计有限公司 Intelligent processing and management system for garden engineering design data
CN116386074B (en) * 2023-06-07 2023-08-15 青岛雅筑景观设计有限公司 Intelligent processing and management system for garden engineering design data
CN116758380A (en) * 2023-08-15 2023-09-15 摩尔线程智能科技(北京)有限责任公司 Network training method and device for posture estimation
CN116758380B (en) * 2023-08-15 2023-11-10 摩尔线程智能科技(北京)有限责任公司 Network training method and device for posture estimation
CN117495970A (en) * 2024-01-03 2024-02-02 中国科学技术大学 Template multistage matching-based chemical instrument pose estimation method, equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination