CN115761734A - Object pose estimation method based on template matching and probability distribution - Google Patents

Object pose estimation method based on template matching and probability distribution

Info

Publication number
CN115761734A
Authority
CN
China
Prior art keywords
template
target
network
pixel
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211343422.XA
Other languages
Chinese (zh)
Inventor
柯逍
黄森敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University
Priority to CN202211343422.XA
Publication of CN115761734A
Legal status: Pending (Current)

Abstract

The invention relates to an object pose estimation method based on template matching and probability distribution, which comprises the following steps. Step S1: perform semantic segmentation on the pose estimation training set. Step S2: match the targets detected by the semantic segmentation of step S1 against rendered templates to generate an initial viewpoint estimate. Step S3: learn dense 2D-2D correspondences between the input image pixels and the matched template with a deep learning network, and from them derive 2D-3D correspondences between image pixels and the 3D model. Step S4: generate the six-degree-of-freedom pose of the target object with a differentiable PnP layer, using the probability distribution over poses to guide the final pose solution. The method can effectively estimate the pose of a target in an RGB image.

Description

Object pose estimation method based on template matching and probability distribution
Technical Field
The invention relates to the technical field of pattern recognition and computer vision, in particular to an object pose estimation method based on template matching and probability distribution.
Background
Rapid developments in the field of computer vision have contributed significantly to many tasks in robot operation, including object grasping and detection. Object pose estimation refers to estimating the three-dimensional rotation matrix and three-dimensional translation vector of an object with the camera coordinate frame as the origin. With the accurate pose of an object, fine manipulation of the object can be supported; pose estimation is therefore an important technology in the field of robotic grasping and has become a continuing research hotspot in recent years. In the metaverse, truly and accurately representing the spatial state of an object, including acquiring the position and orientation of a target object, is essential for improving the immersive experience. Because object pose estimation has wide application fields and great commercial value, new pose estimation technologies are continuously being explored in both academia and industry.
Although object pose estimation technology has made great progress, existing pose estimation methods that use only RGB images still face many challenges in real environments. Without the guidance of depth information, such algorithms are easily affected by factors such as the camera viewing angle and changes in illumination, and they must cope with image blurring, objects truncated by the camera field of view, and mutual occlusion between objects, all of which make it very difficult to extract reliable features from images.
Disclosure of Invention
In view of this, the present invention provides an object pose estimation method based on template matching and probability distribution, which can effectively estimate the pose of a target in an RGB image.
In order to achieve the above purpose, the invention adopts the following technical scheme: an object pose estimation method based on template matching and probability distribution, comprising the following steps:
Step S1: performing semantic segmentation on the pose estimation training set;
Step S2: matching the targets detected by the semantic segmentation of step S1 against rendered templates to generate an initial viewpoint estimate;
Step S3: learning dense 2D-2D correspondences between the input image pixels and the matched template with a deep learning network, and from them deriving 2D-3D correspondences between image pixels and the 3D model;
Step S4: generating the six-degree-of-freedom pose of the target object with a differentiable PnP layer, and using the probability distribution over poses to guide the final pose solution.
In a preferred embodiment, step S1 specifically comprises the following steps:
Step S11: acquiring a public pose estimation data set from the Internet to obtain the RGB images and 3D models used for model training;
Step S12: extracting features from an input image I with a pre-trained semantic segmentation network S_FE. First, features are computed at the N depth levels of the network, indexed by k with k ∈ ℕ and 0 ≤ k < N, where ℕ denotes the set of non-negative integers and N denotes the number of depth levels of the semantic segmentation network. The sub-network of S_FE at the k-th depth is denoted S_FE^k and operates on inputs in ℝ^(H_k×W_k×D_k), where ℝ denotes the set of real numbers, H_k and W_k denote the height and width of the image input to the k-th depth network, and D_k denotes the number of channels input to the k-th depth network. The feature extractor spatially average-pools each depth-k feature map along its height and width, producing a single vector of length D_k;
Step S13: describing the textured 3D model with a set of viewpoint-based templates generated by rendering the object, so as to compute a per-pixel correlation between the feature map of the object image and the features of the object descriptor.
First, the RGB image of the input target is converted at the k-th depth into a 3D feature tensor f_k ∈ ℝ^(H_k×W_k×D_k).
The target templates sampled along the x, y and z axes are then converted into the corresponding dense model descriptor o_k ∈ ℝ^(X_k×Y_k×Z_k×D_k), where the model descriptor o_k represents all templates rendered from virtual viewpoints on a sphere around the object at the k-th depth, X_k and Y_k index the camera positions in the target coordinate system at the k-th depth, Z_k indexes the in-plane rotations at the k-th depth, D_k denotes the dimensionality of the features extracted from each template by the segmentation network at the k-th depth, and the templates are rendered from the set of points of the object's 3D model.
Finally, the feature map f_k is matched against the entire object descriptor o_k to produce a correlation tensor of size H_k × W_k × (X_k·Y_k·Z_k). Each pixel of the target image feature map thus obtains the list of correlations between its feature vector and all feature vectors of the descriptor. The correlations are used to compute per-pixel attention over the image, the original image features are combined into the feature tensor, and the obtained attention is used to perform a more accurate segmentation;
Step S14: the decoder uses a UNet-like structure, first upsampling the feature map and then applying convolutional layers, repeating until the original image size is reached, and concatenating (stacking) features at each stage;
Step S15: the network is trained to predict, for each pixel, the probability that it belongs to the visible part of the target, and a binary segmentation mask of the target object is generated and output.
In a preferred embodiment, step S2 specifically comprises the following steps:
Step S21: the initial viewpoint estimate is obtained by template matching, which relies on the same semantic segmentation feature extraction network S_FE as step S1 but uses only the features of its last layer and adds a 1×1 convolution layer to reduce the dimension from H×W×D to H×W×D′, where H and W denote the height and width of the input image, D denotes the number of channels input to the network, and D′ denotes the number of channels after D passes through the 1×1 convolution layer.
Template features t ∈ ℝ^(H×W×D′) are pre-computed on the foreground of the query template, and image features f ∈ ℝ^(H×W×D′) are computed on the target foreground of the target image, represented by the segmentation mask predicted in step S1.
The per-pixel correlation between f and t is then calculated as
sim(f, t) = (1/(H·W)) Σ_{h,w} corr(f_{h,w}, t_{h,w}),
where sim(f, t) denotes the per-pixel correlation between f and t, corr denotes the Pearson correlation, h and w denote pixel coordinates, f_{h,w} denotes the image feature at pixel (h, w), and t_{h,w} denotes the template feature at pixel (h, w);
Step S22: the network is trained to give priority to features of surface regions whose object rotations are very close, while penalizing features with poor rotational similarity, using a modified triplet loss with margin L_triplet to optimize the pose, where f_anchor is the descriptor of a randomly selected target surface region, f_+ denotes a template whose pose is very similar to that of f_anchor, f_- denotes a template whose pose differs from that of f_anchor, and the margin u is set to the angle between the target rotations of f_anchor and f_+.
Finally, at test time, the query template with the highest similarity to the target detected in the target image is selected as the match.
In a preferred embodiment, the specific method of step S3 is as follows:
Step S31: during training, an object crop I_obj and its ground-truth pose G_obj ∈ SE(3) are randomly sampled from the data set, where SE(3) denotes the special Euclidean group, i.e. the three-dimensional transformations of the object, including rotation and translation. Then a random template I_tmp with ground-truth template pose G_tmp ∈ SE(3) is selected such that G_obj and G_tmp are relatively close, and the 2D-3D correspondence map of every pixel in the two patches is further computed.
Let C denote the dense 2D-3D correspondence map of an object rendered in a given pose, which maps each object pixel to the normalized 3D coordinates of the corresponding model point. Its inverse C⁻¹ recomputes the correspondences in non-normalized target coordinates, i.e. the actual 3D target coordinates, and defines the distance of a 2D correspondence pair in the 3D coordinate space of the target model as
d(p, p′) = ||C⁻¹(I_obj)_p − C⁻¹(I_tmp)_{p′}||_2,
where p and p′ are the pixel coordinates in the image crop and the template crop, respectively. Ground-truth dense 2D-2D correspondences are established by matching pixel pairs that correspond to the closest points in the 3D coordinate system of the model; for a point p ∈ I_obj, the corresponding template point is
p* = argmin_{p′ ∈ I_tmp} d(p, p′),
where argmin(·) denotes the argmin function, which returns the value of the variable that minimizes the objective function d(p, p′);
Step S32: 2D-2D correspondences with a large 3D spatial difference are rejected by outlier-aware rejection. The segmentation loss is defined as a per-pixel-block loss, and the Dice loss
L_Dice = 1 − 2|A ∩ B| / (|A| + |B|)
is used to handle the unbalanced class data, where A denotes the segmentation ground truth, B denotes the binary mask predicted by the segmentation network, and |A| and |B| denote the numbers of pixels of the segmentation ground truth and the predicted mask, respectively.
Finally, the network predicts discrete 2D coordinates using a standard per-pixel cross-entropy classification loss.
In a preferred embodiment, step S4 specifically comprises the following steps:
Step S41: 2D-2D correspondences are estimated independently from each of the top S matching templates, pose hypotheses are estimated from them, and matching templates that are too close to each other are removed with a greedy strategy;
Step S42: the non-differentiable deterministic PnP operation is converted into a differentiable probabilistic layer, which gives unprecedented flexibility to end-to-end 2D-3D correspondence learning. The output of PnP is interpreted as a probability distribution parameterized by the learnable 2D-3D correspondences. During training, the Kullback-Leibler divergence between the predicted and target pose distributions is used as the loss function, and a high-quality pose distribution is obtained by efficient Monte Carlo pose sampling; in the inference stage, the derivative of the cost function is regularized, and the optimal pose solution is finally obtained with a PnP optimization solver.
Compared with the prior art, the invention has the following beneficial effects:
1. Using a semantic segmentation network to extract the features of the target object allows the pose estimation problem to be converted into a classification problem, which makes the pose estimation task simpler.
2. Computing pixel-level correlations between image features and descriptor features significantly improves the performance of the network and makes the network aware of the object's three-dimensional feature information at the pixel level.
3. The dense 2D-2D matching network provides robustness against large angular differences between the pose of the target region and the pose in the template, and increases the benefit of template matching.
4. Screening out matching templates with larger errors promotes the convergence of the network, and the differentiable probabilistic layer removes the need for additional explicit depth estimation or pose refinement, which simplifies the network architecture and effectively improves object pose estimation in unconstrained scenes.
Drawings
Fig. 1 is a flow chart of the implementation of the preferred embodiment of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application; as used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
An object pose estimation method based on template matching and probability distribution, referring to Fig. 1, comprises the following steps:
Step S1: performing semantic segmentation on the pose estimation training set;
Step S2: matching the targets detected by the semantic segmentation of step S1 against rendered templates to generate an initial viewpoint estimate;
Step S3: learning dense 2D-2D correspondences between the input image pixels and the matched template with a deep learning network, and from them deriving 2D-3D correspondences between image pixels and the 3D model;
Step S4: generating the six-degree-of-freedom pose of the target object with a differentiable PnP layer, and using the probability distribution over poses to guide the final pose solution.
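The six-degree-of-freedom information referred to in step S4 is an element of SE(3), i.e. a 3D rotation together with a 3D translation. Purely as an illustration of this representation (and not of the patented method itself), the following Python sketch builds the conventional 4×4 homogeneous pose matrix from an axis-angle rotation and a translation vector and applies it to one model point; all numeric values are arbitrary examples.

```python
import numpy as np

def axis_angle_to_matrix(rotvec: np.ndarray) -> np.ndarray:
    """Rodrigues' formula: axis-angle vector -> 3x3 rotation matrix."""
    theta = np.linalg.norm(rotvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rotvec / theta                      # unit rotation axis
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])        # cross-product matrix of k
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def se3_matrix(rotvec: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Pack rotation and translation into a 4x4 homogeneous SE(3) matrix."""
    T = np.eye(4)
    T[:3, :3] = axis_angle_to_matrix(rotvec)
    T[:3, 3] = t
    return T

if __name__ == "__main__":
    # Example 6-DoF pose: 30 degrees about z, translated along x and z.
    pose = se3_matrix(np.array([0.0, 0.0, np.deg2rad(30)]),
                      np.array([0.5, 0.0, 1.0]))
    model_point = np.array([0.1, 0.0, 0.0, 1.0])   # homogeneous 3D model point
    print(pose @ model_point)                      # point in camera coordinates
```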
Step S1 specifically comprises the following steps:
Step S11: acquiring a public pose estimation data set from the Internet to obtain the RGB images and 3D models used for model training;
Step S12: extracting features from an input image I with a pre-trained semantic segmentation network S_FE. First, features are computed at the N depth levels of the network, indexed by k with k ∈ ℕ and 0 ≤ k < N, where ℕ denotes the set of non-negative integers and N denotes the number of depth levels of the semantic segmentation network. The sub-network of S_FE at the k-th depth is denoted S_FE^k and operates on inputs in ℝ^(H_k×W_k×D_k), where ℝ denotes the set of real numbers, H_k and W_k denote the height and width of the image input to the k-th depth network, and D_k denotes the number of channels input to the k-th depth network. The feature extractor spatially average-pools each depth-k feature map along its height and width, producing a single vector of length D_k;
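A minimal sketch of the multi-depth feature extraction of step S12 is given below, assuming a toy convolutional backbone: the number of depth levels, the channel widths and the pooling operator are illustrative assumptions rather than the network S_FE actually used. Each level halves the spatial resolution, and averaging each feature map over its height and width yields the single vector of length D_k mentioned above.

```python
import torch
import torch.nn as nn

class MultiDepthExtractor(nn.Module):
    """Toy stand-in for a multi-depth encoder: N levels, each halving H and W."""
    def __init__(self, in_channels: int = 3, widths=(32, 64, 128, 256)):
        super().__init__()
        blocks, prev = [], in_channels
        for d in widths:                               # one block per depth level k
            blocks.append(nn.Sequential(
                nn.Conv2d(prev, d, 3, stride=2, padding=1),
                nn.BatchNorm2d(d),
                nn.ReLU(inplace=True)))
            prev = d
        self.blocks = nn.ModuleList(blocks)

    def forward(self, image: torch.Tensor):
        feats, pooled, x = [], [], image
        for block in self.blocks:
            x = block(x)                               # f_k: (B, D_k, H_k, W_k)
            feats.append(x)
            pooled.append(x.mean(dim=(2, 3)))          # average over H_k, W_k -> vector of length D_k
        return feats, pooled

if __name__ == "__main__":
    net = MultiDepthExtractor()
    feats, pooled = net(torch.randn(1, 3, 128, 128))
    print([f.shape for f in feats])    # per-depth feature maps f_k
    print([p.shape for p in pooled])   # per-depth pooled vectors of length D_k
```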
Step S13: describing the textured 3D model with a set of viewpoint-based templates generated by rendering the object, so as to compute a per-pixel correlation between the feature map of the object image and the features of the object descriptor.
First, the RGB image of the input target is converted at the k-th depth into a 3D feature tensor f_k ∈ ℝ^(H_k×W_k×D_k).
The target templates sampled along the x, y and z axes are then converted into the corresponding dense model descriptor o_k ∈ ℝ^(X_k×Y_k×Z_k×D_k), where the model descriptor o_k represents all templates rendered from virtual viewpoints on a sphere around the object at the k-th depth, X_k and Y_k index the camera positions in the target coordinate system at the k-th depth, Z_k indexes the in-plane rotations at the k-th depth, D_k denotes the dimensionality of the features extracted from each template by the segmentation network at the k-th depth, and the templates are rendered from the set of points of the object's 3D model.
Finally, the feature map f_k is matched against the entire object descriptor o_k to produce a correlation tensor of size H_k × W_k × (X_k·Y_k·Z_k). Each pixel of the target image feature map thus obtains the list of correlations between its feature vector and all feature vectors of the descriptor. The correlations are used to compute per-pixel attention over the image, the original image features are combined into the feature tensor, and the obtained attention is used to perform a more accurate segmentation;
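The correlation tensor of step S13 can be sketched as follows: each pixel feature of f_k is compared with every viewpoint descriptor vector of o_k, producing a tensor of shape H_k × W_k × (X_k·Y_k·Z_k). The cosine-style similarity and the tensor shapes below are assumptions for illustration, not the exact formulation of the invention.

```python
import torch
import torch.nn.functional as F

def correlation_tensor(f_k: torch.Tensor, o_k: torch.Tensor) -> torch.Tensor:
    """
    f_k : (H, W, D)      per-pixel image features at depth k
    o_k : (X, Y, Z, D)   descriptor vectors for all rendered viewpoints
    returns (H, W, X*Y*Z) correlations of each pixel with each viewpoint.
    """
    H, W, D = f_k.shape
    views = o_k.reshape(-1, D)                         # (V, D), V = X*Y*Z
    f = F.normalize(f_k.reshape(-1, D), dim=1)         # unit-length pixel features
    v = F.normalize(views, dim=1)                      # unit-length viewpoint features
    corr = f @ v.t()                                   # (H*W, V) cosine similarities
    return corr.reshape(H, W, -1)

if __name__ == "__main__":
    f_k = torch.randn(16, 16, 64)
    o_k = torch.randn(8, 8, 4, 64)                     # 8x8 viewpoints, 4 in-plane rotations
    print(correlation_tensor(f_k, o_k).shape)          # torch.Size([16, 16, 256])
```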
Step S14: the decoder uses a UNet-like structure, first upsampling the feature map and then applying convolutional layers, repeating until the original image size is reached, and concatenating (stacking) features at each stage;
Step S15: the network is trained to predict, for each pixel, the probability that it belongs to the visible part of the target, and a binary segmentation mask of the target object is generated and output.
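Steps S14 and S15 describe a UNet-like decoder that upsamples, concatenates encoder features and convolves until the input resolution is reached, then predicts a per-pixel foreground probability. The sketch below follows this generic pattern; the channel widths, the bilinear upsampling and the sigmoid mask head are assumptions, not the specific architecture of the invention.

```python
import torch
import torch.nn as nn

class DecoderStage(nn.Module):
    """One decoder stage: upsample, concatenate the skip feature, convolve."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True))

    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)      # "stacking" of decoder and encoder features
        return self.conv(x)

class MaskHead(nn.Module):
    """Final 1x1 convolution + sigmoid giving per-pixel foreground probability."""
    def __init__(self, ch: int):
        super().__init__()
        self.out = nn.Conv2d(ch, 1, 1)

    def forward(self, x):
        return torch.sigmoid(self.out(x))    # probability that the pixel belongs to the visible target

if __name__ == "__main__":
    stage = DecoderStage(in_ch=128, skip_ch=64, out_ch=64)
    head = MaskHead(64)
    deep, skip = torch.randn(1, 128, 16, 16), torch.randn(1, 64, 32, 32)
    mask = head(stage(deep, skip))
    print(mask.shape)                        # (1, 1, 32, 32) mask probabilities
```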
Step S2 specifically comprises the following steps:
Step S21: the initial viewpoint estimate is obtained by template matching, which relies on the same semantic segmentation feature extraction network S_FE as step S1 but uses only the features of its last layer and adds a 1×1 convolution layer to reduce the dimension from H×W×D to H×W×D′, where H and W denote the height and width of the input image, D denotes the number of channels input to the network, and D′ denotes the number of channels after D passes through the 1×1 convolution layer.
Template features t ∈ ℝ^(H×W×D′) are pre-computed on the foreground of the query template, and image features f ∈ ℝ^(H×W×D′) are computed on the target foreground of the target image, represented by the segmentation mask predicted in step S1.
The per-pixel correlation between f and t is then calculated as
sim(f, t) = (1/(H·W)) Σ_{h,w} corr(f_{h,w}, t_{h,w}),
where sim(f, t) denotes the per-pixel correlation between f and t, corr denotes the Pearson correlation, h and w denote pixel coordinates, f_{h,w} denotes the image feature at pixel (h, w), and t_{h,w} denotes the template feature at pixel (h, w);
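The per-pixel Pearson correlation of step S21 can be sketched as below, assuming the scores corr(f_{h,w}, t_{h,w}) are averaged over the foreground pixels given by the step-S1 mask; the averaging and the foreground restriction are assumptions for illustration.

```python
import torch

def pearson(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pearson correlation between two 1-D feature vectors."""
    a = a - a.mean()
    b = b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm() + 1e-8)

def sim(f: torch.Tensor, t: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """
    f, t : (H, W, D') image and template features
    mask : (H, W) boolean foreground mask from step S1
    Returns the mean per-pixel Pearson correlation over foreground pixels.
    """
    scores = []
    for h, w in mask.nonzero(as_tuple=False):
        scores.append(pearson(f[h, w], t[h, w]))
    return torch.stack(scores).mean()

if __name__ == "__main__":
    f, t = torch.randn(8, 8, 32), torch.randn(8, 8, 32)
    mask = torch.zeros(8, 8, dtype=torch.bool)
    mask[2:6, 2:6] = True
    print(sim(f, t, mask))     # higher values = better template match
```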
Step S22: the network is trained to give priority to features of surface regions whose object rotations are very close, while penalizing features with poor rotational similarity, using a modified triplet loss with margin L_triplet to optimize the pose, where f_anchor is the descriptor of a randomly selected target surface region, f_+ denotes a template whose pose is very similar to that of f_anchor, f_- denotes a template whose pose differs from that of f_anchor, and the margin u is set to the angle between the target rotations of f_anchor and f_+.
Finally, at test time, the query template with the highest similarity to the target detected in the target image is selected as the match.
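The exact form of the modified triplet loss with margin in step S22 is not reproduced in this text; the sketch below implements one common dynamic-margin variant in which the margin grows with the rotation angle u between the anchor pose and the positive-template pose. Both the functional form and the margin scale are assumptions.

```python
import torch

def dynamic_margin_triplet(f_anchor, f_pos, f_neg, angle_u, scale=0.01):
    """
    f_anchor, f_pos, f_neg : (D,) descriptors of the anchor region, the
                             similar-pose template and the different-pose template.
    angle_u                : rotation angle (radians) between the anchor pose
                             and the positive-template pose, used as margin.
    """
    d_pos = (f_anchor - f_pos).pow(2).sum()
    d_neg = (f_anchor - f_neg).pow(2).sum()
    margin = scale * angle_u          # assumed margin schedule
    return torch.clamp(d_pos - d_neg + margin, min=0.0)

if __name__ == "__main__":
    f_a, f_p, f_n = torch.randn(3, 128).unbind(0)
    print(dynamic_margin_triplet(f_a, f_p, f_n, angle_u=torch.tensor(0.2)))
```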
The specific method of step S3 is as follows:
Step S31: during training, an object crop I_obj and its ground-truth pose G_obj ∈ SE(3) are randomly sampled from the data set, where SE(3) denotes the special Euclidean group, i.e. the three-dimensional transformations of the object, including rotation and translation. Then a random template I_tmp with ground-truth template pose G_tmp ∈ SE(3) is selected such that G_obj and G_tmp are relatively close, and the 2D-3D correspondence map of every pixel in the two patches is further computed.
Let C denote the dense 2D-3D correspondence map of an object rendered in a given pose, which maps each object pixel to the normalized 3D coordinates of the corresponding model point. Its inverse C⁻¹ recomputes the correspondences in non-normalized target coordinates, i.e. the actual 3D target coordinates, and defines the distance of a 2D correspondence pair in the 3D coordinate space of the target model as
d(p, p′) = ||C⁻¹(I_obj)_p − C⁻¹(I_tmp)_{p′}||_2,
where p and p′ are the pixel coordinates in the image crop and the template crop, respectively. Ground-truth dense 2D-2D correspondences are established by matching pixel pairs that correspond to the closest points in the 3D coordinate system of the model; for a point p ∈ I_obj, the corresponding template point is
p* = argmin_{p′ ∈ I_tmp} d(p, p′),
where argmin(·) denotes the argmin function, which returns the value of the variable that minimizes the objective function d(p, p′);
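The ground-truth 2D-2D correspondences of step S31 pair each object pixel with the template pixel whose 3D model coordinate (given by the C⁻¹ map) is closest, and pairs with a large 3D difference are rejected as in step S32. The sketch below assumes the per-pixel 3D coordinate maps are already available as arrays and uses a brute-force nearest-neighbour search and an arbitrary rejection threshold for illustration.

```python
import numpy as np

def dense_2d2d_matches(xyz_obj: np.ndarray, mask_obj: np.ndarray,
                       xyz_tmp: np.ndarray, mask_tmp: np.ndarray,
                       max_dist: float = 0.01):
    """
    xyz_obj, xyz_tmp   : (H, W, 3) per-pixel 3D model coordinates (C^-1 maps)
    mask_obj, mask_tmp : (H, W) boolean foreground masks
    Returns a list of ((h, w), (h', w')) pixel pairs whose 3D points are
    closest and within max_dist (outlier-aware rejection).
    """
    obj_px = np.argwhere(mask_obj)                  # (N, 2) pixel coords in the crop
    tmp_px = np.argwhere(mask_tmp)                  # (M, 2) pixel coords in the template
    obj_xyz = xyz_obj[mask_obj]                     # (N, 3)
    tmp_xyz = xyz_tmp[mask_tmp]                     # (M, 3)
    matches = []
    for i, p3d in enumerate(obj_xyz):
        d = np.linalg.norm(tmp_xyz - p3d, axis=1)   # distances d(p, p') in model space
        j = int(np.argmin(d))
        if d[j] <= max_dist:                        # reject pairs with large 3D difference
            matches.append((tuple(obj_px[i]), tuple(tmp_px[j])))
    return matches

if __name__ == "__main__":
    xyz = np.random.rand(16, 16, 3)
    mask = np.zeros((16, 16), dtype=bool)
    mask[4:12, 4:12] = True
    print(len(dense_2d2d_matches(xyz, mask, xyz + 0.001, mask)))
```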
Step S32: 2D-2D correspondences with a large 3D spatial difference are rejected by outlier-aware rejection. The segmentation loss is defined as a per-pixel-block loss, and the Dice loss
L_Dice = 1 − 2|A ∩ B| / (|A| + |B|)
is used to handle the unbalanced class data, where A denotes the segmentation ground truth, B denotes the binary mask predicted by the segmentation network, and |A| and |B| denote the numbers of pixels of the segmentation ground truth and the predicted mask, respectively.
Finally, the network predicts discrete 2D coordinates using a standard per-pixel cross-entropy classification loss.
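The Dice loss of step S32 measures the overlap between the predicted mask B and the ground-truth mask A independently of the class balance. A soft, differentiable variant is sketched below; the smoothing constant eps is an implementation choice, not part of the invention.

```python
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """
    pred   : (B, H, W) predicted foreground probabilities (mask B)
    target : (B, H, W) binary ground-truth mask (mask A)
    Returns 1 - 2|A∩B| / (|A| + |B|), averaged over the batch.
    """
    inter = (pred * target).sum(dim=(1, 2))          # soft |A ∩ B|
    total = pred.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    dice = (2 * inter + eps) / (total + eps)
    return (1 - dice).mean()

if __name__ == "__main__":
    pred = torch.rand(2, 32, 32)
    target = (torch.rand(2, 32, 32) > 0.5).float()
    print(dice_loss(pred, target))
```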
Step S4 specifically comprises the following steps:
Step S41: 2D-2D correspondences are estimated independently from each of the top S matching templates, pose hypotheses are generated from them, and matching templates that are too close to each other are removed with a greedy strategy;
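Step S41 keeps only matching templates whose poses are sufficiently distinct. One possible greedy strategy (an assumption about the exact selection rule) is to visit the hypotheses in order of matching score and discard any whose rotation lies within a threshold angle of an already accepted hypothesis, as sketched below.

```python
import numpy as np

def rotation_angle(R1: np.ndarray, R2: np.ndarray) -> float:
    """Geodesic angle (radians) between two 3x3 rotation matrices."""
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def greedy_select(rotations, scores, min_angle=np.deg2rad(15), top_s=5):
    """Keep at most top_s hypotheses, best score first, pairwise >= min_angle apart."""
    order = np.argsort(scores)[::-1]                 # best matching score first
    kept = []
    for idx in order:
        if all(rotation_angle(rotations[idx], rotations[k]) >= min_angle for k in kept):
            kept.append(idx)
        if len(kept) == top_s:
            break
    return kept

if __name__ == "__main__":
    rots = [np.eye(3) for _ in range(4)]             # toy input: identical rotations
    print(greedy_select(rots, scores=[0.9, 0.8, 0.7, 0.95]))   # keeps only one hypothesis
```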
Step S42: the non-differentiable deterministic PnP operation is converted into a differentiable probabilistic layer, which gives unprecedented flexibility to end-to-end 2D-3D correspondence learning. The output of PnP is interpreted as a probability distribution parameterized by the learnable 2D-3D correspondences. During training, the Kullback-Leibler divergence between the predicted and target pose distributions is used as the loss function, and a high-quality pose distribution is obtained by efficient Monte Carlo pose sampling; in the inference stage, the derivative of the cost function is regularized, and the optimal pose solution is finally obtained with a PnP optimization solver.
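Step S42 treats PnP as a probabilistic layer: reprojection errors define an unnormalized density over poses, the partition function is approximated with Monte Carlo pose samples, and the Kullback-Leibler divergence to the target pose distribution is minimized, which for a single ground-truth pose reduces to a negative log-likelihood. The sketch below is a strongly simplified illustration of that loss; the pinhole projection model, the sampling scheme and the learnable per-correspondence weights are all assumptions.

```python
import torch

def project(points, pose_R, pose_t, K):
    """Pinhole projection of (N,3) model points under rotation R, translation t."""
    cam = points @ pose_R.T + pose_t            # (N, 3) points in the camera frame
    uv = cam[:, :2] / cam[:, 2:3]               # perspective division
    return uv @ K[:2, :2].T + K[:2, 2]          # apply focal lengths and principal point

def pose_cost(x3d, x2d, w, R, t, K):
    """Weighted sum of squared reprojection errors for one pose hypothesis."""
    err = project(x3d, R, t, K) - x2d
    return (w * err.pow(2).sum(dim=1)).sum()

def monte_carlo_kl_loss(x3d, x2d, w, K, R_gt, t_gt, pose_samples):
    """
    KL-style loss for a single ground-truth pose: negative log of the
    Monte-Carlo-normalized density that the probabilistic PnP layer
    assigns to the true pose.  pose_samples: list of (R, t) hypotheses
    used to approximate the partition function.
    """
    cost_gt = pose_cost(x3d, x2d, w, R_gt, t_gt, K)
    sample_costs = torch.stack([pose_cost(x3d, x2d, w, R, t, K) for R, t in pose_samples])
    log_partition = torch.logsumexp(-sample_costs, dim=0)
    return cost_gt + log_partition              # -log q(gt) + log Z_hat

if __name__ == "__main__":
    N = 50
    x3d = torch.randn(N, 3) * 0.05
    K = torch.tensor([[500.0, 0.0, 64.0], [0.0, 500.0, 64.0], [0.0, 0.0, 1.0]])
    R_gt, t_gt = torch.eye(3), torch.tensor([0.0, 0.0, 0.5])
    x2d = project(x3d, R_gt, t_gt, K)
    w = torch.full((N,), 1.0, requires_grad=True)   # learnable correspondence weights
    samples = [(torch.eye(3), t_gt + 0.01 * torch.randn(3)) for _ in range(32)]
    loss = monte_carlo_kl_loss(x3d, x2d, w, K, R_gt, t_gt, samples)
    loss.backward()                                  # gradients flow back to the 2D-3D weights
    print(loss.item(), w.grad is not None)
```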
The above are preferred embodiments of the present invention. All changes made according to the technical scheme of the present invention that produce equivalent functional effects and do not exceed the scope of the technical scheme of the present invention belong to the protection scope of the present invention.

Claims (5)

1. An object pose estimation method based on template matching and probability distribution, characterized by comprising the following steps:
step S1: performing semantic segmentation on the pose estimation training set;
step S2: matching the targets detected by the semantic segmentation of step S1 against rendered templates to generate an initial viewpoint estimate;
step S3: learning dense 2D-2D correspondences between the input image pixels and the matched template with a deep learning network, and from them deriving 2D-3D correspondences between image pixels and the 3D model;
step S4: generating the six-degree-of-freedom pose of the target object with a differentiable PnP layer, and using the probability distribution over poses to guide the final pose solution.
2. The object pose estimation method based on template matching and probability distribution according to claim 1, wherein step S1 specifically comprises the following steps:
step S11: acquiring a public pose estimation data set from the Internet to obtain the RGB images and 3D models used for model training;
step S12: extracting features from an input image I with a pre-trained semantic segmentation network S_FE; first, features are computed at the N depth levels of the network, indexed by k with k ∈ ℕ and 0 ≤ k < N, wherein ℕ denotes the set of non-negative integers and N denotes the number of depth levels of the semantic segmentation network; the sub-network of S_FE at the k-th depth is denoted S_FE^k and operates on inputs in ℝ^(H_k×W_k×D_k), wherein ℝ denotes the set of real numbers, H_k and W_k denote the height and width of the image input to the k-th depth network, and D_k denotes the number of channels input to the k-th depth network; the feature extractor spatially average-pools each depth-k feature map along its height and width to produce a single vector of length D_k;
step S13: describing the textured 3D model with a set of viewpoint-based templates generated by rendering the object, so as to compute a per-pixel correlation between the feature map of the object image and the features of the object descriptor;
first, the RGB image of the input target is converted at the k-th depth into a 3D feature tensor f_k ∈ ℝ^(H_k×W_k×D_k);
the target templates sampled along the x, y and z axes are then converted into the corresponding dense model descriptor o_k ∈ ℝ^(X_k×Y_k×Z_k×D_k), wherein the model descriptor o_k represents all templates rendered from virtual viewpoints on a sphere around the object at the k-th depth, X_k and Y_k index the camera positions in the target coordinate system at the k-th depth, Z_k indexes the in-plane rotations at the k-th depth, D_k denotes the dimensionality of the features extracted from each template by the segmentation network at the k-th depth, and the templates are rendered from the set of points of the object's 3D model;
finally, the feature map f_k is matched against the entire object descriptor o_k to produce a correlation tensor of size H_k × W_k × (X_k·Y_k·Z_k); each pixel of the target image feature map thus obtains the list of correlations between its feature vector and all feature vectors of the descriptor; the correlations are used to compute per-pixel attention over the image, the original image features are combined into the feature tensor, and the obtained attention is used to perform a more accurate segmentation;
step S14: the decoder uses a UNet-like structure, first upsampling the feature map and then applying convolutional layers, repeating until the original image size is reached, and concatenating (stacking) features at each stage;
step S15: the network is trained to predict, for each pixel, the probability that it belongs to the visible part of the target, and a binary segmentation mask of the target object is generated and output.
3. The object pose estimation method based on template matching and probability distribution according to claim 1, wherein step S2 specifically comprises the following steps:
step S21: the initial viewpoint estimate is obtained by template matching, which relies on the same semantic segmentation feature extraction network S_FE as step S1 but uses only the features of its last layer and adds a 1×1 convolution layer to reduce the dimension from H×W×D to H×W×D′, wherein H and W denote the height and width of the input image, D denotes the number of channels input to the network, and D′ denotes the number of channels after D passes through the 1×1 convolution layer;
template features t ∈ ℝ^(H×W×D′) are pre-computed on the foreground of the query template, and image features f ∈ ℝ^(H×W×D′) are computed on the target foreground of the target image represented by the segmentation mask predicted in step S1;
the per-pixel correlation between f and t is then calculated as sim(f, t) = (1/(H·W)) Σ_{h,w} corr(f_{h,w}, t_{h,w}), wherein sim(f, t) denotes the per-pixel correlation between f and t, corr denotes the Pearson correlation, h and w denote pixel coordinates, f_{h,w} denotes the image feature at pixel (h, w), and t_{h,w} denotes the template feature at pixel (h, w);
step S22: the network is trained to give priority to features of surface regions whose object rotations are very close while penalizing features with poor rotational similarity, using a modified triplet loss with margin L_triplet to optimize the pose, wherein f_anchor is the descriptor of a randomly selected target surface region, f_+ denotes a template whose pose is very similar to that of f_anchor, f_- denotes a template whose pose differs from that of f_anchor, and the margin u is set to the angle between the target rotations of f_anchor and f_+;
finally, at test time, the query template with the highest similarity to the target detected in the target image is selected as the match.
4. The object pose estimation method based on template matching and probability distribution according to claim 1, wherein the specific method of step S3 is as follows:
step S31: during training, an object crop I_obj and its ground-truth pose G_obj ∈ SE(3) are randomly sampled from the data set, wherein SE(3) denotes the special Euclidean group, i.e. the three-dimensional transformations of the object, including rotation and translation; then a random template I_tmp with ground-truth template pose G_tmp ∈ SE(3) is selected such that G_obj and G_tmp are relatively close, and the 2D-3D correspondence map of every pixel in the two patches is further computed;
let C denote the dense 2D-3D correspondence map of an object rendered in a given pose, which maps each object pixel to the normalized 3D coordinates of the corresponding model point; its inverse C⁻¹ recomputes the correspondences in non-normalized target coordinates, i.e. the actual 3D target coordinates, and defines the distance of a 2D correspondence pair in the 3D coordinate space of the target model as d(p, p′) = ||C⁻¹(I_obj)_p − C⁻¹(I_tmp)_{p′}||_2, wherein p and p′ are the pixel coordinates in the image crop and the template crop, respectively; ground-truth dense 2D-2D correspondences are established by matching pixel pairs that correspond to the closest points in the 3D coordinate system of the model; for a point p ∈ I_obj, the corresponding template point is p* = argmin_{p′ ∈ I_tmp} d(p, p′), wherein argmin(·) denotes the argmin function, which returns the value of the variable that minimizes the objective function d(p, p′);
step S32: 2D-2D correspondences with a large 3D spatial difference are rejected by outlier-aware rejection; the segmentation loss is defined as a per-pixel-block loss, and the Dice loss L_Dice = 1 − 2|A ∩ B| / (|A| + |B|) is used to handle the unbalanced class data, wherein A denotes the segmentation ground truth, B denotes the binary mask predicted by the segmentation network, and |A| and |B| denote the numbers of pixels of the segmentation ground truth and the predicted mask, respectively;
finally, the network predicts discrete 2D coordinates using a standard per-pixel cross-entropy classification loss.
5. The object pose estimation method based on template matching and probability distribution according to claim 1, wherein step S4 specifically comprises the following steps:
step S41: 2D-2D correspondences are estimated independently from each of the top S matching templates, pose hypotheses are generated from them, and matching templates that are too close to each other are removed with a greedy strategy;
step S42: the non-differentiable deterministic PnP operation is converted into a differentiable probabilistic layer, which gives unprecedented flexibility to end-to-end 2D-3D correspondence learning; the output of PnP is interpreted as a probability distribution parameterized by the learnable 2D-3D correspondences; during training, the Kullback-Leibler divergence between the predicted and target pose distributions is used as the loss function, and a high-quality pose distribution is obtained by efficient Monte Carlo pose sampling; in the inference stage, the derivative of the cost function is regularized, and the optimal pose solution is finally obtained with a PnP optimization solver.
CN202211343422.XA 2022-10-29 2022-10-29 Object pose estimation method based on template matching and probability distribution Pending CN115761734A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211343422.XA CN115761734A (en) 2022-10-29 2022-10-29 Object pose estimation method based on template matching and probability distribution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211343422.XA CN115761734A (en) 2022-10-29 2022-10-29 Object pose estimation method based on template matching and probability distribution

Publications (1)

Publication Number Publication Date
CN115761734A true CN115761734A (en) 2023-03-07

Family

ID=85354355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211343422.XA Pending CN115761734A (en) 2022-10-29 2022-10-29 Object pose estimation method based on template matching and probability distribution

Country Status (1)

Country Link
CN (1) CN115761734A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116188959A (en) * 2023-03-14 2023-05-30 北京未来链技术有限公司 Electronic commerce shopping scene intelligent identification and storage system based on meta universe
CN116386074A (en) * 2023-06-07 2023-07-04 青岛雅筑景观设计有限公司 Intelligent processing and management system for garden engineering design data
CN116386074B (en) * 2023-06-07 2023-08-15 青岛雅筑景观设计有限公司 Intelligent processing and management system for garden engineering design data
CN116758380A (en) * 2023-08-15 2023-09-15 摩尔线程智能科技(北京)有限责任公司 Network training method and device for posture estimation
CN116758380B (en) * 2023-08-15 2023-11-10 摩尔线程智能科技(北京)有限责任公司 Network training method and device for posture estimation
CN117495970A (en) * 2024-01-03 2024-02-02 中国科学技术大学 Template multistage matching-based chemical instrument pose estimation method, equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination