CN115331301A - 6D pose estimation method based on Transformer - Google Patents

6D pose estimation method based on Transformer

Info

Publication number
CN115331301A
Authority
CN
China
Prior art keywords
key point
dimensional
image
key
key points
Legal status
Pending
Application number
CN202210759936.7A
Other languages
Chinese (zh)
Inventor
赵国英 (Zhao Guoying)
姜媛 (Jiang Yuan)
赵万青 (Zhao Wanqing)
张少博 (Zhang Shaobo)
彭先霖 (Peng Xianlin)
李斌 (Li Bin)
汪霖 (Wang Lin)
王珺 (Wang Jun)
彭进业 (Peng Jinye)
Current Assignee
Northwest University
Original Assignee
Northwest University
Priority date
2022-06-29
Filing date
2022-06-29
Publication date
2022-11-11
Application filed by Northwest University
Priority to CN202210759936.7A
Publication of CN115331301A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10024 Color image
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a Transformer-based 6D pose estimation method. The method adopts a pose estimation network comprising a Transformer-based object two-dimensional key point feature extraction module, a key point structure modeling module and a pose inference module, and computes the 6D pose from the correspondence between the two-dimensional key points and the three-dimensional key points of the target object using a PnP algorithm. The Transformer-based two-dimensional key point feature extraction module extracts two-dimensional key point features of the target object from an RGB image; the key point structure modeling module uses self-attention to learn the structural relationships and context information of the extracted key point features and to predict the key point coordinates; the pose inference module computes the pose of the target object from the predicted two-dimensional key points and the three-dimensional key points of the target object using the PnP algorithm. By exploiting the expressiveness and efficiency of the Transformer structure, a pose estimation network with a pure Transformer architecture is realized, and the accuracy of 6D pose estimation is improved by fully using the geometric invariance of the key point structure; the accuracy exceeds that of CNN-based pose estimation networks.

Description

6D pose estimation method based on Transformer
Technical Field
The invention belongs to the field of image detection, and particularly relates to a Transformer-based 6D pose estimation method for efficiently and accurately estimating the three-dimensional pose of an object in an RGB image.
Background
6D pose estimation of an object refers to detecting an object appearing in an image and estimating its 3D position and orientation. Object pose estimation from a single image has long been an important research topic in computer vision; moreover, 6D object pose estimation is crucial to augmented reality, virtual reality, robotic grasping and autonomous (unmanned) driving.
At present, 6D pose estimation algorithms based on depth images have achieved good results, but acquiring depth images depends on an RGB-D camera. Image acquisition with an RGB-D camera is limited by factors such as resolution, field of view and frame rate, and such devices are too bulky to be integrated into wearable equipment for real-time pose estimation of moving objects.
Therefore, schemes based on RGB images are widely studied. Conventional 6D pose estimation methods are mainly divided into feature-point-based methods and template-based methods, but these schemes still have many limitations: for example, feature-point-based methods cannot handle pose estimation of texture-less objects, and template-based methods cannot handle pose estimation of occluded objects.
With the advent of deep learning, and in particular Convolutional Neural Networks (CNNs), the accuracy and robustness of monocular 6D object pose estimation have kept improving, sometimes even exceeding methods that rely on depth data. Most existing pose estimation algorithms use a convolutional neural network to regress the object pose directly or indirectly, but they still face many problems. Algorithms that regress the pose directly require a large number of trained parameters and yield slightly lower pose accuracy. Two-stage pose estimation algorithms compute the object pose from object key points and camera parameters; although their accuracy is higher than that of direct regression, they do not model the structural relationships between the object key points.
Disclosure of Invention
In view of the above drawbacks and deficiencies of the prior art, an object of the present invention is to provide a Transformer-based 6D pose estimation method for efficiently and accurately estimating the three-dimensional pose of an object in an RGB image.
In order to accomplish this task, the invention adopts the following technical solution:
A Transformer-based 6D pose estimation method, characterized in that the method adopts a pose estimation network comprising a Transformer-based object two-dimensional key point feature extraction module, a key point structure modeling module and a pose inference module, and computes the 6D pose from the correspondence between two-dimensional key points and the three-dimensional key points of the target object using a PnP algorithm; wherein:
the Transformer-based object two-dimensional key point feature extraction module extracts two-dimensional key point features of the target object from an RGB image; for an input target object image, the two-dimensional key point feature extraction module outputs a group of key point feature vectors representing the eight key point features extracted from the image;
the key point structure modeling module comprises a self-attention layer and a multi-layer perceptron layer; the self-attention layer learns the structural relationships and context information of the predicted key point features, and the multi-layer perceptron layer maps the relation-modeled feature vectors to two-dimensional coordinate points on the image, also called two-dimensional key points;
the pose inference module computes the pose of the target object from the predicted two-dimensional key points of the image and the three-dimensional key points of the target object using the PnP algorithm, and outputs the computed pose as a rotation matrix and a translation vector.
According to the invention, the method is specifically implemented according to the following steps:
Step 1: acquire a number of two-dimensional image groups containing the targets to be recognized; the image groups cover fifteen target objects, and each target object has more than one thousand RGB images that differ only in acquisition angle, giving the training image set;
Step 2: data preprocessing: crop the images in the original data set so that the object appears completely in the image; the size of the cropped image is 256 × 256;
Step 3: input the training image set into the Transformer-based object two-dimensional key point feature extraction module to obtain the key point features in the image;
Step 4: input the obtained key point features into the key point structure modeling module; the key point structure modeling module exploits the geometric relationships of the key points of the same object under different viewing angles and uses a self-attention structure to let the eight key point features extracted for the same object interact, so as to extract two-dimensional key point features with geometric consistency for a single object; the features after interaction are fed into a multi-layer perceptron layer to obtain the two-dimensional key point coordinates;
Step 5: the pose inference module infers the final pose of the object from the extracted two-dimensional key point coordinates and the three-dimensional key points of the object model through the PnP algorithm (an end-to-end sketch of these five steps follows below).
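As a rough illustration only, the following Python sketch chains the three modules for one test image. The class names for the feature extractor and the structure modeling module (KeypointViT and KeypointStructureModule, sketched further below), the plain 256 × 256 resize used in place of the crop, and the direct call to OpenCV's solvePnP are assumptions of this sketch rather than details fixed by the patent.

```python
import cv2
import numpy as np
import torch

@torch.no_grad()
def estimate_pose(image_bgr, extractor, structure_module, keypoints_3d, camera_matrix):
    """Steps 2-5 for one image: preprocess, extract key point features,
    regress 2D key points, then recover the 6D pose with PnP."""
    crop = cv2.resize(image_bgr, (256, 256))                    # step 2 (simplified crop)
    x = torch.from_numpy(crop).float().permute(2, 0, 1)[None] / 255.0
    feats = extractor(x)                                        # step 3: (1, 8, D) key point features
    kps_2d = structure_module(feats)[0].cpu().numpy()           # step 4: (8, 2) 2D coordinates
    _, rvec, tvec = cv2.solvePnP(keypoints_3d.astype(np.float64),
                                 kps_2d.astype(np.float64),
                                 camera_matrix, None)           # step 5: PnP on 2D-3D pairs
    R, _ = cv2.Rodrigues(rvec)                                  # 3x3 rotation matrix
    return R, tvec                                              # rotation and translation
```

In practice the crop in step 2 would be driven by a detector or a ground-truth bounding box, and the camera matrix must correspond to the cropped image coordinates.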
Specifically, the construction of the Transformer-based key point feature extraction module comprises the following steps:
Step 101: input a training image into the network;
Step 102: divide the target image into patches for serialization;
Step 103: add position embeddings to the image patches;
Step 104: predefine J learnable d-dimensional key point embedding vectors;
Step 105: feed the image patches and the d-dimensional key point embedding vectors as input into the Transformer encoder structure;
Step 106: the Transformer encoder structure outputs the two-dimensional key point features extracted after learning.
Specifically, the construction of the key point structure modeling module comprises the following steps:
Step 201: a self-attention mechanism is applied in the key point structure modeling module to integrate the structural relationships between key points; the self-attention mechanism adds some learnable parameters, from which a series of attention weights is obtained at inference time to model the strength of the association between key points, so that the structural relationships of the object key points are learned. The key point features are fed into the self-attention module, and the similarity between key point features is computed to learn their associations: for the input key point feature sequence, a similarity is computed between the query and each key to obtain a weight;
Step 202: normalize the weights using a Softmax function;
Step 203: compute the weighted sum of the weights and the corresponding values to obtain the final key point features carrying the association relationships;
Step 204: feed the interacted key point features into a multi-layer perceptron layer to obtain the two-dimensional key point coordinates.
Further, the construction of the Transformer-based key point feature extraction module comprises image serialization and two-dimensional key point feature extraction, and the specific construction method is as follows:
(1) Serialize the input two-dimensional image: the picture $x \in \mathbb{R}^{H \times W \times C}$ is reshaped into a series of flattened 2D image blocks $x_{patch} \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the resolution of the input image, $C$ is the number of input channels, $(P, P)$ is the resolution of each image block, and $N = HW/P^2$ is the number of image blocks obtained. A trainable linear projection $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$ then maps $x_{patch}$ to $D$ dimensions; the output of this projection is called the patch embedding;
(2) Add position information to the patch embeddings, i.e. add a position code $P_{pos}$ to the sequence of blocks, and use the resulting sequence $Z_0$ as the image feature sequence:
$$Z_0 = [x_{patch}^1 E;\ x_{patch}^2 E;\ \dots;\ x_{patch}^N E] + P_{pos}, \qquad P_{pos} \in \mathbb{R}^{N \times D},$$
where $N = HW/P^2$ is the number of image blocks obtained;
(3) Predefine $J$ learnable $d$-dimensional key point embedding vectors $\mathrm{keypoints}_J$; before training begins, the $J$ learnable $d$-dimensional key point embedding vectors $\mathrm{keypoints}_J$ are randomly initialized, where $J$ denotes the number of key points;
(4) Feed the obtained image feature sequence $Z_0$ and the key point embedding vectors $\mathrm{keypoints}_J$ together as input into the Transformer encoder, so that the relationships between image blocks are learned and the image feature sequence $Z_0$ and the key point embedding vectors $\mathrm{keypoints}_J$ interact globally in the Transformer encoder. Each Transformer encoder layer consists of a multi-head self-attention (MSA) block and an MLP block; Layer Norm (LN) is applied before each block and a residual connection is applied after each block; the MLP contains one hidden layer with a GELU nonlinear activation. Finally, the Transformer encoder outputs the key point features that have interacted with the image, and these key point features are sent to the key point structure modeling module (an illustrative sketch of steps (1)-(4) follows below).
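For illustration, steps (1)-(4) could be sketched in PyTorch as follows. The class name KeypointViT and all hyperparameter values (patch size 16, embedding dimension 256, 6 layers, 8 heads) are assumptions of this sketch, and the built-in nn.TransformerEncoder is used as a stand-in for the encoder described above.

```python
import torch
import torch.nn as nn

class KeypointViT(nn.Module):
    """Sketch of the Transformer-based 2D key point feature extractor."""
    def __init__(self, img_size=256, patch_size=16, in_chans=3,
                 embed_dim=256, depth=6, num_heads=8, num_keypoints=8):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2          # N = HW / P^2
        # (1) Patch serialization plus trainable linear projection E.
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        # (2) Learnable position code P added to the patch embeddings.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        # (3) J learnable d-dimensional key point embedding vectors.
        self.keypoint_tokens = nn.Parameter(torch.randn(1, num_keypoints, embed_dim))
        # (4) Standard pre-LN Transformer encoder (MSA + MLP with GELU).
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           activation="gelu", batch_first=True,
                                           norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.num_keypoints = num_keypoints

    def forward(self, x):                                    # x: (B, 3, H, W)
        z0 = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, D) patch embeddings
        z0 = z0 + self.pos_embed
        tokens = self.keypoint_tokens.expand(x.size(0), -1, -1)
        z = torch.cat([tokens, z0], dim=1)                   # key point tokens + image blocks
        z = self.encoder(z)                                  # global interaction
        return z[:, :self.num_keypoints]                     # (B, J, D) key point features
```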
Specifically, the key point structure modeling module comprises a self-attention layer and a multi-layer perceptron layer, and the specific construction method is as follows:
(1) The key point features extracted by the Transformer-based key point feature extraction module are fed into a self-attention module, and the associations between the key point features are learned interactively. In the self-attention module, the attention mechanism is used to integrate the structural relationships between the key points of the input feature sequence; the self-attention mechanism adds some learnable parameters, from which a series of attention weights is obtained at inference time to model the strength of the association between key points, so that the structural relationships of the object key points are learned;
(2) In self-attention, each key point feature generates 3 different vectors: a Query vector (Q), a Key vector (K) and a Value vector (V), obtained by multiplying the embedding vector $X$ by three different weight matrices $W_Q$, $W_K$, $W_V$. A score is then computed for each key point vector by multiplying the Query vector (Q) with the Key vector (K), and a Softmax activation function is applied to the scores; the Softmax score determines the contribution of each key point to the encoding of the current position (the key point at that position obtains the highest Softmax score). The result is then multiplied by the Value vector (V) to obtain the output vector;
(3) The output vectors are fed into a multi-layer perceptron for dense prediction, and the key point coordinates are regressed. The key point coordinate regression head is implemented by eight MLPs with independent parameters; the MLPs generate heatmaps, and a softmax function converts each heatmap into a probability distribution map, from which the key point coordinates are obtained;
(4) The distance between the predicted key point coordinates and the ground-truth key point coordinates is then computed, and the loss is defined as follows (an illustrative sketch of this module and its loss follows below):
$$L = \frac{1}{N} \sum_{i=1}^{N} \mathrm{smooth}_{L1}\!\left(\hat{k}_i - k_i\right),$$
where $N$ is the number of key points, $\hat{k}_i$ are the predicted key point coordinates and $k_i$ are the ground-truth key point coordinates; $\mathrm{smooth}_{L1}$ is defined as follows:
$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise,} \end{cases}$$
where $|x|$ denotes the absolute distance between the predicted key point coordinates and the ground-truth key point coordinates.
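A possible PyTorch realization of the key point structure modeling module and its loss is sketched below. The class name KeypointStructureModule, the 64 × 64 heatmap resolution and the hidden sizes are illustrative assumptions, and the heatmap-to-coordinate conversion follows the soft-argmax expectation given later in the detailed description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeypointStructureModule(nn.Module):
    """Sketch: self-attention over the J key point features + per-key-point MLP heads."""
    def __init__(self, embed_dim=256, num_heads=4, num_keypoints=8, heatmap_size=64):
        super().__init__()
        # Self-attention layer that lets the key point features interact.
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        # Eight MLP regression heads with independent parameters, one per key point,
        # each producing a heatmap that softmax turns into a probability map.
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.GELU(),
                          nn.Linear(embed_dim, heatmap_size * heatmap_size))
            for _ in range(num_keypoints)])
        self.heatmap_size = heatmap_size

    def forward(self, feats):                          # feats: (B, J, D)
        attn_out, _ = self.self_attn(feats, feats, feats)
        feats = self.norm(feats + attn_out)            # features carrying associations
        coords = []
        for j, head in enumerate(self.heads):
            heatmap = head(feats[:, j])                # (B, S*S)
            prob = F.softmax(heatmap, dim=-1).view(-1, self.heatmap_size,
                                                   self.heatmap_size)
            idx = torch.arange(self.heatmap_size, device=feats.device, dtype=prob.dtype)
            x = (prob.sum(dim=1) * idx).sum(dim=-1)    # E[u]: column-index expectation
            y = (prob.sum(dim=2) * idx).sum(dim=-1)    # E[v]: row-index expectation
            coords.append(torch.stack([x, y], dim=-1))
        return torch.stack(coords, dim=1)              # (B, J, 2) key point coordinates

# Smooth L1 loss between predicted and ground-truth key point coordinates.
criterion = nn.SmoothL1Loss(reduction="mean")
```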
Further, the specific implementation of step 5, in which the pose inference module infers the final pose of the object from the extracted two-dimensional key point coordinates and the three-dimensional key points of the object model through the PnP algorithm, computes the rotation matrix and translation vector of the target object in the RGB image and is carried out according to the following steps:
Step I: acquire a two-dimensional image containing the target to be recognized, obtaining the image to be recognized;
Step II: use the Transformer-based two-dimensional key point feature extraction module of claim 2 or claim 4 to obtain the two-dimensional key point features of the target to be recognized in the image to be recognized; the predicted two-dimensional key points are projections of the predefined three-dimensional key points of the object;
Step III: use the key point structure modeling module of claim 3 or claim 5 to let the key point features interact and learn, extract two-dimensional key point features with geometric consistency for the single object, and feed the interacted features into the multi-layer perceptron layer to obtain the set of two-dimensional key point coordinates in the RGB image; the two-dimensional key point set comprises Q two-dimensional key points, Q being a positive integer;
Step IV: compute the 6D pose from the correspondence between the two-dimensional key points and the three-dimensional key points of the target object using the PnP algorithm; the three-dimensional key points are eight three-dimensional coordinate points selected on the object model with the farthest point sampling (FPS) algorithm, and PnP can estimate the three-dimensional rotation and three-dimensional translation of the target object in camera coordinates using only the correspondences between the eight two-dimensional key points and the three-dimensional key points.
Compared with the prior art, the Transformer-based 6D pose estimation method has the following advantages:
1. A pose estimation network with a pure Transformer structure is realized, exploiting the expressiveness and efficiency of the Transformer structure; the interrelations between features are learned well and the global feature relationships are modeled. In addition, it is demonstrated that a 6D pose estimation algorithm can be implemented in a pure sequence-to-sequence manner.
2. A self-attention mechanism is introduced to model the association relationships between the key point features and learn the internal geometric relationships of the object key points; the invariance of these structural features improves the accuracy of pose prediction. The accuracy exceeds that of CNN-based pose estimation networks.
Drawings
FIG. 1 is a general overview of the Transformer-based 6D pose estimation method of the present invention.
FIG. 2 is a schematic diagram of the Transformer-based object two-dimensional key point feature extraction module;
FIG. 3 is a schematic diagram of the Transformer encoder architecture.
the present invention will be described in further detail with reference to the accompanying drawings and examples.
Detailed Description
In the course of this research, the applicant noticed that the Transformer can learn the relationships between patches. In addition, in Natural Language Processing (NLP) the dominant approach is to pre-train a Transformer on a large general-purpose corpus and then fine-tune the model for different downstream tasks. There are also increasing attempts to use the Transformer model in computer vision; ViT contributes to the visual field by modeling global context information rather than only region-level relationships, and the emergence of ViT has demonstrated that Transformers can greatly improve the performance of dense recognition tasks.
Therefore, on the basis of a generic Transformer model, an image sequence and a key point feature sequence are extracted and fed together into a Transformer encoder structure for interaction, so that the two-dimensional key point features of the RGB image are extracted; a self-attention layer and a multi-layer perceptron are then added to model the structural relationships of the extracted two-dimensional key point features and to perform dense coordinate prediction, and the PnP algorithm is combined to solve the pose recognition problem in 6D pose estimation.
As shown in FIG. 1, the present embodiment provides a Transformer-based 6D pose estimation method, in which a pose estimation network comprising a Transformer-based object two-dimensional key point feature extraction module, a key point structure modeling module and a pose inference module computes the 6D pose from the correspondence between two-dimensional key points and the three-dimensional key points of the target object using the PnP algorithm; wherein:
the Transformer-based object two-dimensional key point feature extraction module extracts two-dimensional key point features of the target object from an RGB image; for an input target object image, the two-dimensional key point feature extraction module outputs a group of key point feature vectors representing the eight key point features extracted from the image;
the key point structure modeling module comprises a self-attention layer and a multi-layer perceptron layer; the self-attention layer learns the structural relationships and context information of the predicted key point features, and the multi-layer perceptron layer maps the relation-modeled feature vectors to two-dimensional coordinate points on the image, also called two-dimensional key points;
the pose inference module computes the pose of the target object from the predicted two-dimensional key points of the image and the three-dimensional key points of the target object using the PnP algorithm, and outputs the pose as a rotation matrix and a translation vector;
the method for constructing the object two-dimensional key point feature extraction module based on the Transformer comprises the following steps:
step 101: inputting a training image into a network;
step 102: segmenting the target image into Patch for serialization;
step 103: adding position embedding for the image Patch;
step 104: predefining J learnable d-dimensional key point embedding vectors;
step 105: embedding vectors of the image Patch and the d-dimensional key points into a Transformer encoder structure as input
Step 106: and the Transformer encoder structure outputs the two-dimensional key point characteristics extracted after learning.
The key point structure modeling module is constructed as follows:
Step 201: a self-attention mechanism is applied in the key point structure modeling module to integrate the structural relationships between key points; the self-attention mechanism adds some learnable parameters, from which a series of attention weights is obtained at inference time to model the strength of the association between key points, so that the structural relationships of the object key points are learned. The key point features are fed into the self-attention module, and the similarity between key point features is computed to learn their associations: for the input key point feature sequence, a similarity is computed between the query and each key to obtain a weight; common similarity functions include the dot product, concatenation, a perceptron, and so on.
Step 202: normalize the weights using a Softmax function.
Step 203: compute the weighted sum of the weights and the corresponding values to obtain the final key point features carrying the association relationships.
Step 204: feed the interacted key point features into a multi-layer perceptron layer to obtain the two-dimensional key point coordinates.
The pose inference module is constructed as follows:
Step 301: the final 6D pose is computed from the correspondence between the two-dimensional key points and the three-dimensional key points of the target object using the PnP algorithm. The three-dimensional key points are eight three-dimensional coordinate points selected on the object model with the farthest point sampling (FPS) algorithm, and PnP can estimate the three-dimensional rotation and three-dimensional translation of the target object in camera coordinates using only the correspondences between the eight two-dimensional key points and the three-dimensional key points; therefore, the pose of the object is computed by feeding the finally predicted two-dimensional key points and the corresponding three-dimensional key points of the object into the PnP algorithm.
FIG. 2 shows the key point feature extraction structure used in the network training stage; it outputs the predicted two-dimensional key point features of the target object.
Step 401: the standard Transformer input is a one-dimensional sequence of token embeddings. To process a two-dimensional image, the picture $x \in \mathbb{R}^{H \times W \times C}$ is first reshaped into a series of flattened 2D image blocks $x_{patch} \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the resolution of the input image, $C$ is the number of input channels, $(P, P)$ is the resolution of each image block, and $N = HW/P^2$ is the number of image blocks obtained. A trainable linear projection $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$ then maps $x_{patch}$ to the $D$ dimension; the output of this projection is called the patch embedding.
Step 402: the position code $P_{pos}$ is added to all input patch embeddings to retain position information, and the resulting sequence $Z_0$ is used as the input to the encoder:
$$Z_0 = [x_{patch}^1 E;\ x_{patch}^2 E;\ \dots;\ x_{patch}^N E] + P_{pos}, \qquad P_{pos} \in \mathbb{R}^{N \times D},$$
where $N = HW/P^2$ is the number of image blocks obtained.
Step 403: predefine $J$ learnable $d$-dimensional key point embedding vectors $\mathrm{keypoints}_J$; before training, the $J$ learnable $d$-dimensional key point embedding vectors $\mathrm{keypoints}_J$ are randomly initialized, where $J$ denotes the number of key points.
Step 404: the obtained image feature sequence $Z_0$ and the key point embedding vectors $\mathrm{keypoints}_J$ are fed together as input into the Transformer encoder, so that the relationships between image blocks are learned and the image feature sequence $Z_0$ and the key point embedding vectors $\mathrm{keypoints}_J$ interact globally in the Transformer encoder.
Step 405: FIG. 3 shows the structure of the Transformer encoder. Each encoder layer consists of one multi-head self-attention (MSA) block and one MLP block; Layer Norm (LN) is applied before each block and a residual connection is applied after each block; the MLP contains one hidden layer with a GELU nonlinear activation. The input of the multi-head attention layer is the triplet (query, key, value) computed from $Z_{l-1}$:
$$query = Z_{l-1} W_Q, \qquad key = Z_{l-1} W_K, \qquad value = Z_{l-1} W_V,$$
where $W_Q, W_K, W_V \in \mathbb{R}^{D \times d}$ are the learnable parameters of three linear projection layers and $d$ is the dimension of (query, key, value).
Self-attention is expressed as:
$$SA(Z_{l-1}) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d}}\right) V.$$
The multi-head attention mechanism consists of $m$ independent SA operations, i.e. $m$ groups of $(q, k, v)$ matrices, each group representing one attention operation; the $m$ outputs are concatenated and multiplied by a parameter matrix $W_O \in \mathbb{R}^{m \cdot d \times D}$ to obtain the final output of the multi-head attention layer:
$$MSA(Z_{l-1}) = [SA_1(Z_{l-1});\ SA_2(Z_{l-1});\ \dots;\ SA_m(Z_{l-1})]\, W_O.$$
$d$ is typically set to $D/m$. The output of the MSA is then transformed by the MLP block with a skip connection, and the output of the $l$-th encoder layer is expressed as (an illustrative sketch of this layer follows below):
$$Z'_l = MSA(LN(Z_{l-1})) + Z_{l-1}, \qquad Z_l = MLP(LN(Z'_l)) + Z'_l.$$
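Written out explicitly, the pre-LN encoder layer defined by these equations could look as follows; this is a generic re-implementation of the standard layer (token dimension D, m heads of size d = D/m), not code taken from the patent.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Pre-LN Transformer encoder layer: Z'_l = MSA(LN(Z_{l-1})) + Z_{l-1},
    Z_l = MLP(LN(Z'_l)) + Z'_l, matching the equations above."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.m, self.d = num_heads, dim // num_heads        # d = D / m
        # Learnable projections W_Q, W_K, W_V (all heads packed into one linear layer each).
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.w_o = nn.Linear(dim, dim, bias=False)          # W_O
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def msa(self, z):                                        # z: (B, N, D)
        B, N, _ = z.shape
        def split(t):                                        # (B, N, D) -> (B, m, N, d)
            return t.view(B, N, self.m, self.d).transpose(1, 2)
        q, k, v = split(self.w_q(z)), split(self.w_k(z)), split(self.w_v(z))
        # SA(Z) = softmax(Q K^T / sqrt(d)) V, computed per head.
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)   # concatenate the m heads
        return self.w_o(out)

    def forward(self, z):
        z = z + self.msa(self.ln1(z))    # LN before the block, residual after
        z = z + self.mlp(self.ln2(z))
        return z
```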
step 406: and finally, outputting the key point characteristics interacted with the image by the transform coder, and sending the key point characteristics to the key point structure modeling module.
The key point structure modeling module in FIG. 1 specifically operates as follows:
Step 501: the key point features extracted by the Transformer-based key point feature extraction module are fed into a self-attention module for interactive learning of the associations between the key point features. The input matrix $I \in \mathbb{R}^{d \times N}$ is multiplied by three different parameter matrices $W_q, W_k, W_v \in \mathbb{R}^{d \times d}$ to obtain three intermediate matrices $Q, K, V \in \mathbb{R}^{d \times N}$; $K$ is transposed and multiplied by $Q$ to obtain an attention matrix $A \in \mathbb{R}^{N \times N}$, a softmax operation is applied to the attention matrix, and the result is finally multiplied by $V$ to obtain the output vectors $O \in \mathbb{R}^{d \times N}$:
$$A = \mathrm{softmax}\!\left(\frac{K^{T} Q}{\sqrt{d}}\right), \qquad O = V A.$$
Step 502: the key point coordinate regression head is implemented by eight MLPs with independent parameters. The MLPs generate heatmaps, and a softmax function converts each heatmap into a probability distribution map $P(u, v)$, where $(u, v)$ is the two-dimensional position in the heatmap; the coordinates $(x_i, y_i)$ of the $i$-th key point are therefore expressed as:
$$x_i = \sum_{u}\sum_{v} u \cdot P_i(u, v), \qquad y_i = \sum_{u}\sum_{v} v \cdot P_i(u, v),$$
where $P_i(u, v)$ is the probability distribution map of the $i$-th key point and $\cdot$ denotes floating-point multiplication.
Step 503: the distance between the predicted key point coordinates and the ground-truth key point coordinates is then computed, and the loss is defined as:
$$L = \frac{1}{N} \sum_{i=1}^{N} \mathrm{smooth}_{L1}\!\left(\hat{k}_i - k_i\right),$$
where $N$ is the number of key points, $\hat{k}_i$ are the predicted key point coordinates and $k_i$ are the ground-truth key point coordinates; $\mathrm{smooth}_{L1}$ is defined as follows:
$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise.} \end{cases}$$
the attitude inference module in fig. 1 specifically comprises the following steps:
step 601: inputting a test image into a network, and predicting two-dimensional key points of a target object on an RGB image by using the trained network, wherein the predicted two-dimensional key points are projections of three-dimensional key points of a predefined object;
step 602: calculating a 6D gesture from the corresponding relation between the two-dimensional key points and the three-dimensional key points of the target object by using a PnP algorithm; the three-dimensional key points are eight three-dimensional coordinate points acquired on the object model by using a farthest point sampling algorithm (FPS), and the PnP can estimate the three-dimensional rotation and the three-dimensional translation of the target object under the camera coordinates by using only the correspondence between the 8 two-dimensional key points and the three-dimensional key points.
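Steps 601-602 could be realized, for example, as below; the farthest point sampling routine and the use of OpenCV's solvePnP with the EPnP solver are illustrative assumptions (the patent only states that FPS selects eight model points and that PnP recovers the rotation and translation).

```python
import cv2
import numpy as np

def farthest_point_sampling(model_points, k=8):
    """Pick k well-spread 3D key points on the object model (FPS sketch)."""
    selected = [np.random.randint(len(model_points))]
    dists = np.full(len(model_points), np.inf)
    for _ in range(k - 1):
        dists = np.minimum(dists, np.linalg.norm(
            model_points - model_points[selected[-1]], axis=1))
        selected.append(int(np.argmax(dists)))       # farthest remaining point
    return model_points[selected]                     # (k, 3)

def pose_from_keypoints(keypoints_2d, keypoints_3d, camera_matrix):
    """Step 602: recover R, t from the eight 2D-3D correspondences with PnP."""
    ok, rvec, tvec = cv2.solvePnP(keypoints_3d.astype(np.float64),
                                  keypoints_2d.astype(np.float64),
                                  camera_matrix, None,
                                  flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec                                    # 3x3 rotation, 3x1 translation
```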
Specific experimental example:
To demonstrate the effectiveness of the Transformer-based 6D pose estimation method presented in this example, the inventors trained and tested on the LINEMOD dataset, which contains fifteen target objects whose poses are to be computed.
The RGB training images of a target object and the corresponding ground-truth two-dimensional coordinates are input into the Transformer-based two-dimensional key point extraction network for training to obtain the corresponding trained model; a test image is then input into the trained model to predict the two-dimensional key points on the RGB image, the PnP algorithm is used to obtain the final predicted pose from the two-dimensional and three-dimensional key points, and the ADD evaluation metric is used to evaluate the accuracy of the predicted pose. As shown in Table 1 below, the Transformer-based 6D pose estimation method of this embodiment can compute the pose of the target object.
Table 1. ADD accuracy of the predicted poses on the LINEMOD dataset (the values are provided as an image in the original document).
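For reference, the ADD metric mentioned above is commonly computed as the mean distance between the object model points transformed by the predicted pose and by the ground-truth pose, with a pose usually counted as correct when this distance is below 10% of the object diameter; the sketch below follows that common definition and is not taken from the patent itself.

```python
import numpy as np

def add_metric(model_points, R_pred, t_pred, R_gt, t_gt):
    """Average distance of model points under predicted vs. ground-truth pose."""
    pred = model_points @ R_pred.T + t_pred.reshape(1, 3)
    gt = model_points @ R_gt.T + t_gt.reshape(1, 3)
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)))

def add_accuracy(errors, diameter, threshold=0.1):
    """Fraction of test images whose ADD error is below threshold * object diameter."""
    errors = np.asarray(errors)
    return float(np.mean(errors < threshold * diameter))
```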

Claims (7)

1. A Transformer-based 6D pose estimation method, characterized in that the method adopts a pose estimation network comprising a Transformer-based object two-dimensional key point feature extraction module, a key point structure modeling module and a pose inference module, and computes the 6D pose from the correspondence between two-dimensional key points and the three-dimensional key points of the target object using a PnP algorithm; wherein:
the Transformer-based object two-dimensional key point feature extraction module extracts two-dimensional key point features of the target object from an RGB image; for an input target object image, the two-dimensional key point feature extraction module outputs a group of key point feature vectors representing the eight key point features extracted from the image;
the key point structure modeling module comprises a self-attention layer and a multi-layer perceptron layer; the self-attention layer learns the structural relationships and context information of the predicted key point features, and the multi-layer perceptron layer maps the relation-modeled feature vectors to two-dimensional coordinate points on the image, also called two-dimensional key points;
the pose inference module computes the pose of the target object from the predicted two-dimensional key points of the image and the three-dimensional key points of the target object using the PnP algorithm, and outputs the pose as a rotation matrix and a translation vector.
2. The method according to claim 1, characterized in that it is carried out according to the following steps:
Step 1: acquire a number of two-dimensional image groups containing the targets to be recognized; the image groups cover fifteen target objects, and each target object has more than one thousand RGB images that differ only in acquisition angle, giving the training image set;
Step 2: data preprocessing: crop the images in the original data set so that the object appears completely in the image; the size of the cropped image is 256 × 256;
Step 3: input the training image set into the Transformer-based object two-dimensional key point feature extraction module to obtain the key point features in the image;
Step 4: input the obtained key point features into the key point structure modeling module; the key point structure modeling module exploits the geometric relationships of the key points of the same object under different viewing angles and uses a self-attention structure to let the eight key point features extracted for the same object interact, so as to extract two-dimensional key point features with geometric consistency for a single object; the features after interaction are fed into a multi-layer perceptron layer to obtain the two-dimensional key point coordinates;
Step 5: the pose inference module infers the final pose of the object from the extracted two-dimensional key point coordinates and the three-dimensional key points of the object model through the PnP algorithm.
3. The method according to claim 1, characterized in that the construction of the Transformer-based two-dimensional key point feature extraction module comprises the following steps:
Step 201: input a training image into the network;
Step 202: divide the target image into patches for serialization;
Step 203: add position embeddings to the image patches;
Step 204: predefine J learnable d-dimensional key point embedding vectors;
Step 205: feed the image patches and the d-dimensional key point embedding vectors as input into the Transformer encoder structure;
Step 206: the Transformer encoder structure outputs the two-dimensional key point features extracted after learning.
4. The method according to claim 1, characterized in that the construction of the key point structure modeling module comprises the following steps:
Step 301: a self-attention mechanism is applied in the key point structure modeling module to integrate the structural relationships between key points; the self-attention mechanism adds some learnable parameters, from which a series of attention weights is obtained at inference time to model the strength of the association between key points, so that the structural relationships of the object key points are learned; the key point features are fed into the self-attention module, and the similarity between key point features is computed to learn their associations: for the input key point feature sequence, a similarity is computed between the query and each key to obtain a weight;
Step 302: normalize the weights using a Softmax function;
Step 303: compute the weighted sum of the weights and the corresponding values to obtain the final key point features carrying the association relationships;
Step 304: feed the interacted key point features into a multi-layer perceptron layer to obtain the two-dimensional key point coordinates.
5. The method according to claim 3, characterized in that the construction of the Transformer-based key point feature extraction module further comprises image serialization and two-dimensional key point feature extraction, and the specific construction method comprises the following steps:
(1) serialize the input two-dimensional image: the picture $x \in \mathbb{R}^{H \times W \times C}$ is reshaped into a series of flattened 2D image blocks $x_{patch} \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the resolution of the input image, $C$ is the number of input channels, $(P, P)$ is the resolution of each image block, and $N = HW/P^2$ is the number of image blocks obtained; a trainable linear projection $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$ then maps $x_{patch}$ to $D$ dimensions, and the output of this projection is called the patch embedding;
(2) add position information to the patch embeddings, i.e. add a position code $P_{pos}$ to the sequence of blocks, and use the resulting sequence $Z_0$ as the image feature sequence:
$$Z_0 = [x_{patch}^1 E;\ x_{patch}^2 E;\ \dots;\ x_{patch}^N E] + P_{pos}, \qquad P_{pos} \in \mathbb{R}^{N \times D},$$
where $N = HW/P^2$ is the number of image blocks obtained;
(3) predefine $J$ learnable $d$-dimensional key point embedding vectors $\mathrm{keypoints}_J$; before training begins, the $J$ learnable $d$-dimensional key point embedding vectors $\mathrm{keypoints}_J$ are randomly initialized, where $J$ denotes the number of key points;
(4) feed the obtained image feature sequence $Z_0$ and the key point embedding vectors $\mathrm{keypoints}_J$ together as input into the Transformer encoder, so that the relationships between image blocks are learned and the image feature sequence $Z_0$ and the key point embedding vectors $\mathrm{keypoints}_J$ interact globally in the Transformer encoder; each Transformer encoder layer consists of a multi-head self-attention (MSA) block and an MLP block, Layer Norm (LN) is applied before each block and a residual connection is applied after each block, and the MLP contains one hidden layer with a GELU nonlinear activation; finally, the Transformer encoder outputs the key point features that have interacted with the image, and these key point features are sent to the key point structure modeling module.
6. The method according to claim 1, characterized in that the specific construction method of the key point structure modeling module is as follows:
(1) the key point features extracted by the Transformer-based key point feature extraction module are fed into a self-attention module, and the associations between the key point features are learned interactively; in the self-attention module, the attention mechanism is used to integrate the structural relationships between the key points of the input feature sequence; the self-attention mechanism adds some learnable parameters, from which a series of attention weights is obtained at inference time to model the strength of the association between key points, so that the structural relationships of the object key points are learned;
(2) in self-attention, each key point feature generates 3 different vectors, namely a Query vector (Q), a Key vector (K) and a Value vector (V), obtained by multiplying the embedding vector $X$ by three different weight matrices $W_Q$, $W_K$, $W_V$; a score is then computed for each key point vector by multiplying the Query vector (Q) with the Key vector (K), and a Softmax activation function is applied to the scores; the Softmax score determines the contribution of each key point to the encoding of the current position, the key point at that position obtaining the highest Softmax score; the result is then multiplied by the Value vector (V) to obtain the output vector;
(3) the output vectors are fed into a multi-layer perceptron for dense prediction, and the key point coordinates are regressed; the key point coordinate regression head is implemented by eight MLPs with independent parameters; the MLPs generate heatmaps, and a softmax function converts each heatmap into a probability distribution map, from which the key point coordinates are obtained;
(4) the distance between the predicted key point coordinates and the ground-truth key point coordinates is then computed, and the loss is defined as:
$$L = \frac{1}{N} \sum_{i=1}^{N} \mathrm{smooth}_{L1}\!\left(\hat{k}_i - k_i\right),$$
where $N$ is the number of key points, $\hat{k}_i$ are the predicted key point coordinates and $k_i$ are the ground-truth key point coordinates; $\mathrm{smooth}_{L1}$ is defined as follows:
$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise,} \end{cases}$$
where $|x|$ denotes the absolute distance between the predicted key point coordinates and the ground-truth key point coordinates.
7. The method according to claim 2, characterized in that in step 5 the pose inference module infers the final pose of the object from the extracted two-dimensional key point coordinates and the three-dimensional key points of the object model through the PnP algorithm by computing the rotation matrix and translation vector of the target object in the RGB image, and is carried out according to the following steps:
Step I: acquire a two-dimensional image containing the target to be recognized, obtaining the image to be recognized;
Step II: use the Transformer-based two-dimensional key point feature extraction module to obtain the two-dimensional key point features of the target to be recognized in the image to be recognized; the predicted two-dimensional key points are projections of the predefined three-dimensional key points of the object;
Step III: use the key point structure modeling module to let the key point features interact and learn, extract two-dimensional key point features with geometric consistency for the single object, and feed the interacted features into the multi-layer perceptron layer to obtain the set of two-dimensional key point coordinates in the RGB image; the two-dimensional key point set comprises Q two-dimensional key points, Q being a positive integer;
Step IV: compute the 6D pose from the correspondence between the two-dimensional key points and the three-dimensional key points of the target object using the PnP algorithm; the three-dimensional key points are eight three-dimensional coordinate points selected on the object model with the farthest point sampling (FPS) algorithm, and PnP can estimate the three-dimensional rotation and three-dimensional translation of the target object in camera coordinates using the correspondences between the eight two-dimensional key points and the three-dimensional key points.
CN202210759936.7A 2022-06-29 2022-06-29 6D attitude estimation method based on Transformer Pending CN115331301A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210759936.7A CN115331301A (en) 2022-06-29 2022-06-29 6D attitude estimation method based on Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210759936.7A CN115331301A (en) 2022-06-29 2022-06-29 6D attitude estimation method based on Transformer

Publications (1)

Publication Number Publication Date
CN115331301A 2022-11-11

Family

ID=83918023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210759936.7A Pending CN115331301A (en) 2022-06-29 2022-06-29 6D attitude estimation method based on Transformer

Country Status (1)

Country Link
CN (1) CN115331301A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117237451A (en) * 2023-09-15 2023-12-15 Nanjing University of Aeronautics and Astronautics Industrial part 6D pose estimation method based on contour reconstruction and geometric guidance
CN117237451B (en) * 2023-09-15 2024-04-02 Nanjing University of Aeronautics and Astronautics Industrial part 6D pose estimation method based on contour reconstruction and geometric guidance

Similar Documents

Publication Publication Date Title
CN110135375B (en) Multi-person attitude estimation method based on global information integration
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
WO2020108362A1 (en) Body posture detection method, apparatus and device, and storage medium
CN112766158A (en) Multi-task cascading type face shielding expression recognition method
CN115240121B (en) Joint modeling method and device for enhancing local features of pedestrians
CN114332639B (en) Satellite attitude vision measurement method of nonlinear residual error self-attention mechanism
CN111695523B (en) Double-flow convolutional neural network action recognition method based on skeleton space-time and dynamic information
CN112348033B (en) Collaborative saliency target detection method
CN116258757A (en) Monocular image depth estimation method based on multi-scale cross attention
WO2024114321A1 (en) Image data processing method and apparatus, computer device, computer-readable storage medium, and computer program product
CN114140831B (en) Human body posture estimation method and device, electronic equipment and storage medium
CN114638408A (en) Pedestrian trajectory prediction method based on spatiotemporal information
CN115331301A (en) 6D attitude estimation method based on Transformer
CN114913342A (en) Motion blurred image line segment detection method and system fusing event and image
CN117727022A (en) Three-dimensional point cloud target detection method based on transform sparse coding and decoding
CN116453025A (en) Volleyball match group behavior identification method integrating space-time information in frame-missing environment
Zhao et al. Adaptive Dual-Stream Sparse Transformer Network for Salient Object Detection in Optical Remote Sensing Images
CN114119999B (en) Iterative 6D pose estimation method and device based on deep learning
CN116703996A (en) Monocular three-dimensional target detection algorithm based on instance-level self-adaptive depth estimation
CN115496859A (en) Three-dimensional scene motion trend estimation method based on scattered point cloud cross attention learning
CN114187360B (en) Head pose estimation method based on deep learning and quaternion
CN116740795B (en) Expression recognition method, model and model training method based on attention mechanism
CN118172387A (en) Attention mechanism-based lightweight multi-target tracking method
CN117953561A (en) Space-time zone three-flow micro-expression recognition method based on transducer and saliency map
CN118072395A (en) Dynamic gesture recognition method combining multi-mode inter-frame motion and shared attention weight

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination