CN115331301A - 6D pose estimation method based on Transformer - Google Patents

6D pose estimation method based on Transformer

Info

Publication number
CN115331301A
Authority
CN
China
Prior art keywords
key point
dimensional
image
key
key points
Legal status
Pending
Application number
CN202210759936.7A
Other languages
Chinese (zh)
Inventor
赵国英 (Zhao Guoying)
姜媛 (Jiang Yuan)
赵万青 (Zhao Wanqing)
张少博 (Zhang Shaobo)
彭先霖 (Peng Xianlin)
李斌 (Li Bin)
汪霖 (Wang Lin)
王珺 (Wang Jun)
彭进业 (Peng Jinye)
Current Assignee
Northwest University
Original Assignee
Northwest University
Priority date
2022-06-29
Filing date
2022-06-29
Publication date
2022-11-11
Application filed by Northwest University
Priority to CN202210759936.7A
Publication of CN115331301A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10024 Color image
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a Transformer-based 6D pose estimation method. The method adopts a pose estimation network comprising a Transformer-based object two-dimensional key point feature extraction module, a key point structure modeling module and a pose inference module, and computes the 6D pose from the correspondence between the two-dimensional key points and the three-dimensional key points of the target object using a PnP algorithm. The Transformer-based two-dimensional key point feature extraction module extracts two-dimensional key point features of the target object from an RGB image; the key point structure modeling module uses self-attention to learn the structural relationships and context information of the extracted key point features and to predict the key point coordinates; the pose inference module computes the pose of the target object from the predicted two-dimensional key points and the three-dimensional key points of the target object using the PnP algorithm. By exploiting the expressiveness and efficiency of the Transformer structure, a pose estimation network with a pure Transformer architecture is realized, and the accuracy of 6D pose estimation is improved by fully using the geometric invariance of the key point structure; the accuracy exceeds that of CNN-based pose estimation networks.

Description

6D pose estimation method based on Transformer
Technical Field
The invention belongs to the field of image detection, and particularly relates to a Transformer-based 6D pose estimation method for efficiently and accurately estimating the three-dimensional pose of an object in an RGB image.
Background
6D pose estimation of an object refers to detecting an object appearing in an image and estimating its 3D position and orientation. Object pose estimation from a single image has long been an important research topic in computer vision; moreover, 6D object pose estimation is crucial to augmented reality, virtual reality, robotic grasping and autonomous (unmanned) driving.
At present, 6D pose estimation algorithms based on depth images have achieved good results, but acquiring depth images depends on an RGB-D camera. Image acquisition with an RGB-D camera is limited by factors such as resolution, field of view and frame rate, and such devices are too bulky to be integrated into wearable equipment for real-time pose estimation of moving objects.
Therefore, schemes based on RGB images are widely studied. Conventional 6D pose estimation methods are mainly divided into feature-point-based methods and template-based methods, but these schemes still have many limitations: for example, feature-point-based methods cannot handle pose estimation of texture-less objects, and template-based methods cannot handle pose estimation of occluded objects.
With the advent of deep learning, and in particular Convolutional Neural Networks (CNNs), the accuracy and robustness of monocular 6D object pose estimation have kept improving, sometimes even exceeding methods that rely on depth data. Most existing pose estimation algorithms use a convolutional neural network to regress the object pose directly or indirectly, but they still face many problems. Algorithms that regress the pose directly require a large number of trained parameters and yield slightly lower pose accuracy. Two-stage pose estimation algorithms compute the object pose from object key points and camera parameters; although their accuracy is higher than that of direct regression, they do not model the structural relationships between the object key points.
Disclosure of Invention
In view of the above drawbacks and deficiencies of the prior art, an object of the present invention is to provide a Transformer-based 6D pose estimation method for efficiently and accurately estimating the three-dimensional pose of an object in an RGB image.
In order to accomplish this task, the invention adopts the following technical solution:
A Transformer-based 6D pose estimation method, characterized in that the method adopts a pose estimation network comprising a Transformer-based object two-dimensional key point feature extraction module, a key point structure modeling module and a pose inference module, and computes the 6D pose from the correspondence between two-dimensional key points and the three-dimensional key points of the target object using a PnP algorithm; wherein:
the Transformer-based object two-dimensional key point feature extraction module extracts two-dimensional key point features of the target object from an RGB image; for an input target object image, the two-dimensional key point feature extraction module outputs a group of key point feature vectors representing the eight key point features extracted from the image;
the key point structure modeling module comprises a self-attention layer and a multi-layer perceptron layer; the self-attention layer learns the structural relationships and context information of the predicted key point features, and the multi-layer perceptron layer maps the relation-modeled feature vectors to two-dimensional coordinate points on the image, also called two-dimensional key points;
the pose inference module computes the pose of the target object from the predicted two-dimensional key points of the image and the three-dimensional key points of the target object using the PnP algorithm, and outputs the computed pose as a rotation matrix and a translation vector.
According to the invention, the method is specifically implemented according to the following steps:
Step 1: acquire a number of two-dimensional image groups containing the targets to be recognized; the image groups cover fifteen target objects, and each target object has more than one thousand RGB images that differ only in acquisition angle, giving the training image set;
Step 2: data preprocessing: crop the images in the original data set so that the object appears completely in the image; the size of the cropped image is 256 × 256;
Step 3: input the training image set into the Transformer-based object two-dimensional key point feature extraction module to obtain the key point features in the image;
Step 4: input the obtained key point features into the key point structure modeling module; the key point structure modeling module exploits the geometric relationships of the key points of the same object under different viewing angles and uses a self-attention structure to let the eight key point features extracted for the same object interact, so as to extract two-dimensional key point features with geometric consistency for a single object; the features after interaction are fed into a multi-layer perceptron layer to obtain the two-dimensional key point coordinates;
Step 5: the pose inference module infers the final pose of the object from the extracted two-dimensional key point coordinates and the three-dimensional key points of the object model through the PnP algorithm (an end-to-end sketch of these five steps follows below).
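As a rough illustration only, the following Python sketch chains the three modules for one test image. The class names for the feature extractor and the structure modeling module (KeypointViT and KeypointStructureModule, sketched further below), the plain 256 × 256 resize used in place of the crop, and the direct call to OpenCV's solvePnP are assumptions of this sketch rather than details fixed by the patent.

```python
import cv2
import numpy as np
import torch

@torch.no_grad()
def estimate_pose(image_bgr, extractor, structure_module, keypoints_3d, camera_matrix):
    """Steps 2-5 for one image: preprocess, extract key point features,
    regress 2D key points, then recover the 6D pose with PnP."""
    crop = cv2.resize(image_bgr, (256, 256))                    # step 2 (simplified crop)
    x = torch.from_numpy(crop).float().permute(2, 0, 1)[None] / 255.0
    feats = extractor(x)                                        # step 3: (1, 8, D) key point features
    kps_2d = structure_module(feats)[0].cpu().numpy()           # step 4: (8, 2) 2D coordinates
    _, rvec, tvec = cv2.solvePnP(keypoints_3d.astype(np.float64),
                                 kps_2d.astype(np.float64),
                                 camera_matrix, None)           # step 5: PnP on 2D-3D pairs
    R, _ = cv2.Rodrigues(rvec)                                  # 3x3 rotation matrix
    return R, tvec                                              # rotation and translation
```

In practice the crop in step 2 would be driven by a detector or a ground-truth bounding box, and the camera matrix must correspond to the cropped image coordinates.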
Specifically, the construction of the Transformer-based key point feature extraction module comprises the following steps:
Step 101: input a training image into the network;
Step 102: divide the target image into patches for serialization;
Step 103: add position embeddings to the image patches;
Step 104: predefine J learnable d-dimensional key point embedding vectors;
Step 105: feed the image patches and the d-dimensional key point embedding vectors as input into the Transformer encoder structure;
Step 106: the Transformer encoder structure outputs the two-dimensional key point features extracted after learning.
Specifically, the construction of the key point structure modeling module comprises the following steps:
Step 201: a self-attention mechanism is applied in the key point structure modeling module to integrate the structural relationships between key points; the self-attention mechanism adds some learnable parameters, from which a series of attention weights is obtained at inference time to model the strength of the association between key points, so that the structural relationships of the object key points are learned. The key point features are fed into the self-attention module, and the similarity between key point features is computed to learn their associations: for the input key point feature sequence, a similarity is computed between the query and each key to obtain a weight;
Step 202: normalize the weights using a Softmax function;
Step 203: compute the weighted sum of the weights and the corresponding values to obtain the final key point features carrying the association relationships;
Step 204: feed the interacted key point features into a multi-layer perceptron layer to obtain the two-dimensional key point coordinates.
Further, the construction of the Transformer-based key point feature extraction module comprises image serialization and two-dimensional key point feature extraction, and the specific construction method is as follows:
(1) Serialize the input two-dimensional image: the picture $x \in \mathbb{R}^{H \times W \times C}$ is reshaped into a series of flattened 2D image blocks $x_{patch} \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the resolution of the input image, $C$ is the number of input channels, $(P, P)$ is the resolution of each image block, and $N = HW/P^2$ is the number of image blocks obtained. A trainable linear projection $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$ then maps $x_{patch}$ to $D$ dimensions; the output of this projection is called the patch embedding;
(2) Add position information to the patch embeddings, i.e. add a position code $P_{pos}$ to the sequence of blocks, and use the resulting sequence $Z_0$ as the image feature sequence:
$$Z_0 = [x_{patch}^1 E;\ x_{patch}^2 E;\ \dots;\ x_{patch}^N E] + P_{pos}, \qquad P_{pos} \in \mathbb{R}^{N \times D},$$
where $N = HW/P^2$ is the number of image blocks obtained;
(3) Predefine $J$ learnable $d$-dimensional key point embedding vectors $\mathrm{keypoints}_J$; before training begins, the $J$ learnable $d$-dimensional key point embedding vectors $\mathrm{keypoints}_J$ are randomly initialized, where $J$ denotes the number of key points;
(4) Feed the obtained image feature sequence $Z_0$ and the key point embedding vectors $\mathrm{keypoints}_J$ together as input into the Transformer encoder, so that the relationships between image blocks are learned and the image feature sequence $Z_0$ and the key point embedding vectors $\mathrm{keypoints}_J$ interact globally in the Transformer encoder. Each Transformer encoder layer consists of a multi-head self-attention (MSA) block and an MLP block; Layer Norm (LN) is applied before each block and a residual connection is applied after each block; the MLP contains one hidden layer with a GELU nonlinear activation. Finally, the Transformer encoder outputs the key point features that have interacted with the image, and these key point features are sent to the key point structure modeling module (an illustrative sketch of steps (1)-(4) follows below).
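For illustration, steps (1)-(4) could be sketched in PyTorch as follows. The class name KeypointViT and all hyperparameter values (patch size 16, embedding dimension 256, 6 layers, 8 heads) are assumptions of this sketch, and the built-in nn.TransformerEncoder is used as a stand-in for the encoder described above.

```python
import torch
import torch.nn as nn

class KeypointViT(nn.Module):
    """Sketch of the Transformer-based 2D key point feature extractor."""
    def __init__(self, img_size=256, patch_size=16, in_chans=3,
                 embed_dim=256, depth=6, num_heads=8, num_keypoints=8):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2          # N = HW / P^2
        # (1) Patch serialization plus trainable linear projection E.
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        # (2) Learnable position code P added to the patch embeddings.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        # (3) J learnable d-dimensional key point embedding vectors.
        self.keypoint_tokens = nn.Parameter(torch.randn(1, num_keypoints, embed_dim))
        # (4) Standard pre-LN Transformer encoder (MSA + MLP with GELU).
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           activation="gelu", batch_first=True,
                                           norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.num_keypoints = num_keypoints

    def forward(self, x):                                    # x: (B, 3, H, W)
        z0 = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, D) patch embeddings
        z0 = z0 + self.pos_embed
        tokens = self.keypoint_tokens.expand(x.size(0), -1, -1)
        z = torch.cat([tokens, z0], dim=1)                   # key point tokens + image blocks
        z = self.encoder(z)                                  # global interaction
        return z[:, :self.num_keypoints]                     # (B, J, D) key point features
```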
Specifically, the key point structure modeling module comprises a self-attention layer and a multi-layer perceptron layer, and the specific construction method is as follows:
(1) The key point features extracted by the Transformer-based key point feature extraction module are fed into a self-attention module, and the associations between the key point features are learned interactively. In the self-attention module, the attention mechanism is used to integrate the structural relationships between the key points of the input feature sequence; the self-attention mechanism adds some learnable parameters, from which a series of attention weights is obtained at inference time to model the strength of the association between key points, so that the structural relationships of the object key points are learned;
(2) In self-attention, each key point feature generates 3 different vectors: a Query vector (Q), a Key vector (K) and a Value vector (V), obtained by multiplying the embedding vector $X$ by three different weight matrices $W_Q$, $W_K$, $W_V$. A score is then computed for each key point vector by multiplying the Query vector (Q) with the Key vector (K), and a Softmax activation function is applied to the scores; the Softmax score determines the contribution of each key point to the encoding of the current position (the key point at that position obtains the highest Softmax score). The result is then multiplied by the Value vector (V) to obtain the output vector;
(3) The output vectors are fed into a multi-layer perceptron for dense prediction, and the key point coordinates are regressed. The key point coordinate regression head is implemented by eight MLPs with independent parameters; the MLPs generate heatmaps, and a softmax function converts each heatmap into a probability distribution map, from which the key point coordinates are obtained;
(4) The distance between the predicted key point coordinates and the ground-truth key point coordinates is then computed, and the loss is defined as follows (an illustrative sketch of this module and its loss follows below):
$$L = \frac{1}{N} \sum_{i=1}^{N} \mathrm{smooth}_{L1}\!\left(\hat{k}_i - k_i\right),$$
where $N$ is the number of key points, $\hat{k}_i$ are the predicted key point coordinates and $k_i$ are the ground-truth key point coordinates; $\mathrm{smooth}_{L1}$ is defined as follows:
$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise,} \end{cases}$$
where $|x|$ denotes the absolute distance between the predicted key point coordinates and the ground-truth key point coordinates.
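A possible PyTorch realization of the key point structure modeling module and its loss is sketched below. The class name KeypointStructureModule, the 64 × 64 heatmap resolution and the hidden sizes are illustrative assumptions, and the heatmap-to-coordinate conversion follows the soft-argmax expectation given later in the detailed description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeypointStructureModule(nn.Module):
    """Sketch: self-attention over the J key point features + per-key-point MLP heads."""
    def __init__(self, embed_dim=256, num_heads=4, num_keypoints=8, heatmap_size=64):
        super().__init__()
        # Self-attention layer that lets the key point features interact.
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        # Eight MLP regression heads with independent parameters, one per key point,
        # each producing a heatmap that softmax turns into a probability map.
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.GELU(),
                          nn.Linear(embed_dim, heatmap_size * heatmap_size))
            for _ in range(num_keypoints)])
        self.heatmap_size = heatmap_size

    def forward(self, feats):                          # feats: (B, J, D)
        attn_out, _ = self.self_attn(feats, feats, feats)
        feats = self.norm(feats + attn_out)            # features carrying associations
        coords = []
        for j, head in enumerate(self.heads):
            heatmap = head(feats[:, j])                # (B, S*S)
            prob = F.softmax(heatmap, dim=-1).view(-1, self.heatmap_size,
                                                   self.heatmap_size)
            idx = torch.arange(self.heatmap_size, device=feats.device, dtype=prob.dtype)
            x = (prob.sum(dim=1) * idx).sum(dim=-1)    # E[u]: column-index expectation
            y = (prob.sum(dim=2) * idx).sum(dim=-1)    # E[v]: row-index expectation
            coords.append(torch.stack([x, y], dim=-1))
        return torch.stack(coords, dim=1)              # (B, J, 2) key point coordinates

# Smooth L1 loss between predicted and ground-truth key point coordinates.
criterion = nn.SmoothL1Loss(reduction="mean")
```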
Further, the specific implementation of step 5, in which the pose inference module infers the final pose of the object from the extracted two-dimensional key point coordinates and the three-dimensional key points of the object model through the PnP algorithm, computes the rotation matrix and translation vector of the target object in the RGB image and is carried out according to the following steps:
Step I: acquire a two-dimensional image containing the target to be recognized, obtaining the image to be recognized;
Step II: use the Transformer-based two-dimensional key point feature extraction module of claim 2 or claim 4 to obtain the two-dimensional key point features of the target to be recognized in the image to be recognized; the predicted two-dimensional key points are projections of the predefined three-dimensional key points of the object;
Step III: use the key point structure modeling module of claim 3 or claim 5 to let the key point features interact and learn, extract two-dimensional key point features with geometric consistency for the single object, and feed the interacted features into the multi-layer perceptron layer to obtain the set of two-dimensional key point coordinates in the RGB image; the two-dimensional key point set comprises Q two-dimensional key points, Q being a positive integer;
Step IV: compute the 6D pose from the correspondence between the two-dimensional key points and the three-dimensional key points of the target object using the PnP algorithm; the three-dimensional key points are eight three-dimensional coordinate points selected on the object model with the farthest point sampling (FPS) algorithm, and PnP can estimate the three-dimensional rotation and three-dimensional translation of the target object in camera coordinates using only the correspondences between the eight two-dimensional key points and the three-dimensional key points.
Compared with the prior art, the Transformer-based 6D pose estimation method has the following advantages:
1. A pose estimation network with a pure Transformer structure is realized, exploiting the expressiveness and efficiency of the Transformer structure; the interrelations between features are learned well and the global feature relationships are modeled. In addition, it is demonstrated that a 6D pose estimation algorithm can be implemented in a pure sequence-to-sequence manner.
2. A self-attention mechanism is introduced to model the association relationships between the key point features and learn the internal geometric relationships of the object key points; the invariance of these structural features improves the accuracy of pose prediction. The accuracy exceeds that of CNN-based pose estimation networks.
Drawings
FIG. 1 is a general overview of the Transformer-based 6D pose estimation method of the present invention.
FIG. 2 is a schematic diagram of the Transformer-based object two-dimensional key point feature extraction module;
FIG. 3 is a schematic diagram of the Transformer encoder architecture.
the present invention will be described in further detail with reference to the accompanying drawings and examples.
Detailed Description
In the course of this research, the applicant noticed that the Transformer can learn the relationships between patches. In addition, in Natural Language Processing (NLP) the dominant approach is to pre-train a Transformer on a large general-purpose corpus and then fine-tune the model for different downstream tasks. There are also increasing attempts to use the Transformer model in computer vision; ViT contributes to the visual field by modeling global context information rather than only region-level relationships, and the emergence of ViT has demonstrated that Transformers can greatly improve the performance of dense recognition tasks.
Therefore, on the basis of a generic Transformer model, an image sequence and a key point feature sequence are extracted and fed together into a Transformer encoder structure for interaction, so that the two-dimensional key point features of the RGB image are extracted; a self-attention layer and a multi-layer perceptron are then added to model the structural relationships of the extracted two-dimensional key point features and to perform dense coordinate prediction, and the PnP algorithm is combined to solve the pose recognition problem in 6D pose estimation.
As shown in FIG. 1, the present embodiment provides a Transformer-based 6D pose estimation method, in which a pose estimation network comprising a Transformer-based object two-dimensional key point feature extraction module, a key point structure modeling module and a pose inference module computes the 6D pose from the correspondence between two-dimensional key points and the three-dimensional key points of the target object using the PnP algorithm; wherein:
the Transformer-based object two-dimensional key point feature extraction module extracts two-dimensional key point features of the target object from an RGB image; for an input target object image, the two-dimensional key point feature extraction module outputs a group of key point feature vectors representing the eight key point features extracted from the image;
the key point structure modeling module comprises a self-attention layer and a multi-layer perceptron layer; the self-attention layer learns the structural relationships and context information of the predicted key point features, and the multi-layer perceptron layer maps the relation-modeled feature vectors to two-dimensional coordinate points on the image, also called two-dimensional key points;
the pose inference module computes the pose of the target object from the predicted two-dimensional key points of the image and the three-dimensional key points of the target object using the PnP algorithm, and outputs the pose as a rotation matrix and a translation vector;
the method for constructing the object two-dimensional key point feature extraction module based on the Transformer comprises the following steps:
step 101: inputting a training image into a network;
step 102: segmenting the target image into Patch for serialization;
step 103: adding position embedding for the image Patch;
step 104: predefining J learnable d-dimensional key point embedding vectors;
step 105: embedding vectors of the image Patch and the d-dimensional key points into a Transformer encoder structure as input
Step 106: and the Transformer encoder structure outputs the two-dimensional key point characteristics extracted after learning.
The key point structure modeling module is constructed as follows:
Step 201: a self-attention mechanism is applied in the key point structure modeling module to integrate the structural relationships between key points; the self-attention mechanism adds some learnable parameters, from which a series of attention weights is obtained at inference time to model the strength of the association between key points, so that the structural relationships of the object key points are learned. The key point features are fed into the self-attention module, and the similarity between key point features is computed to learn their associations: for the input key point feature sequence, a similarity is computed between the query and each key to obtain a weight; common similarity functions include the dot product, concatenation, a perceptron, and so on.
Step 202: normalize the weights using a Softmax function.
Step 203: compute the weighted sum of the weights and the corresponding values to obtain the final key point features carrying the association relationships.
Step 204: feed the interacted key point features into a multi-layer perceptron layer to obtain the two-dimensional key point coordinates.
The pose inference module is constructed as follows:
Step 301: the final 6D pose is computed from the correspondence between the two-dimensional key points and the three-dimensional key points of the target object using the PnP algorithm. The three-dimensional key points are eight three-dimensional coordinate points selected on the object model with the farthest point sampling (FPS) algorithm, and PnP can estimate the three-dimensional rotation and three-dimensional translation of the target object in camera coordinates using only the correspondences between the eight two-dimensional key points and the three-dimensional key points; therefore, the pose of the object is computed by feeding the finally predicted two-dimensional key points and the corresponding three-dimensional key points of the object into the PnP algorithm.
FIG. 2 shows the key point feature extraction structure used in the network training stage; it outputs the predicted two-dimensional key point features of the target object.
Step 401: the standard Transformer input is a one-dimensional sequence of token embeddings. To process a two-dimensional image, the picture $x \in \mathbb{R}^{H \times W \times C}$ is first reshaped into a series of flattened 2D image blocks $x_{patch} \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the resolution of the input image, $C$ is the number of input channels, $(P, P)$ is the resolution of each image block, and $N = HW/P^2$ is the number of image blocks obtained. A trainable linear projection $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$ then maps $x_{patch}$ to the $D$ dimension; the output of this projection is called the patch embedding.
Step 402: the position code $P_{pos}$ is added to all input patch embeddings to retain position information, and the resulting sequence $Z_0$ is used as the input to the encoder:
$$Z_0 = [x_{patch}^1 E;\ x_{patch}^2 E;\ \dots;\ x_{patch}^N E] + P_{pos}, \qquad P_{pos} \in \mathbb{R}^{N \times D},$$
where $N = HW/P^2$ is the number of image blocks obtained.
Step 403: predefine $J$ learnable $d$-dimensional key point embedding vectors $\mathrm{keypoints}_J$; before training, the $J$ learnable $d$-dimensional key point embedding vectors $\mathrm{keypoints}_J$ are randomly initialized, where $J$ denotes the number of key points.
Step 404: the obtained image feature sequence $Z_0$ and the key point embedding vectors $\mathrm{keypoints}_J$ are fed together as input into the Transformer encoder, so that the relationships between image blocks are learned and the image feature sequence $Z_0$ and the key point embedding vectors $\mathrm{keypoints}_J$ interact globally in the Transformer encoder.
Step 405: FIG. 3 shows the structure of the Transformer encoder. Each encoder layer consists of one multi-head self-attention (MSA) block and one MLP block; Layer Norm (LN) is applied before each block and a residual connection is applied after each block; the MLP contains one hidden layer with a GELU nonlinear activation. The input of the multi-head attention layer is the triplet (query, key, value) computed from $Z_{l-1}$:
$$query = Z_{l-1} W_Q, \qquad key = Z_{l-1} W_K, \qquad value = Z_{l-1} W_V,$$
where $W_Q, W_K, W_V \in \mathbb{R}^{D \times d}$ are the learnable parameters of three linear projection layers and $d$ is the dimension of (query, key, value).
Self-attention is expressed as:
$$SA(Z_{l-1}) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d}}\right) V.$$
The multi-head attention mechanism consists of $m$ independent SA operations, i.e. $m$ groups of $(q, k, v)$ matrices, each group representing one attention operation; the $m$ outputs are concatenated and multiplied by a parameter matrix $W_O \in \mathbb{R}^{m \cdot d \times D}$ to obtain the final output of the multi-head attention layer:
$$MSA(Z_{l-1}) = [SA_1(Z_{l-1});\ SA_2(Z_{l-1});\ \dots;\ SA_m(Z_{l-1})]\, W_O.$$
$d$ is typically set to $D/m$. The output of the MSA is then transformed by the MLP block with a skip connection, and the output of the $l$-th encoder layer is expressed as (an illustrative sketch of this layer follows below):
$$Z'_l = MSA(LN(Z_{l-1})) + Z_{l-1}, \qquad Z_l = MLP(LN(Z'_l)) + Z'_l.$$
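Written out explicitly, the pre-LN encoder layer defined by these equations could look as follows; this is a generic re-implementation of the standard layer (token dimension D, m heads of size d = D/m), not code taken from the patent.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Pre-LN Transformer encoder layer: Z'_l = MSA(LN(Z_{l-1})) + Z_{l-1},
    Z_l = MLP(LN(Z'_l)) + Z'_l, matching the equations above."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.m, self.d = num_heads, dim // num_heads        # d = D / m
        # Learnable projections W_Q, W_K, W_V (all heads packed into one linear layer each).
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.w_o = nn.Linear(dim, dim, bias=False)          # W_O
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def msa(self, z):                                        # z: (B, N, D)
        B, N, _ = z.shape
        def split(t):                                        # (B, N, D) -> (B, m, N, d)
            return t.view(B, N, self.m, self.d).transpose(1, 2)
        q, k, v = split(self.w_q(z)), split(self.w_k(z)), split(self.w_v(z))
        # SA(Z) = softmax(Q K^T / sqrt(d)) V, computed per head.
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)   # concatenate the m heads
        return self.w_o(out)

    def forward(self, z):
        z = z + self.msa(self.ln1(z))    # LN before the block, residual after
        z = z + self.mlp(self.ln2(z))
        return z
```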
step 406: and finally, outputting the key point characteristics interacted with the image by the transform coder, and sending the key point characteristics to the key point structure modeling module.
The key point structure modeling module in FIG. 1 specifically operates as follows:
Step 501: the key point features extracted by the Transformer-based key point feature extraction module are fed into a self-attention module for interactive learning of the associations between the key point features. The input matrix $I \in \mathbb{R}^{d \times N}$ is multiplied by three different parameter matrices $W_q, W_k, W_v \in \mathbb{R}^{d \times d}$ to obtain three intermediate matrices $Q, K, V \in \mathbb{R}^{d \times N}$; $K$ is transposed and multiplied by $Q$ to obtain an attention matrix $A \in \mathbb{R}^{N \times N}$, a softmax operation is applied to the attention matrix, and the result is finally multiplied by $V$ to obtain the output vectors $O \in \mathbb{R}^{d \times N}$:
$$A = \mathrm{softmax}\!\left(\frac{K^{T} Q}{\sqrt{d}}\right), \qquad O = V A.$$
Step 502: the key point coordinate regression head is implemented by eight MLPs with independent parameters. The MLPs generate heatmaps, and a softmax function converts each heatmap into a probability distribution map $P(u, v)$, where $(u, v)$ is the two-dimensional position in the heatmap; the coordinates $(x_i, y_i)$ of the $i$-th key point are therefore expressed as:
$$x_i = \sum_{u}\sum_{v} u \cdot P_i(u, v), \qquad y_i = \sum_{u}\sum_{v} v \cdot P_i(u, v),$$
where $P_i(u, v)$ is the probability distribution map of the $i$-th key point and $\cdot$ denotes floating-point multiplication.
Step 503: the distance between the predicted key point coordinates and the ground-truth key point coordinates is then computed, and the loss is defined as:
$$L = \frac{1}{N} \sum_{i=1}^{N} \mathrm{smooth}_{L1}\!\left(\hat{k}_i - k_i\right),$$
where $N$ is the number of key points, $\hat{k}_i$ are the predicted key point coordinates and $k_i$ are the ground-truth key point coordinates; $\mathrm{smooth}_{L1}$ is defined as follows:
$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise.} \end{cases}$$
the attitude inference module in fig. 1 specifically comprises the following steps:
step 601: inputting a test image into a network, and predicting two-dimensional key points of a target object on an RGB image by using the trained network, wherein the predicted two-dimensional key points are projections of three-dimensional key points of a predefined object;
step 602: calculating a 6D gesture from the corresponding relation between the two-dimensional key points and the three-dimensional key points of the target object by using a PnP algorithm; the three-dimensional key points are eight three-dimensional coordinate points acquired on the object model by using a farthest point sampling algorithm (FPS), and the PnP can estimate the three-dimensional rotation and the three-dimensional translation of the target object under the camera coordinates by using only the correspondence between the 8 two-dimensional key points and the three-dimensional key points.
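Steps 601-602 could be realized, for example, as below; the farthest point sampling routine and the use of OpenCV's solvePnP with the EPnP solver are illustrative assumptions (the patent only states that FPS selects eight model points and that PnP recovers the rotation and translation).

```python
import cv2
import numpy as np

def farthest_point_sampling(model_points, k=8):
    """Pick k well-spread 3D key points on the object model (FPS sketch)."""
    selected = [np.random.randint(len(model_points))]
    dists = np.full(len(model_points), np.inf)
    for _ in range(k - 1):
        dists = np.minimum(dists, np.linalg.norm(
            model_points - model_points[selected[-1]], axis=1))
        selected.append(int(np.argmax(dists)))       # farthest remaining point
    return model_points[selected]                     # (k, 3)

def pose_from_keypoints(keypoints_2d, keypoints_3d, camera_matrix):
    """Step 602: recover R, t from the eight 2D-3D correspondences with PnP."""
    ok, rvec, tvec = cv2.solvePnP(keypoints_3d.astype(np.float64),
                                  keypoints_2d.astype(np.float64),
                                  camera_matrix, None,
                                  flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec                                    # 3x3 rotation, 3x1 translation
```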
Specific experimental example:
To demonstrate the effectiveness of the Transformer-based 6D pose estimation method presented in this example, the inventors trained and tested on the LINEMOD dataset, which contains fifteen target objects whose poses are to be computed.
The RGB training images of a target object and the corresponding ground-truth two-dimensional coordinates are input into the Transformer-based two-dimensional key point extraction network for training to obtain the corresponding trained model; a test image is then input into the trained model to predict the two-dimensional key points on the RGB image, the PnP algorithm is used to obtain the final predicted pose from the two-dimensional and three-dimensional key points, and the ADD evaluation metric is used to evaluate the accuracy of the predicted pose. As shown in Table 1 below, the Transformer-based 6D pose estimation method of this embodiment can compute the pose of the target object.
Table 1. ADD accuracy of the predicted poses on the LINEMOD dataset (the values are provided as an image in the original document).
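For reference, the ADD metric mentioned above is commonly computed as the mean distance between the object model points transformed by the predicted pose and by the ground-truth pose, with a pose usually counted as correct when this distance is below 10% of the object diameter; the sketch below follows that common definition and is not taken from the patent itself.

```python
import numpy as np

def add_metric(model_points, R_pred, t_pred, R_gt, t_gt):
    """Average distance of model points under predicted vs. ground-truth pose."""
    pred = model_points @ R_pred.T + t_pred.reshape(1, 3)
    gt = model_points @ R_gt.T + t_gt.reshape(1, 3)
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)))

def add_accuracy(errors, diameter, threshold=0.1):
    """Fraction of test images whose ADD error is below threshold * object diameter."""
    errors = np.asarray(errors)
    return float(np.mean(errors < threshold * diameter))
```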

Claims (7)

1. A Transformer-based 6D pose estimation method, characterized in that the method adopts a pose estimation network comprising a Transformer-based object two-dimensional key point feature extraction module, a key point structure modeling module and a pose inference module, and computes the 6D pose from the correspondence between two-dimensional key points and the three-dimensional key points of the target object using a PnP algorithm; wherein:
the Transformer-based object two-dimensional key point feature extraction module extracts two-dimensional key point features of the target object from an RGB image; for an input target object image, the two-dimensional key point feature extraction module outputs a group of key point feature vectors representing the eight key point features extracted from the image;
the key point structure modeling module comprises a self-attention layer and a multi-layer perceptron layer; the self-attention layer learns the structural relationships and context information of the predicted key point features, and the multi-layer perceptron layer maps the relation-modeled feature vectors to two-dimensional coordinate points on the image, also called two-dimensional key points;
the pose inference module computes the pose of the target object from the predicted two-dimensional key points of the image and the three-dimensional key points of the target object using the PnP algorithm, and outputs the pose as a rotation matrix and a translation vector.
2. The method according to claim 1, characterized in that it is carried out according to the following steps:
Step 1: acquire a number of two-dimensional image groups containing the targets to be recognized; the image groups cover fifteen target objects, and each target object has more than one thousand RGB images that differ only in acquisition angle, giving the training image set;
Step 2: data preprocessing: crop the images in the original data set so that the object appears completely in the image; the size of the cropped image is 256 × 256;
Step 3: input the training image set into the Transformer-based object two-dimensional key point feature extraction module to obtain the key point features in the image;
Step 4: input the obtained key point features into the key point structure modeling module; the key point structure modeling module exploits the geometric relationships of the key points of the same object under different viewing angles and uses a self-attention structure to let the eight key point features extracted for the same object interact, so as to extract two-dimensional key point features with geometric consistency for a single object; the features after interaction are fed into a multi-layer perceptron layer to obtain the two-dimensional key point coordinates;
Step 5: the pose inference module infers the final pose of the object from the extracted two-dimensional key point coordinates and the three-dimensional key points of the object model through the PnP algorithm.
3. The method according to claim 1, characterized in that the construction of the Transformer-based two-dimensional key point feature extraction module comprises the following steps:
Step 201: input a training image into the network;
Step 202: divide the target image into patches for serialization;
Step 203: add position embeddings to the image patches;
Step 204: predefine J learnable d-dimensional key point embedding vectors;
Step 205: feed the image patches and the d-dimensional key point embedding vectors as input into the Transformer encoder structure;
Step 206: the Transformer encoder structure outputs the two-dimensional key point features extracted after learning.
4. The method according to claim 1, characterized in that the construction of the key point structure modeling module comprises the following steps:
Step 301: a self-attention mechanism is applied in the key point structure modeling module to integrate the structural relationships between key points; the self-attention mechanism adds some learnable parameters, from which a series of attention weights is obtained at inference time to model the strength of the association between key points, so that the structural relationships of the object key points are learned; the key point features are fed into the self-attention module, and the similarity between key point features is computed to learn their associations: for the input key point feature sequence, a similarity is computed between the query and each key to obtain a weight;
Step 302: normalize the weights using a Softmax function;
Step 303: compute the weighted sum of the weights and the corresponding values to obtain the final key point features carrying the association relationships;
Step 304: feed the interacted key point features into a multi-layer perceptron layer to obtain the two-dimensional key point coordinates.
5. The method according to claim 3, characterized in that the construction of the Transformer-based key point feature extraction module further comprises image serialization and two-dimensional key point feature extraction, and the specific construction method comprises the following steps:
(1) serialize the input two-dimensional image: the picture $x \in \mathbb{R}^{H \times W \times C}$ is reshaped into a series of flattened 2D image blocks $x_{patch} \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the resolution of the input image, $C$ is the number of input channels, $(P, P)$ is the resolution of each image block, and $N = HW/P^2$ is the number of image blocks obtained; a trainable linear projection $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$ then maps $x_{patch}$ to $D$ dimensions, and the output of this projection is called the patch embedding;
(2) add position information to the patch embeddings, i.e. add a position code $P_{pos}$ to the sequence of blocks, and use the resulting sequence $Z_0$ as the image feature sequence:
$$Z_0 = [x_{patch}^1 E;\ x_{patch}^2 E;\ \dots;\ x_{patch}^N E] + P_{pos}, \qquad P_{pos} \in \mathbb{R}^{N \times D},$$
where $N = HW/P^2$ is the number of image blocks obtained;
(3) predefine $J$ learnable $d$-dimensional key point embedding vectors $\mathrm{keypoints}_J$; before training begins, the $J$ learnable $d$-dimensional key point embedding vectors $\mathrm{keypoints}_J$ are randomly initialized, where $J$ denotes the number of key points;
(4) feed the obtained image feature sequence $Z_0$ and the key point embedding vectors $\mathrm{keypoints}_J$ together as input into the Transformer encoder, so that the relationships between image blocks are learned and the image feature sequence $Z_0$ and the key point embedding vectors $\mathrm{keypoints}_J$ interact globally in the Transformer encoder; each Transformer encoder layer consists of a multi-head self-attention (MSA) block and an MLP block, Layer Norm (LN) is applied before each block and a residual connection is applied after each block, and the MLP contains one hidden layer with a GELU nonlinear activation; finally, the Transformer encoder outputs the key point features that have interacted with the image, and these key point features are sent to the key point structure modeling module.
6. The method according to claim 1, characterized in that the specific construction method of the key point structure modeling module is as follows:
(1) the key point features extracted by the Transformer-based key point feature extraction module are fed into a self-attention module, and the associations between the key point features are learned interactively; in the self-attention module, the attention mechanism is used to integrate the structural relationships between the key points of the input feature sequence; the self-attention mechanism adds some learnable parameters, from which a series of attention weights is obtained at inference time to model the strength of the association between key points, so that the structural relationships of the object key points are learned;
(2) in self-attention, each key point feature generates 3 different vectors, namely a Query vector (Q), a Key vector (K) and a Value vector (V), obtained by multiplying the embedding vector $X$ by three different weight matrices $W_Q$, $W_K$, $W_V$; a score is then computed for each key point vector by multiplying the Query vector (Q) with the Key vector (K), and a Softmax activation function is applied to the scores; the Softmax score determines the contribution of each key point to the encoding of the current position, the key point at that position obtaining the highest Softmax score; the result is then multiplied by the Value vector (V) to obtain the output vector;
(3) the output vectors are fed into a multi-layer perceptron for dense prediction, and the key point coordinates are regressed; the key point coordinate regression head is implemented by eight MLPs with independent parameters; the MLPs generate heatmaps, and a softmax function converts each heatmap into a probability distribution map, from which the key point coordinates are obtained;
(4) the distance between the predicted key point coordinates and the ground-truth key point coordinates is then computed, and the loss is defined as:
$$L = \frac{1}{N} \sum_{i=1}^{N} \mathrm{smooth}_{L1}\!\left(\hat{k}_i - k_i\right),$$
where $N$ is the number of key points, $\hat{k}_i$ are the predicted key point coordinates and $k_i$ are the ground-truth key point coordinates; $\mathrm{smooth}_{L1}$ is defined as follows:
$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise,} \end{cases}$$
where $|x|$ denotes the absolute distance between the predicted key point coordinates and the ground-truth key point coordinates.
7. The method according to claim 2, characterized in that in step 5 the pose inference module infers the final pose of the object from the extracted two-dimensional key point coordinates and the three-dimensional key points of the object model through the PnP algorithm by computing the rotation matrix and translation vector of the target object in the RGB image, and is carried out according to the following steps:
Step I: acquire a two-dimensional image containing the target to be recognized, obtaining the image to be recognized;
Step II: use the Transformer-based two-dimensional key point feature extraction module to obtain the two-dimensional key point features of the target to be recognized in the image to be recognized; the predicted two-dimensional key points are projections of the predefined three-dimensional key points of the object;
Step III: use the key point structure modeling module to let the key point features interact and learn, extract two-dimensional key point features with geometric consistency for the single object, and feed the interacted features into the multi-layer perceptron layer to obtain the set of two-dimensional key point coordinates in the RGB image; the two-dimensional key point set comprises Q two-dimensional key points, Q being a positive integer;
Step IV: compute the 6D pose from the correspondence between the two-dimensional key points and the three-dimensional key points of the target object using the PnP algorithm; the three-dimensional key points are eight three-dimensional coordinate points selected on the object model with the farthest point sampling (FPS) algorithm, and PnP can estimate the three-dimensional rotation and three-dimensional translation of the target object in camera coordinates using the correspondences between the eight two-dimensional key points and the three-dimensional key points.
CN202210759936.7A 2022-06-29 2022-06-29 6D attitude estimation method based on Transformer Pending CN115331301A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210759936.7A CN115331301A (en) 2022-06-29 2022-06-29 6D attitude estimation method based on Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210759936.7A CN115331301A (en) 2022-06-29 2022-06-29 6D attitude estimation method based on Transformer

Publications (1)

Publication Number Publication Date
CN115331301A 2022-11-11

Family

ID=83918023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210759936.7A Pending CN115331301A (en) 2022-06-29 2022-06-29 6D attitude estimation method based on Transformer

Country Status (1)

Country Link
CN (1) CN115331301A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117237451A (en) * 2023-09-15 2023-12-15 Nanjing University of Aeronautics and Astronautics Industrial part 6D pose estimation method based on contour reconstruction and geometric guidance
CN117237451B (en) * 2023-09-15 2024-04-02 Nanjing University of Aeronautics and Astronautics Industrial part 6D pose estimation method based on contour reconstruction and geometric guidance

Similar Documents

Publication Publication Date Title
CN110135375B (en) Multi-person attitude estimation method based on global information integration
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
WO2020108362A1 (en) Body posture detection method, apparatus and device, and storage medium
CN112766158A (en) Multi-task cascading type face shielding expression recognition method
CN115240121B (en) Joint modeling method and device for enhancing local features of pedestrians
CN114332639B (en) Satellite attitude vision measurement method of nonlinear residual error self-attention mechanism
CN111695523B (en) Double-flow convolutional neural network action recognition method based on skeleton space-time and dynamic information
CN112348033B (en) Collaborative saliency target detection method
CN116258757A (en) Monocular image depth estimation method based on multi-scale cross attention
WO2024114321A1 (en) Image data processing method and apparatus, computer device, computer-readable storage medium, and computer program product
CN114140831B (en) Human body posture estimation method and device, electronic equipment and storage medium
CN114638408A (en) Pedestrian trajectory prediction method based on spatiotemporal information
CN115331301A (en) 6D attitude estimation method based on Transformer
CN114913342A (en) Motion blurred image line segment detection method and system fusing event and image
CN117727022A (en) Three-dimensional point cloud target detection method based on transform sparse coding and decoding
CN116453025A (en) Volleyball match group behavior identification method integrating space-time information in frame-missing environment
Zhao et al. Adaptive Dual-Stream Sparse Transformer Network for Salient Object Detection in Optical Remote Sensing Images
CN114119999B (en) Iterative 6D pose estimation method and device based on deep learning
CN116703996A (en) Monocular three-dimensional target detection algorithm based on instance-level self-adaptive depth estimation
CN115496859A (en) Three-dimensional scene motion trend estimation method based on scattered point cloud cross attention learning
CN114187360B (en) Head pose estimation method based on deep learning and quaternion
CN116740795B (en) Expression recognition method, model and model training method based on attention mechanism
CN118172387A (en) Attention mechanism-based lightweight multi-target tracking method
CN117953561A (en) Space-time zone three-flow micro-expression recognition method based on transducer and saliency map
CN118072395A (en) Dynamic gesture recognition method combining multi-mode inter-frame motion and shared attention weight

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination