CN111462239B - Attitude encoder training and attitude estimation method and device - Google Patents

Attitude encoder training and attitude estimation method and device

Info

Publication number
CN111462239B
Authority
CN
China
Prior art keywords
target
feature
image
feature information
attitude
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010261228.1A
Other languages
Chinese (zh)
Other versions
CN111462239A (en)
Inventor
Ji Xiangyang (季向阳)
Li Zhigang (李志刚)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202010261228.1A
Publication of CN111462239A
Application granted
Publication of CN111462239B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to an attitude encoder training and attitude estimation method and device, wherein the attitude encoder comprises an encoding network and a decoding network, and the method comprises the following steps: performing target detection on sample images in a training set, and determining a first target in the sample images and a first image area of the first target; performing feature extraction on the first image area through the encoding network to obtain first feature information of the first target; processing the first feature information through the decoding network to determine a predicted feature map of the first target; and training the attitude encoder according to the predicted feature map and the annotated feature map of the first target. The embodiments of the disclosure train the attitude encoder with an annotated feature map that contains the attitude information of the target, which improves the accuracy of feature extraction by the encoding network in the attitude encoder and, in turn, the accuracy of attitude estimation.

Description

Attitude encoder training and attitude estimation method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for training an attitude encoder and estimating an attitude.
Background
Object posture estimation infers the posture information of an object relative to a camera from an image of the object, and plays an important role in fields such as robot manipulation, automatic driving, and augmented reality. In recent years, the autoencoder has emerged as a direct method for estimating object posture.
An autoencoder comprises an encoder and a decoder. During training, the encoder converts the input object image into a feature vector, and the decoder recovers the input object image from the feature vector as faithfully as possible. Because the training objective is only to reconstruct the input image, the posture information of the object does not directly participate in training the autoencoder. Owing to the common symmetry of objects, an object may present similar appearances in different postures, and an autoencoder whose goal is image reconstruction encodes images of objects with similar appearances but different postures into similar feature vectors. Feature confusion therefore occurs during posture estimation, and the accuracy of posture estimation is poor.
Disclosure of Invention
In view of this, the present disclosure provides a method and an apparatus for training an attitude encoder and estimating an attitude.
According to an aspect of the present disclosure, there is provided a method for training a gesture encoder based on three-dimensional coordinates, the gesture encoder including an encoding network and a decoding network, the method including:
performing target detection on sample images in a training set, and determining a first target in the sample images and a first image area of the first target, wherein the training set comprises a plurality of sample images and an annotation feature map of the first target in the sample images, and the annotation feature map is used for representing three-dimensional coordinates of a visible part of the first target in the sample images;
performing feature extraction on the first image area through the coding network to obtain first feature information of the first target;
processing the first feature information through the decoding network to determine a predicted feature map of the first target;
and training the attitude encoder according to the predicted feature map and the annotated feature map of the first target.
In one possible implementation, the predicted feature map includes three channels, which respectively represent three-dimensional coordinates of a visible portion of the first object in the first image region.
In one possible implementation, the method further includes:
processing the first feature information through the decoding network to determine a predicted image of the first target;
and training the attitude encoder according to the predicted feature map, the annotated feature map, the first image area, and the predicted image of the first target.
In one possible implementation, training the pose encoder according to the predicted feature map, the annotated feature map, the first image region, and the predicted image of the first target includes:
determining a first loss according to the difference between the predicted feature map and the labeled feature map of the first target;
determining a second loss from a difference between a first image region of the first target and a predicted image;
determining the network loss of the attitude encoder according to a preset weight, the first loss and the second loss;
and adjusting the network parameters of the attitude encoder according to the network loss.
In a possible implementation manner, the performing feature extraction on the first image region through the coding network to obtain first feature information of the first target includes:
adjusting the first image area according to a preset image size to obtain an adjusted first image area;
and performing feature extraction on the adjusted first image area through the coding network to obtain first feature information of the first target.
According to another aspect of the present disclosure, there is provided an attitude estimation method based on an attitude encoder, the method including:
carrying out target detection on an image to be detected, and determining a second target in the image to be detected and a second image area of the second target;
extracting the features of the second image area through the encoding network of the attitude encoder to obtain second feature information of the second target;
respectively determining similarity between the second feature information and each piece of third feature information in a preset feature posture library, wherein the feature posture library comprises a plurality of pieces of third feature information and a target posture corresponding to each piece of third feature information, and the target posture comprises a rotation angle of a target;
determining a target posture corresponding to third feature information having the highest similarity to the second feature information as posture information of the second target,
wherein, the coding network is obtained by training according to the training method.
In one possible implementation, the method further includes:
respectively extracting the features of a reference image of a preset reference target under a plurality of rotation angles through the coding network to obtain a plurality of third feature information of the reference target;
for any third feature information, determining a rotation angle corresponding to the third feature information as a target posture corresponding to the third feature information;
and determining a feature posture library according to the plurality of pieces of third feature information and the target postures corresponding to the plurality of pieces of third feature information.
According to another aspect of the present disclosure, there is provided a three-dimensional coordinate-based pose encoder training device, the pose encoder including an encoding network and a decoding network, the device including:
the first target detection module is used for performing target detection on sample images in a training set, and determining a first target in the sample images and a first image area of the first target, wherein the training set comprises a plurality of sample images and an annotated feature map of the first target in the sample images, and the annotated feature map is used for representing three-dimensional coordinates of a visible part of the first target in the sample images;
the first feature extraction module is used for performing feature extraction on the first image area through the coding network to obtain first feature information of the first target;
the feature map prediction module is used for processing the first feature information through the decoding network and determining a predicted feature map of the first target;
and the training module is used for training the attitude encoder according to the predicted feature map and the annotated feature map of the first target.
In one possible implementation, the predicted feature map includes three channels, which respectively represent three-dimensional coordinates of a visible portion of the first object in the first image region.
According to another aspect of the present disclosure, there is provided an attitude estimation apparatus based on an attitude encoder, the apparatus including:
the second target detection module is used for carrying out target detection on an image to be detected and determining a second target in the image to be detected and a second image area of the second target;
the second feature extraction module is used for performing feature extraction on the second image area through the coding network of the attitude encoder to obtain second feature information of the second target;
a similarity determining module, configured to determine a similarity between the second feature information and each piece of third feature information in a preset feature posture library, where the feature posture library includes a plurality of pieces of third feature information and a target posture corresponding to each piece of third feature information, and the target posture includes a rotation angle of a target;
a pose information determination module configured to determine the target posture corresponding to the third feature information having the highest similarity to the second feature information as the pose information of the second target,
wherein, the coding network is obtained by training according to the training device.
According to the embodiment of the disclosure, in the training process of the attitude encoder, an annotated feature map representing the three-dimensional coordinates of the visible part of the first target in the sample image is used as the training label: feature extraction is performed on the first image region of the first target in the sample image through the encoding network to obtain first feature information, the first feature information is processed through the decoding network to determine a predicted feature map of the first target, and the attitude encoder is trained according to the predicted feature map and the annotated feature map of the first target. In this way, the attitude information of the first target participates in the training process of the attitude encoder, so that when the encoding network in the trained attitude encoder is used to extract features, first targets with similar appearance but different attitudes can be accurately distinguished. This improves the accuracy of feature extraction by the encoding network in the attitude encoder and, in turn, the accuracy of attitude estimation using the attitude encoder.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 shows a flow diagram of a three-dimensional coordinate based pose encoder training method according to an embodiment of the present disclosure.
Fig. 2a shows a schematic diagram of an application scenario of a pose encoder based on image reconstruction.
Fig. 2b is a schematic diagram illustrating an application scenario of a three-dimensional coordinate-based attitude encoder training method according to an embodiment of the present disclosure.
FIG. 3 shows a flow diagram of a method for attitude estimation based on an attitude encoder according to an embodiment of the disclosure.
FIG. 4 shows a block diagram of a three-dimensional coordinate-based attitude encoder training apparatus according to an embodiment of the disclosure.
Fig. 5 shows a block diagram of an attitude estimation device based on an attitude encoder according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
The three-dimensional coordinate-based posture encoder training method according to the embodiment of the present disclosure may be applied to a processor, which may be a general-purpose processor, such as a CPU (Central Processing Unit), or an artificial intelligence processor (IPU) for performing artificial intelligence operations, such as a GPU (Graphics Processing Unit), an NPU (Neural-network Processing Unit), or a DSP (Digital Signal Processor). The present disclosure is not limited to a particular type of processor.
The attitude encoder in the embodiment of the disclosure comprises an encoding network and a decoding network, and can be used for estimating the attitude of an object. The encoding network can be used for extracting the characteristics of the input image, and the decoding network can be used for predicting the three-dimensional characteristic diagram of the visible part of the object in the input image according to the characteristic information extracted by the encoding network.
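As a concrete illustration, the following is a minimal sketch of such an encoding/decoding network pair in PyTorch. All layer sizes, the 128×128 crop size, and the 128-dimensional feature vector are illustrative assumptions; the patent does not fix a specific architecture.

```python
# Minimal sketch of the attitude encoder (assumed sizes, not the patented network).
import torch
import torch.nn as nn

class AttitudeEncoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        # Encoding network: compresses the object crop into first feature information.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 128 -> 64
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
            nn.Linear(128 * 16 * 16, latent_dim),
        )
        # Decoding network: predicts a three-channel feature map whose channels
        # hold the X, Y, Z coordinates of the visible part (the location field).
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128 * 16 * 16),
            nn.Unflatten(1, (128, 16, 16)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),       # 3 x 128 x 128
        )

    def forward(self, crop):                    # crop: (B, 3, 128, 128)
        feature = self.encoder(crop)            # first feature information
        location_field = self.decoder(feature)  # predicted feature map
        return feature, location_field
```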
FIG. 1 shows a flow diagram of a three-dimensional coordinate-based pose encoder training method according to an embodiment of the present disclosure. As shown in FIG. 1, the method includes:
Step S11, performing target detection on sample images in a training set, and determining a first target in the sample images and a first image area of the first target, wherein the training set comprises a plurality of sample images and an annotated feature map of the first target in the sample images, and the annotated feature map is used for representing three-dimensional coordinates of a visible part of the first target in the sample images;
Step S12, performing feature extraction on the first image area through the encoding network to obtain first feature information of the first target;
Step S13, processing the first feature information through the decoding network, and determining a predicted feature map of the first target;
Step S14, training the attitude encoder according to the predicted feature map and the annotated feature map of the first target.
According to the embodiment of the disclosure, in the training process of the attitude encoder, an annotated feature map representing the three-dimensional coordinates of the visible part of the first target in the sample image is used as the training label: feature extraction is performed on the first image region of the first target in the sample image through the encoding network to obtain first feature information, the first feature information is processed through the decoding network to determine a predicted feature map of the first target, and the attitude encoder is trained according to the predicted feature map and the annotated feature map of the first target. In this way, the attitude information of the first target participates in the training process of the attitude encoder, so that when the encoding network in the trained attitude encoder is used to extract features, first targets with similar appearance but different attitudes can be accurately distinguished. This improves the accuracy of feature extraction by the encoding network in the attitude encoder and, in turn, the accuracy of attitude estimation using the attitude encoder.
In one possible implementation, the training set may include a plurality of sample images and an annotated feature map of a first object in the plurality of sample images, where the annotated feature map may be used to represent three-dimensional coordinates of a visible portion of the first object in the sample images. The annotated feature map of the first object may represent pose information of the first object in the sample image.
In one possible implementation, the sample images may be images at different rotation angles of the first target. The first target may be any kind of specific object, such as a casting, a vehicle, a road sign, a street lamp, a toy, etc. Those skilled in the art can determine the specific first target and the sample image according to the need of pose estimation, which the present disclosure does not limit.
In one possible implementation, the annotated feature map of the first target may be represented as an annotated three-dimensional location field (3D Location Field) of the first target. The annotated three-dimensional location field represents the three-dimensional coordinates of the visible part of the first target in the sample image through a three-channel feature map, with the three channels corresponding to the X, Y, and Z coordinate axes respectively.
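For illustration, a hedged sketch of assembling such an annotated three-dimensional location field is given below. It assumes a renderer already supplies, per pixel, the model-space X, Y, Z coordinates of the visible surface; both inputs are assumptions made for the example, not steps prescribed by the disclosure.

```python
# Sketch: build a (3, H, W) annotated location field from assumed renderer outputs.
import numpy as np

def make_location_field(visible_mask, model_coords):
    """visible_mask: (H, W) bool mask of the first target's visible part.
    model_coords: (H, W, 3) per-pixel X, Y, Z model-space surface coordinates."""
    field = np.zeros_like(model_coords)
    field[visible_mask] = model_coords[visible_mask]  # zeros outside the target
    return field.transpose(2, 0, 1)  # channel order: X, Y, Z
```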
In a possible implementation manner, in step S11, target detection may be performed on sample images in the training set, and a first target and a first image area of the first target in the sample images are determined. The shape of the bounding box of the first image area may be a square or other preset shape, which is not limited by the present disclosure.
In one possible implementation, the sample image may be subjected to target detection by a target detection network. The sample image may be input into the target detection network for processing; the target detection network detects the first target in the sample image, determines position information of the first target, and then determines the first image area of the first target according to that position information. The target detection network may be chosen according to actual needs; for example, it may be an RCNN (Region-based Convolutional Neural Network), a Fast RCNN, or the like. The present disclosure is not limited to a particular type of target detection network.
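As one concrete (assumed) choice of detector, the sketch below uses the off-the-shelf Faster R-CNN from torchvision to realize step S11; the disclosure does not mandate this detector or this cropping policy.

```python
# Sketch of step S11 with an assumed off-the-shelf detector.
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def detect_first_image_region(sample_image):
    """sample_image: (3, H, W) float tensor in [0, 1]; assumes >= 1 detection."""
    with torch.no_grad():
        output = detector([sample_image])[0]
    # Boxes are returned sorted by score; take the top one as the first target.
    x0, y0, x1, y1 = output["boxes"][0].round().int().tolist()
    return sample_image[:, y0:y1, x0:x1]  # first image region of the first target
```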
In a possible implementation manner, after determining the first image region of the first target, in step S12, feature extraction may be performed on the first image region through a coding network to obtain first feature information of the first target. That is, the first image region of the first object may be input into the coding network for processing such as dimension reduction and feature extraction, so as to obtain the first feature information of the first object.
In a possible implementation manner, after the first feature information is obtained, in step S13, the decoding network may process the first feature information to determine the predicted feature map of the first target. The predicted feature map is represented in the same way as the annotated feature map.
In one possible implementation, the predicted feature map may include three channels, and the three channels may respectively represent three-dimensional coordinates of a visible portion of the first object in the first image region. The predicted feature map may be represented as a predicted three-dimensional location field of the first object.
In one possible implementation manner, in step S14, the attitude encoder may be trained according to the predicted feature map and the annotated feature map of the first target. The network loss can be determined according to the difference between the predicted feature map and the annotated feature map of the first target, and the network parameters of the attitude encoder are adjusted according to the network loss to train the attitude encoder.
In a possible implementation manner, when the attitude encoder meets a preset training end condition, the training ends and the trained attitude encoder is obtained. The preset training end condition may be set according to the actual situation: for example, the training may end when the network loss of the attitude encoder has decreased to a certain degree or converged within a certain threshold, or when the output of the attitude encoder on a verification set is as expected, or under another condition. The present disclosure does not limit the specific content of the training end condition.
In one possible implementation, step S12 may include: adjusting the first image area according to a preset image size to obtain an adjusted first image area; and performing feature extraction on the adjusted first image area through the coding network to obtain first feature information of the first target.
The preset image size may be set according to processing requirements, for example, in units of pixels, and the preset image size may be 256 × 256, 512 × 512, or the like. The present disclosure does not limit the specific values of the image size.
In a possible implementation manner, after the first image region of the first target is obtained, the first image region may be adjusted (i.e., zoomed) according to a preset image size, so as to obtain the adjusted first image region. And then inputting the adjusted first image area into a coding network for feature extraction to obtain first feature information of the first target.
In this embodiment, after the size of the first image region is adjusted to the preset image size, the feature extraction is performed on the adjusted first image region through the coding network to obtain the first feature information of the first target, so that the first image region input to the coding network has a fixed size, and the processing efficiency of the coding network is improved.
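A sketch of step S12 with the resizing step follows, reusing the AttitudeEncoder sketch above; the 128×128 size matches that sketch and is an assumption (the text gives 256×256 and 512×512 only as examples).

```python
# Sketch of resizing the first image region before feature extraction.
import torch.nn.functional as F

def extract_first_feature(crop, attitude_encoder, size=128):
    """crop: (3, H, W) first image region; returns the first feature information."""
    resized = F.interpolate(crop.unsqueeze(0), size=(size, size),
                            mode="bilinear", align_corners=False)
    feature, _ = attitude_encoder(resized)
    return feature
```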
In one possible implementation, the method may further include: processing the first feature information through the decoding network to determine a predicted image of the first target; and training the attitude encoder according to the predicted feature map, the annotated feature map, the first image area, and the predicted image of the first target.
In one possible implementation, the first feature information is processed by the decoding network to determine a predicted image of the first target, the predicted image corresponding to the first image region of the first target. That is, by processing the first feature information through the decoding network, both the predicted feature map and the predicted image of the first target can be obtained.
After the predicted feature map and the predicted image of the first target are obtained, the attitude encoder can be trained according to the predicted feature map, the annotated feature map, the first image area, and the predicted image of the first target. In this way, the training labels are enriched, which improves the training efficiency of the attitude encoder.
In one possible implementation, training the pose encoder according to the predicted feature map, the annotated feature map, the first image region, and the predicted image of the first target may include: determining a first loss according to the difference between the predicted feature map and the annotated feature map of the first target; determining a second loss according to the difference between the first image region of the first target and the predicted image; determining the network loss of the attitude encoder according to preset weights, the first loss, and the second loss; and adjusting the network parameters of the attitude encoder according to the network loss.
In one possible implementation, the first loss may be determined based on the difference between the predicted feature map and the annotated feature map of the first target. For example, the first loss may be determined based on the difference between the three-dimensional coordinates of each pixel in the predicted three-dimensional location field of the first target and the three-dimensional coordinates of the corresponding pixel in the annotated three-dimensional location field. The second loss may be determined based on the difference between the first image region of the first target and the predicted image.
In one possible implementation, the network loss of the attitude encoder may be determined according to the preset weights, the first loss, and the second loss. The preset weights may include a weight for the first loss and a weight for the second loss. The network parameters of the attitude encoder may then be adjusted based on the network loss. Specific values of the preset weights can be determined by those skilled in the art according to the practical situation, and the disclosure is not limited in this respect.
In one possible implementation, the network loss L of the attitude encoder can be determined by the following equation (1):

$L = \sum_i \left( \alpha \cdot \left\| N_i - N'_i \right\|_1 + \beta \cdot \mathrm{CE}\left( M_i, M'_i \right) \right)$  (1)

where N_i denotes the three-dimensional coordinates of the i-th pixel in the annotated feature map N of the first target, N'_i denotes the three-dimensional coordinates of the i-th pixel in the predicted feature map N' of the first target, M_i denotes the value of the i-th pixel in the first image area M of the first target, M'_i denotes the value of the i-th pixel in the predicted image M' of the first target, CE(·) denotes the cross-entropy loss, and α and β denote the preset weights.
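Read literally, equation (1) can be sketched as below. Treating CE(·) as a per-pixel binary cross-entropy on the reconstructed crop is our assumption about the unspecified reconstruction term; α and β are the preset weights.

```python
# Sketch of equation (1): L1 on the location field plus a cross-entropy
# reconstruction term (the CE interpretation is an assumption).
import torch
import torch.nn.functional as F

def attitude_encoder_loss(pred_field, gt_field, pred_image, gt_region,
                          alpha=1.0, beta=1.0):
    """pred_image: decoder logits; gt_region: crop pixels scaled to [0, 1]."""
    first_loss = F.l1_loss(pred_field, gt_field, reduction="sum")      # sum_i ||N_i - N'_i||_1
    second_loss = F.binary_cross_entropy_with_logits(pred_image, gt_region,
                                                     reduction="sum")  # sum_i CE(M_i, M'_i)
    return alpha * first_loss + beta * second_loss
```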
Fig. 2a shows a schematic diagram of an application scenario of a pose encoder based on image reconstruction. As shown in FIG. 2a, the first object 20 is symmetric: the first object 20 in the first image area 21 and the first object 20 in the first image area 24, although similar in appearance, have completely different poses, the latter being horizontally rotated by 180° compared to the former. Feature information 23 and feature information 25 of the first target 20 are obtained by extracting features from the first image region 21 and the first image region 24, respectively, using the encoding network of an image-reconstruction-based pose encoder. Since the appearance of the first object 20 in the first image area 21 is similar to that in the first image area 24, the extracted feature information 23 and feature information 25 are close to each other in the feature space 22; that is, feature information 23 and feature information 25 are highly similar.
Fig. 2b is a schematic diagram illustrating an application scenario of the three-dimensional coordinate-based attitude encoder training method according to an embodiment of the present disclosure. As shown in FIG. 2b, the attitude encoder training may use the annotated feature map 27 of the first target 20 in the first image region 21 and the annotated feature map 26 of the first target 20 in the first image region 24. The trained attitude encoder may be used to extract features from the first image region 21 and the first image region 24 to obtain feature information 23' and feature information 25' of the first target 20. Since the attitude encoder is trained using annotated feature maps that include the attitude information, the extracted feature information 23' and feature information 25' are far apart in the feature space 22; that is, feature information 23' and feature information 25' have low similarity.
Therefore, an encoder trained by the three-dimensional coordinate-based attitude encoder training method can accurately distinguish first targets with similar appearances but different postures during feature extraction, improving the accuracy of feature extraction.
FIG. 3 shows a flow diagram of a method for attitude estimation based on an attitude encoder according to an embodiment of the disclosure. As shown in FIG. 3, the method includes:
Step S31, performing target detection on an image to be detected, and determining a second target in the image to be detected and a second image area of the second target;
Step S32, performing feature extraction on the second image area through the encoding network of the attitude encoder to obtain second feature information of the second target;
Step S33, respectively determining the similarity between the second feature information and each piece of third feature information in a preset feature posture library, wherein the feature posture library comprises a plurality of pieces of third feature information and a target posture corresponding to each piece of third feature information, and the target posture comprises a rotation angle of a target;
Step S34, determining the target posture corresponding to the third feature information with the highest similarity to the second feature information as the posture information of the second target,
the encoding network is obtained by training through the three-dimensional coordinate-based posture encoder training method.
According to the embodiment of the disclosure, when performing pose estimation, a pose encoder trained by a pose encoder training method based on three-dimensional coordinates may be used, feature extraction may be performed on a second image region of a second target in an image to be detected through an encoding network of the pose encoder to obtain second feature information, then similarities between the second feature information and each third feature information in a preset feature pose library are respectively determined, and a target pose corresponding to a third feature information with the highest similarity to the second feature information is determined as pose information of the second target. The attitude encoder trained by the attitude encoder training method based on the three-dimensional coordinates is used for attitude estimation, so that the accuracy of attitude estimation can be improved.
In a possible implementation manner, the posture estimation method based on the posture encoder according to the embodiment of the disclosure may be applied to a processor, where the processor may be a general-purpose processor, such as a CPU (Central Processing Unit), or an artificial intelligence processor (IPU) for performing artificial intelligence operations, such as a GPU (Graphics Processing Unit), an NPU (Neural-network Processing Unit), or a DSP (Digital Signal Processor). The present disclosure is not limited to a particular type of processor.
In a possible implementation manner, in step S31, target detection may be performed on the image to be detected to determine the second target in the image to be detected and the second image region of the second target. The image to be detected may be subjected to target detection through a target detection network (e.g., an RCNN, a Fast RCNN, etc.); the present disclosure does not limit the specific type of the target detection network.
In a possible implementation manner, after the second image region of the second target is determined, in step S32, feature extraction may be performed on the second image region through the encoding network of the attitude encoder to obtain the second feature information of the second target. The second image region may be input into the encoding network of the attitude encoder for processing such as dimension reduction and feature extraction to obtain the second feature information of the second target.
In a possible implementation manner, before feature extraction is performed on the second image region, the size of the second image region may also be adjusted according to a preset image size, so that the image size of the second image region meets the input requirement of the coding network, and the processing efficiency of the coding network is improved.
In one possible implementation manner, in step S33, the similarity between the second feature information and each piece of third feature information in a preset feature posture library may be respectively determined. The feature posture library may include a plurality of pieces of third feature information and a target posture corresponding to each piece of third feature information, and the target posture may include a rotation angle of the target.
In a possible implementation manner, the similarity between the second feature information and each third feature information may be determined in various manners, such as an euclidean distance, a cosine similarity, and the like, which is not limited by the present disclosure.
In a possible implementation manner, after the similarity between the second feature information and each piece of third feature information in the preset feature posture library is determined, in step S34, the third feature information with the highest similarity to the second feature information is selected from the plurality of pieces of third feature information, and the target posture corresponding to that third feature information in the feature posture library is determined as the posture information of the second target.
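Steps S33 and S34 amount to a nearest-neighbour lookup; the sketch below uses cosine similarity, one of the measures the text allows (Euclidean distance would work the same way).

```python
# Sketch of steps S33-S34: match the second feature information against the library.
import torch
import torch.nn.functional as F

def lookup_target_pose(second_feature, library_features, library_poses):
    """second_feature: (D,); library_features: (N, D); library_poses: N rotations."""
    sims = F.cosine_similarity(second_feature.unsqueeze(0), library_features, dim=1)
    best = torch.argmax(sims).item()  # third feature information with highest similarity
    return library_poses[best]        # its target posture = pose of the second target
```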
In one possible implementation, the method further includes:
respectively extracting the features of a reference image of a preset reference target at a plurality of rotation angles through the coding network to obtain a plurality of third feature information of the reference target;
for any third feature information, determining a rotation angle corresponding to the third feature information as a target posture corresponding to the third feature information;
and determining a feature attitude library according to the plurality of third feature information and the target attitude corresponding to the plurality of third feature information.
The preset reference target may be a plurality of different objects, and those skilled in the art may set the reference target according to actual situations, which is not limited in the present disclosure.
In one possible implementation, when building the feature pose library, reference images of a preset reference target at a plurality of rotation angles may be determined first. For example, the rotation angle of the reference target may be uniformly sampled, a plurality of rotation angles of the reference target may be determined, and a reference image of the reference target at the plurality of rotation angles may be determined.
The features of reference images of the reference target at the plurality of rotation angles are then extracted respectively through the encoding network of the trained attitude encoder; that is, the reference images of the reference target at the plurality of rotation angles are respectively input into the encoding network for feature extraction to obtain a plurality of pieces of third feature information of the reference target. For any piece of third feature information, the rotation angle corresponding to it may be determined as the target posture corresponding to that third feature information.
In one possible implementation manner, the feature posture library may be established according to a plurality of third feature information and target postures corresponding to the plurality of third feature information.
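A hedged sketch of building the feature posture library follows; render_reference is a hypothetical helper standing in for whatever produces the reference images at each sampled rotation angle, and 360 uniformly sampled angles is an assumed density.

```python
# Sketch of building the feature posture library (render_reference is hypothetical).
import torch

def build_feature_pose_library(attitude_encoder, render_reference, num_angles=360):
    angles = torch.linspace(0.0, 360.0, steps=num_angles + 1)[:-1]  # uniform sampling
    features, poses = [], []
    with torch.no_grad():
        for angle in angles:
            image = render_reference(angle.item())  # (1, 3, 128, 128) reference crop
            feature, _ = attitude_encoder(image)
            features.append(feature.squeeze(0))
            poses.append(angle.item())              # target posture for this entry
    return torch.stack(features), poses
```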
In this embodiment, the encoding network of an attitude encoder trained by the three-dimensional coordinate-based attitude encoder training method is used to perform feature extraction on the reference images to obtain the third feature information, and the feature posture library is established according to the plurality of pieces of third feature information and their corresponding target postures. This improves the accuracy of the feature posture library and, in turn, the accuracy of posture estimation.
It should be noted that, although the above embodiments have been described as examples of the three-dimensional coordinate-based posture encoder training method and the posture encoder-based posture estimation method, those skilled in the art will understand that the present disclosure should not be limited thereto. In fact, the user can flexibly set each step according to personal preference and/or actual application scene, as long as the technical scheme of the present disclosure is met.
FIG. 4 shows a block diagram of a three-dimensional coordinate-based attitude encoder training apparatus according to an embodiment of the disclosure. As shown in FIG. 4, the apparatus includes:
a first target detection module 41, configured to perform target detection on a sample image in a training set, and determine a first target in the sample image and a first image area of the first target, where the training set includes a plurality of sample images and an annotated feature map of the first target in the plurality of sample images, and the annotated feature map is used to represent three-dimensional coordinates of a visible portion of the first target in the sample image;
a first feature extraction module 42, configured to perform feature extraction on the first image region through the coding network to obtain first feature information of the first target;
a feature map prediction module 43, configured to process the first feature information through the decoding network, and determine a predicted feature map of the first target;
and a training module 44, configured to train the attitude encoder according to the predicted feature map and the annotated feature map of the first target.
In one possible implementation, the predicted feature map includes three channels that respectively represent three-dimensional coordinates of a visible portion of the first object in the first image region.
According to the embodiment of the disclosure, in the training process of the attitude encoder, the annotated feature map representing the three-dimensional coordinates of the visible part of the first target in the sample image is used as the training label, so that the attitude information of the first target can participate in the training process of the attitude encoder. When the encoding network in the trained attitude encoder is used to extract features, first targets with similar appearance but different attitudes can be accurately distinguished, which improves the accuracy of feature extraction by the encoding network in the attitude encoder.
Fig. 5 shows a block diagram of an attitude estimation device based on an attitude encoder according to an embodiment of the present disclosure. As shown in fig. 5, the apparatus includes:
the second target detection module 51 is configured to perform target detection on an image to be detected, and determine a second target in the image to be detected and a second image area of the second target;
a second feature extraction module 52, configured to perform feature extraction on the second image region through the encoding network of the attitude encoder to obtain second feature information of the second target;
a similarity determining module 53, configured to determine similarities between the second feature information and each piece of third feature information in a preset feature posture library, where the feature posture library includes a plurality of pieces of third feature information and a target posture corresponding to each piece of third feature information, and the target posture includes a rotation angle of the target;
a pose information determination module 54, configured to determine the target posture corresponding to the third feature information with the highest similarity to the second feature information as the pose information of the second target,
the encoding network is obtained by training according to the attitude encoder training device based on the three-dimensional coordinates.
According to the embodiment of the disclosure, when performing pose estimation, a pose encoder trained by a pose encoder training method based on three-dimensional coordinates may be used, feature extraction may be performed on a second image region of a second target in an image to be detected through an encoding network of the pose encoder to obtain second feature information, then similarities between the second feature information and each third feature information in a preset feature pose library are respectively determined, and a target pose corresponding to a third feature information with the highest similarity to the second feature information is determined as pose information of the second target. The attitude encoder trained by the attitude encoder training method based on the three-dimensional coordinates is used for attitude estimation, so that the accuracy of attitude estimation can be improved.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A three-dimensional coordinate-based attitude encoder training method is characterized in that an attitude encoder comprises an encoding network and a decoding network,
the method comprises the following steps:
performing target detection on sample images in a training set, and determining a first target in the sample images and a first image area of the first target, wherein the training set comprises a plurality of sample images and an annotation feature map of the first target in the sample images, the annotation feature map is used for representing three-dimensional coordinates of a visible part of the first target in the sample images, and the annotation feature map is represented as an annotated three-dimensional position field of the first target;
performing feature extraction on the first image area through the coding network to obtain first feature information of the first target;
processing the first feature information through the decoding network to determine a predicted feature map of the first target, wherein the predicted feature map is represented as a predicted three-dimensional position field of the first target;
and training the attitude encoder according to the predicted feature map and the annotated feature map of the first target.
2. The method of claim 1, wherein the predicted feature map comprises three channels that respectively represent three-dimensional coordinates of a visible portion of the first object in the first image region.
3. The method of claim 1, further comprising:
processing the first feature information through the decoding network to determine a predicted image of the first target;
and training the attitude encoder according to the predicted feature map, the annotated feature map, the first image area, and the predicted image of the first target.
4. The method of claim 3, wherein training the pose encoder based on the predicted feature map, the annotated feature map, the first image region, and the predicted image of the first target comprises:
determining a first loss according to the difference between the predicted feature map and the annotated feature map of the first target;
determining a second loss from a difference between a first image region of the first target and a predicted image;
determining the network loss of the attitude encoder according to a preset weight, the first loss and the second loss;
and adjusting the network parameters of the attitude encoder according to the network loss.
5. The method according to claim 1, wherein performing feature extraction on the first image region through the coding network to obtain first feature information of the first target comprises:
adjusting the first image area according to a preset image size to obtain an adjusted first image area;
and performing feature extraction on the adjusted first image area through the coding network to obtain first feature information of the first target.
6. An attitude estimation method based on an attitude encoder, the method comprising:
carrying out target detection on an image to be detected, and determining a second target in the image to be detected and a second image area of the second target;
extracting the features of the second image area through the encoding network of the attitude encoder to obtain second feature information of the second target;
respectively determining similarity between the second feature information and each piece of third feature information in a preset feature posture library, wherein the feature posture library comprises a plurality of pieces of third feature information and a target posture corresponding to each piece of third feature information, and the target posture comprises a rotation angle of a target;
determining a target posture corresponding to third feature information having the highest similarity to the second feature information as posture information of the second target,
wherein the coding network is trained according to the method of any one of claims 1 to 5.
7. The method of claim 6, further comprising:
respectively extracting the features of a reference image of a preset reference target at a plurality of rotation angles through the coding network to obtain a plurality of third feature information of the reference target;
for any third feature information, determining a rotation angle corresponding to the third feature information as a target posture corresponding to the third feature information;
and determining a feature posture library according to the plurality of pieces of third feature information and the target postures corresponding to the plurality of pieces of third feature information.
8. A posture encoder training device based on three-dimensional coordinates is characterized in that a posture encoder comprises an encoding network and a decoding network,
the device comprises:
the first target detection module is used for performing target detection on sample images in a training set, and determining a first target in the sample images and a first image area of the first target, wherein the training set comprises a plurality of sample images and an annotation feature map of the first target in the sample images, the annotation feature map is used for representing three-dimensional coordinates of a visible part of the first target in the sample images, and the annotation feature map is represented as an annotated three-dimensional position field of the first target;
the first feature extraction module is used for performing feature extraction on the first image area through the coding network to obtain first feature information of the first target;
a feature map prediction module, configured to process the first feature information through the decoding network, and determine a predicted feature map of the first target, where the predicted feature map is represented as a predicted three-dimensional position field of the first target;
and the training module is used for training the attitude encoder according to the predicted feature map and the annotated feature map of the first target.
9. The apparatus of claim 8, wherein the predicted feature map comprises three channels, each representing three-dimensional coordinates of a visible portion of the first object in the first image region.
10. An attitude estimation apparatus based on an attitude encoder, the apparatus comprising:
the second target detection module is used for carrying out target detection on an image to be detected and determining a second target in the image to be detected and a second image area of the second target;
the second feature extraction module is used for performing feature extraction on the second image area through the coding network of the attitude encoder to obtain second feature information of the second target;
a similarity determining module, configured to determine a similarity between the second feature information and each piece of third feature information in a preset feature posture library, where the feature posture library includes a plurality of pieces of third feature information and a target posture corresponding to each piece of third feature information, and the target posture includes a rotation angle of a target;
a pose information determination module configured to determine pose information corresponding to third feature information having the highest similarity to the second feature information as pose information of the second target,
wherein the coding network is trained according to the apparatus of claim 8 or 9.
CN202010261228.1A 2020-04-03 2020-04-03 Attitude encoder training and attitude estimation method and device Active CN111462239B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010261228.1A CN111462239B (en) 2020-04-03 2020-04-03 Attitude encoder training and attitude estimation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010261228.1A CN111462239B (en) 2020-04-03 2020-04-03 Attitude encoder training and attitude estimation method and device

Publications (2)

Publication Number Publication Date
CN111462239A CN111462239A (en) 2020-07-28
CN111462239B true CN111462239B (en) 2023-04-14

Family

ID=71680544

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010261228.1A Active CN111462239B (en) 2020-04-03 2020-04-03 Attitude encoder training and attitude estimation method and device

Country Status (1)

Country Link
CN (1) CN111462239B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330439B * 2017-07-14 2022-11-04 Tencent Technology (Shenzhen) Co., Ltd. Method for determining posture of object in image, client and server

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220594A (en) * 2017-05-08 2017-09-29 Guilin University of Electronic Technology A face pose reconstruction and recognition method based on a similarity-preserving stacked autoencoder
CN109215080A (en) * 2018-09-25 2019-01-15 Tsinghua University 6D pose estimation network training method and device based on deep-learning iterative matching
CN110533721A (en) * 2019-08-27 2019-12-03 Hangzhou Normal University An indoor object 6D pose estimation method based on an augmented autoencoder
CN110503689A (en) * 2019-08-30 2019-11-26 Tsinghua University Attitude prediction method, model training method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Augmented Autoencoders: Implicit 3D Orientation Learning for 6D Object Detection; Martin Sundermeyer et al.; arXiv:1902.01275v2; 2019-07-17; Introduction, Section 3 *
CPS: Class-level 6D Pose and Shape Estimation From Monocular Images; Fabian Manhardt et al.; arXiv:2003.05848v2; 2020-03-13; entire document *
Research on Depth-Map Pose Estimation Algorithms Based on Convolutional Neural Networks; Wang Song et al.; Journal of System Simulation; 2017-11-08 (No. 11); entire document *

Also Published As

Publication number Publication date
CN111462239A (en) 2020-07-28

Similar Documents

Publication Publication Date Title
US20200279121A1 (en) Method and system for determining at least one property related to at least part of a real environment
US20220114750A1 (en) Map constructing method, positioning method and wireless communication terminal
CN107330439B (en) Method for determining posture of object in image, client and server
WO2020108362A1 (en) Body posture detection method, apparatus and device, and storage medium
US10460511B2 (en) Method and system for creating a virtual 3D model
CN108038420B (en) Human behavior recognition method based on depth video
JP2019087229A (en) Information processing device, control method of information processing device and program
CN110232330B (en) Pedestrian re-identification method based on video detection
Delmerico et al. Building facade detection, segmentation, and parameter estimation for mobile robot localization and guidance
GB2520338A (en) Automatic scene parsing
CN112435338B (en) Method and device for acquiring position of interest point of electronic map and electronic equipment
CN114863573B (en) Category-level 6D attitude estimation method based on monocular RGB-D image
Pascoe et al. Robust direct visual localisation using normalised information distance.
JP6265499B2 (en) Feature amount extraction device and location estimation device
CN113807361B (en) Neural network, target detection method, neural network training method and related products
JPWO2019021569A1 (en) Information processing apparatus, information processing method, and program
WO2017145711A1 (en) Feature amount extraction method and feature amount extraction device
EP2960859A1 (en) Constructing a 3d structure
CN110599522A (en) Method for detecting and removing dynamic target in video sequence
KR101478709B1 (en) Method and apparatus for extracting and generating feature point and feature descriptor rgb-d image
CN111462239B (en) Attitude encoder training and attitude estimation method and device
Kallasi et al. Object detection and pose estimation algorithms for underwater manipulation
Wang et al. Image-similarity-based convolutional neural network for robot visual relocalization
Daware et al. Morphological based dynamic hand gesture recognition for Indian sign language
Boonarchatong et al. Human Tracking System from Video Based on Blob Analysis and Object Segmentation Method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant