WO2023273272A1 - Target pose estimation method and apparatus, computing device, storage medium, and computer program - Google Patents


Info

Publication number
WO2023273272A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
model
detection area
rgb image
pose estimation
Prior art date
Application number
PCT/CN2021/143442
Other languages
French (fr)
Chinese (zh)
Inventor
杨佳丽
杜国光
赵开勇
Original Assignee
达闼科技(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 达闼科技(北京)有限公司
Publication of WO2023273272A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30244 Camera pose

Definitions

  • Embodiments of the present disclosure relate to the technical field of computer vision, and specifically relate to a target pose estimation method, device, computing device, storage medium, and computer program.
  • Robot grasping has great application value in both industrial and home scenarios.
  • Pose estimation of the object to be grasped is an important factor in whether grasping succeeds.
  • Existing pose estimation methods are generally divided into feature matching methods, template methods, and deep learning-based methods.
  • The feature matching method usually computes and matches feature points between the 3D model and the 2D image, and then uses a PnP method to compute the pose.
  • The template method models the 3D model of the object to be grasped from various viewpoints and estimates the pose by matching captured images against the templates.
  • Deep-learning-based methods usually first collect a large number of color and depth images of the object to be grasped in various poses to create a dataset, and then train a convolutional neural network to directly or indirectly estimate the pose of the object to be grasped.
  • However, current algorithms still fall short when grasping real objects.
  • The feature matching method often requires heavy computation, so the algorithm runs slowly. Moreover, the success of feature point selection and matching directly affects the accuracy of pose estimation, and for objects with few feature points the algorithm often cannot obtain accurate and stable results.
  • Template matching methods often require producing a large number of templates; since pose estimation is essentially a regression problem, the accuracy of the algorithm is roughly proportional to the number of templates, making it hard to strike a balance.
  • Deep-learning-based methods regress the object pose directly through a convolutional neural network; most existing deep learning methods are instance-level and generalize poorly.
  • In view of these problems, embodiments of the present disclosure provide a target pose estimation method, device, computing device, and storage medium that overcome, or at least partially solve, the above problems.
  • A target pose estimation method includes: performing 2D detection according to an RGB image and a depth image to obtain a detection area of the target; acquiring a normalized model of the target from the RGB image in the detection area; acquiring size information of the target according to the depth image in the detection area; and fusing the size information with the normalized model to obtain a 3D model and applying a PnP algorithm according to the 3D model to obtain the pose information of the target.
  • Performing 2D detection according to the RGB image and the depth image to obtain the detection area of the target includes: applying a pre-built first convolutional neural network to process the RGB image to obtain the detection area of the target in the RGB image; and acquiring, in the depth image, the detection area corresponding to the same target as in the RGB image.
  • Acquiring the normalized model of the target from the RGB image in the detection area includes: applying a first network structure to process the RGB image in the detection area to obtain the normalized model of the target.
  • Applying the first network structure to process the RGB image in the detection area to obtain the normalized model of the target includes: applying multiple groups of convolution + convolution + downsampling to downsample the RGB image in the detection area and performing a convolution operation on the lowest-resolution feature map; and applying multiple groups of upsampling + convolution + convolution to restore the resolution of the RGB image in the detection area to its original size, performing a preset number of convolution operations to obtain the normalized model of the target.
  • Acquiring the size information of the target according to the depth image in the detection area includes: converting the depth image in the detection area into a point cloud, and applying a second network structure to process the point cloud to obtain the size information of the target.
  • Fusing the size information with the normalized model to obtain the 3D model includes: calculating the 3D model from the size information and the normalized model using the relations x′ = x × w, y′ = y × l, z′ = z × h, where (x, y, z) are the coordinates of the normalized model, (x′, y′, z′) are the coordinates of the 3D model, and (w, l, h) is the size information of the target, with w, l, h denoting the width, length, and height of the target, respectively.
  • Applying the PnP algorithm according to the 3D model to obtain the pose information of the target includes: applying the PnP algorithm to match the coordinates of the 3D model with the 2D image to obtain the pose information of the target.
  • A target pose estimation device includes: a 2D detection unit configured to perform 2D detection according to an RGB image and a depth image to obtain a detection area of a target; a normalization unit configured to acquire a normalized model of the target from the RGB image in the detection area; a size acquisition unit configured to acquire size information of the target according to the depth image in the detection area; and a pose estimation unit configured to fuse the size information with the normalized model to obtain a 3D model and apply a PnP algorithm according to the 3D model to obtain the pose information of the target.
  • The 2D detection unit is configured to: apply a pre-built first convolutional neural network to process the RGB image to obtain the detection area of the target in the RGB image; and acquire, in the depth image, the detection area corresponding to the same target as in the RGB image.
  • The normalization unit is configured to: apply a first network structure to process the RGB image in the detection area to obtain the normalized model of the target.
  • The normalization unit is further configured to: apply multiple groups of convolution + convolution + downsampling to downsample the RGB image in the detection area and perform a convolution operation on the lowest-resolution feature map; and apply multiple groups of upsampling + convolution + convolution to restore the resolution of the RGB image in the detection area to its original size, performing a preset number of convolution operations to obtain the normalized model of the target.
  • The size acquisition unit is configured to: convert the depth image in the detection area into a point cloud, and apply a second network structure to process the point cloud to obtain the size information of the target.
  • The pose estimation unit is configured to: calculate the 3D model from the size information and the normalized model using the relations x′ = x × w, y′ = y × l, z′ = z × h, where (x, y, z) are the coordinates of the normalized model, (x′, y′, z′) are the coordinates of the 3D model, and (w, l, h) is the size information of the target, with w, l, h denoting the width, length, and height of the target, respectively.
  • The pose estimation unit is further configured to: apply a PnP algorithm to match the coordinates of the 3D model with the 2D image to obtain the pose information of the target.
  • A computing device includes: a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with each other through the communication bus.
  • The memory is used to store at least one executable instruction, and the executable instruction causes the processor to execute the steps of the above target pose estimation method.
  • A computer storage medium stores at least one executable instruction, and the executable instruction causes a processor to execute the steps of the above target pose estimation method.
  • A computer program includes instructions which, when run on a computer, cause the computer to execute the above target pose estimation method.
  • The target pose estimation method in the embodiments of the present disclosure includes: performing 2D detection according to the RGB image and the depth image to obtain the detection area of the target; acquiring the normalized model of the target from the RGB image in the detection area; acquiring the size information of the target according to the depth image in the detection area; and fusing the size information with the normalized model to obtain a 3D model and applying a PnP algorithm according to the 3D model to obtain the pose information of the target. In this way, the pose information of the target object can be accurately obtained, which facilitates grasping the target object and improves the user experience.
  • FIG. 1 shows a schematic flow diagram of a target pose estimation method provided by an embodiment of the present disclosure
  • FIG. 2 shows a schematic diagram of the first convolutional neural network in the target pose estimation method provided by an embodiment of the present disclosure
  • FIG. 3 shows a schematic diagram of the first network structure in the target pose estimation method provided by an embodiment of the present disclosure
  • FIG. 4 shows a schematic diagram of acquiring size information in the target pose estimation method provided by an embodiment of the present disclosure
  • FIG. 5 shows a schematic structural diagram of a target pose estimation device provided by an embodiment of the present disclosure;
  • FIG. 6 shows a schematic structural diagram of a computing device provided by an embodiment of the present disclosure.
  • Fig. 1 shows a schematic flowchart of a method for estimating a target pose provided by an embodiment of the present disclosure. As shown in Fig. 1, the method includes:
  • Step S11: Perform 2D detection according to the RGB image and the depth image to obtain the detection area of the target.
  • The RGB image is processed by applying a pre-built first convolutional neural network to obtain the detection area of the target in the RGB image, and the detection area corresponding to the same target as in the RGB image is acquired in the depth image.
  • The first convolutional neural network is not limited to a specific detection or segmentation method; the goal is to obtain the specific area of the target (the object to be grasped) in the image and reduce background interference for subsequent pose estimation.
  • Before the first convolutional neural network is applied, it needs to be constructed.
  • First, a dataset is built: RGB images of the object to be grasped are collected under different environmental backgrounds, and each RGB image is annotated with the best-fitting bounding box (x, y, w, h) and object category id. Then the large set of RGB image data is used to train a convolutional neural network (CNN), yielding the first convolutional neural network model.
  • The network structure of the first convolutional neural network is shown in Figure 2; the network has 31 layers, and the image block is scaled to a 448x448-pixel block as the network input.
  • Step S12: Acquire the normalized model of the target from the RGB image in the detection area.
  • A first network structure is applied to process the RGB image in the detection area to obtain the normalized model of the target.
  • The specific structure of the first network is shown in Figure 3: multiple groups of convolution + convolution + downsampling downsample the RGB image in the detection area, a convolution operation is performed on the lowest-resolution feature map, multiple groups of upsampling + convolution + convolution restore the resolution of the RGB image in the detection area to its original size, and a preset number of convolution operations yields the normalized model of the target.
  • Preferably, the normalized model of the target is output after two consecutive final convolution operations.
  • On the basis of 2D detection, the embodiments of the present disclosure use a U-net network structure to regress the normalized map, which can greatly improve the accuracy of the algorithm.
  • Step S13: Acquire the size information of the target according to the depth image in the detection area.
  • The depth image in the detection area is converted into a point cloud using the pinhole camera model:
  • X = (x′ - c_x) × D / f_x, Y = (y′ - c_y) × D / f_y, Z = D,
  • where (X, Y, Z) are the point cloud coordinates, (x′, y′) are the image coordinates, D is the depth value, f_x and f_y are the focal lengths, and c_x and c_y are the principal point offsets.
  • The second network structure preferably consists of a PointNet++ network followed by a convolutional layer and a fully connected layer.
  • Through the PointNet++ network, the size of the object can be regressed, represented by S(w, l, h); the subsequent convolutional layer and fully connected layer recover the size information of the object.
  • Step S14: Fuse the size information with the normalized model to obtain a 3D model, and apply a PnP algorithm according to the 3D model to obtain the pose information of the target.
  • the complete 3D information of the target can be obtained by fusing the normalized model and the object size information.
  • The 3D model is calculated from the size information and the normalized model using the relations x′ = x × w, y′ = y × l, z′ = z × h, where (x, y, z) are the coordinates of the normalized model, (x′, y′, z′) are the coordinates of the 3D model, and (w, l, h) is the size information of the target, with w, l, h denoting the width, length, and height of the target, respectively.
  • the PnP algorithm may be any existing PnP algorithm capable of realizing the above-mentioned functions, which will not be repeated here.
  • Size information is recovered from the depth map through PointNet++, which adds prior information for the post-processing algorithm and yields higher-precision results.
  • The embodiments of the present disclosure use the RGB image to obtain the object category, segmentation result, and normalized model through convolutional neural networks; the depth image and the segmentation result yield the object size information; the size information and the normalized model are fused into a 3D model; and finally the pose information is obtained through PnP.
  • T(x, y, z) is used to represent position information in three-dimensional space
  • a rotation matrix R is used to represent three-axis rotation in three-dimensional space.
  • Using the normalized model solves the problems that objects of the same category vary in size and that camera scaling prevents recovering the exact size of the object; combined with the size recovered from the depth map, it addresses the instance-level limitation of current deep learning methods.
  • The target pose estimation method in the embodiments of the present disclosure includes: performing 2D detection according to the RGB image and the depth image to obtain the detection area of the target; acquiring the normalized model of the target from the RGB image in the detection area; acquiring the size information of the target according to the depth image in the detection area; and fusing the size information with the normalized model to obtain a 3D model and applying a PnP algorithm according to the 3D model to obtain the pose information of the target. In this way, the pose information of the target object can be accurately obtained, which facilitates grasping the target object and improves the user experience.
  • Fig. 5 shows a schematic structural diagram of a target pose estimation device according to an embodiment of the present disclosure. As shown in Fig. 5, the device includes: a 2D detection unit 501, a normalization unit 502, a size acquisition unit 503, and a pose estimation unit 504.
  • the 2D detection unit 501 is used to perform 2D detection according to the RGB image and the depth image to obtain the detection area of the target;
  • the normalization unit 502 is used to obtain the normalized model of the target from the RGB image in the detection area;
  • the size acquisition unit 503 is configured to acquire size information of the target according to the depth image in the detection area;
  • The pose estimation unit 504 is configured to fuse the size information with the normalized model to obtain a 3D model, and apply a PnP algorithm according to the 3D model to obtain the pose information of the target.
  • The 2D detection unit 501 is configured to: apply a pre-built first convolutional neural network to process the RGB image to obtain the detection area of the target in the RGB image; and acquire, in the depth image, the detection area corresponding to the same target as in the RGB image.
  • The normalization unit 502 is configured to: apply a first network structure to process the RGB image in the detection area to obtain the normalized model of the target.
  • The normalization unit 502 is further configured to: apply multiple groups of convolution + convolution + downsampling to downsample the RGB image in the detection area and perform a convolution operation on the lowest-resolution feature map; and apply multiple groups of upsampling + convolution + convolution to restore the resolution of the RGB image in the detection area to its original size, performing a preset number of convolution operations to obtain the normalized model of the target.
  • The size acquisition unit 503 is configured to: convert the depth image in the detection area into a point cloud, and apply a second network structure to process the point cloud to obtain the size information of the target.
  • The pose estimation unit 504 is configured to: calculate the 3D model from the size information and the normalized model using the relations x′ = x × w, y′ = y × l, z′ = z × h, where (x, y, z) are the coordinates of the normalized model, (x′, y′, z′) are the coordinates of the 3D model, and (w, l, h) is the size information of the target, with w, l, h denoting the width, length, and height of the target, respectively.
  • the pose estimation unit 504 is configured to: apply a PnP algorithm to match the coordinates of the 3D model with the 2D image, and acquire pose information of the target.
  • The target pose estimation method in the embodiments of the present disclosure includes: performing 2D detection according to the RGB image and the depth image to obtain the detection area of the target; acquiring the normalized model of the target from the RGB image in the detection area; acquiring the size information of the target according to the depth image in the detection area; and fusing the size information with the normalized model to obtain a 3D model and applying a PnP algorithm according to the 3D model to obtain the pose information of the target. In this way, the pose information of the target object can be accurately obtained, which facilitates grasping the target object and improves the user experience.
  • An embodiment of the present disclosure provides a non-volatile computer storage medium, where at least one executable instruction is stored in the computer storage medium, and the computer executable instruction can execute the target pose estimation method in any method embodiment above.
  • The executable instruction can specifically be used to cause the processor to perform the following operations: performing 2D detection according to an RGB image and a depth image to obtain a detection area of a target; acquiring a normalized model of the target from the RGB image in the detection area; acquiring size information of the target according to the depth image in the detection area; and fusing the size information with the normalized model to obtain a 3D model and applying a PnP algorithm according to the 3D model to obtain the pose information of the target.
  • In an optional manner, the executable instructions cause the processor to: apply a pre-built first convolutional neural network to process the RGB image to obtain the detection area of the target in the RGB image; and acquire, in the depth image, the detection area corresponding to the same target as in the RGB image.
  • In an optional manner, the executable instructions cause the processor to: apply a first network structure to process the RGB image in the detection area to obtain the normalized model of the target.
  • In an optional manner, the executable instructions cause the processor to: apply multiple groups of convolution + convolution + downsampling to downsample the RGB image in the detection area and perform a convolution operation on the lowest-resolution feature map; and apply multiple groups of upsampling + convolution + convolution to restore the resolution of the RGB image in the detection area to its original size, performing a preset number of convolution operations to obtain the normalized model of the target.
  • In an optional manner, the executable instructions cause the processor to: convert the depth image in the detection area into a point cloud, and apply a second network structure to process the point cloud to obtain the size information of the target.
  • In an optional manner, the executable instructions cause the processor to: calculate the 3D model from the size information and the normalized model using x′ = x × w, y′ = y × l, z′ = z × h, where (x, y, z) are the coordinates of the normalized model, (x′, y′, z′) are the coordinates of the 3D model, and (w, l, h) is the size information of the target, with w, l, h denoting the width, length, and height of the target, respectively.
  • In an optional manner, the executable instructions cause the processor to: apply a PnP algorithm to match the coordinates of the 3D model with the 2D image to obtain the pose information of the target.
  • The target pose estimation method in the embodiments of the present disclosure includes: performing 2D detection according to the RGB image and the depth image to obtain the detection area of the target; acquiring the normalized model of the target from the RGB image in the detection area; acquiring the size information of the target according to the depth image in the detection area; and fusing the size information with the normalized model to obtain a 3D model and applying a PnP algorithm according to the 3D model to obtain the pose information of the target. In this way, the pose information of the target object can be accurately obtained, which facilitates grasping the target object and improves the user experience.
  • Fig. 6 shows a schematic structural diagram of a computing device according to an embodiment of the present disclosure; the specific embodiments of the present disclosure do not limit the specific implementation of the computing device.
  • As shown in Fig. 6, the computing device may include: a processor 602, a communication interface 604, a memory 606, and a communication bus 608.
  • The processor 602, the communication interface 604, and the memory 606 communicate with each other through the communication bus 608.
  • the communication interface 604 is used to communicate with network elements of other devices such as clients or other servers.
  • the processor 602 is configured to execute the program 610, and specifically, may execute relevant steps in the above embodiment of the target pose estimation method.
  • The program 610 may include program code, including computer operation instructions.
  • The processor 602 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present disclosure.
  • The one or more processors included in the device may be of the same type, such as one or more CPUs, or may be different types of processors, such as one or more CPUs and one or more ASICs.
  • The memory 606 is used for storing the program 610.
  • The memory 606 may include high-speed RAM memory and may also include non-volatile memory, such as at least one disk storage.
  • The program 610 can specifically be used to cause the processor 602 to perform the following operations: performing 2D detection according to an RGB image and a depth image to obtain a detection area of a target; acquiring a normalized model of the target from the RGB image in the detection area; acquiring size information of the target according to the depth image in the detection area; and fusing the size information with the normalized model to obtain a 3D model and applying a PnP algorithm according to the 3D model to obtain the pose information of the target.
  • In an optional manner, the program 610 causes the processor 602 to: apply a pre-built first convolutional neural network to process the RGB image to obtain the detection area of the target in the RGB image; and acquire, in the depth image, the detection area corresponding to the same target as in the RGB image.
  • In an optional manner, the program 610 causes the processor 602 to: apply a first network structure to process the RGB image in the detection area to obtain the normalized model of the target.
  • In an optional manner, the program 610 causes the processor 602 to: apply multiple groups of convolution + convolution + downsampling to downsample the RGB image in the detection area and perform a convolution operation on the lowest-resolution feature map; and apply multiple groups of upsampling + convolution + convolution to restore the resolution of the RGB image in the detection area to its original size, performing a preset number of convolution operations to obtain the normalized model of the target.
  • In an optional manner, the program 610 causes the processor 602 to: convert the depth image in the detection area into a point cloud, and apply a second network structure to process the point cloud to obtain the size information of the target.
  • In an optional manner, the program 610 causes the processor 602 to: calculate the 3D model from the size information and the normalized model using x′ = x × w, y′ = y × l, z′ = z × h, where (x, y, z) are the coordinates of the normalized model, (x′, y′, z′) are the coordinates of the 3D model, and (w, l, h) is the size information of the target, with w, l, h denoting the width, length, and height of the target, respectively.
  • In an optional manner, the program 610 causes the processor 602 to: apply a PnP algorithm to match the coordinates of the 3D model with the 2D image to obtain the pose information of the target.
  • The target pose estimation method in the embodiments of the present disclosure includes: performing 2D detection according to the RGB image and the depth image to obtain the detection area of the target; acquiring the normalized model of the target from the RGB image in the detection area; acquiring the size information of the target according to the depth image in the detection area; and fusing the size information with the normalized model to obtain a 3D model and applying a PnP algorithm according to the 3D model to obtain the pose information of the target. In this way, the pose information of the target object can be accurately obtained, which facilitates grasping the target object and improves the user experience.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to the technical field of computer vision, and provides a target pose estimation method and apparatus, a computing device, a storage medium, and a computer program. The method comprises: performing 2D detection according to an RGB image and a depth image to obtain a detection region of a target; obtaining a normalization model of the target by using the RGB image in the detection region; obtaining size information of the target according to the depth image in the detection region; fusing the size information and the normalization model to obtain a 3D model, and applying a PnP algorithm according to the 3D model to obtain pose information of the target. By means of the method, in embodiments of the present invention, pose information of a target object can be accurately obtained, the target object can be conveniently grabbed, and the user experience is improved.

Description

Target pose estimation method, device, computing device, storage medium and computer program
Cross Reference
This application claims priority to the Chinese patent application No. 202110743454.8, filed on June 30, 2021 and entitled "Target pose estimation method, device, computing device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present disclosure relate to the technical field of computer vision, and in particular to a target pose estimation method, device, computing device, storage medium, and computer program.
Background
Intelligent robots must not only perceive the surrounding world but also interact with the environment, and grasping is an indispensable capability. Robot grasping has great application value in both industrial and home scenarios, and pose estimation of the object to be grasped is an important factor in whether grasping succeeds. Existing pose estimation methods generally fall into feature matching methods, template methods, and deep-learning-based methods. The feature matching method usually computes and matches feature points between the 3D model and the 2D image, and then uses a PnP method to compute the pose. The template method models the 3D model of the object to be grasped from various viewpoints and estimates the pose by matching captured images against the templates. Deep-learning-based methods usually first collect a large number of color and depth images of the object to be grasped in various poses to create a dataset, and then train a convolutional neural network to directly or indirectly estimate the pose of the object to be grasped.
However, current algorithms still fall short when grasping real objects. The feature matching method often requires heavy computation, so the algorithm runs slowly. Moreover, the success of feature point selection and matching directly affects the accuracy of pose estimation, and for objects with few feature points the algorithm often cannot obtain accurate and stable results. Template matching methods often require producing a large number of templates, and since pose estimation is essentially a regression problem, the accuracy of the algorithm is roughly proportional to the number of templates, making it hard to strike a balance. Deep-learning-based methods regress the object pose directly through a convolutional neural network; most existing deep learning methods are instance-level and generalize poorly.
Summary
In view of the above problems, embodiments of the present disclosure provide a target pose estimation method, device, computing device, and storage medium that overcome, or at least partially solve, the above problems.
According to one aspect of the embodiments of the present disclosure, a target pose estimation method is provided, the method including: performing 2D detection according to an RGB image and a depth image to obtain a detection area of a target; acquiring a normalized model of the target from the RGB image in the detection area; acquiring size information of the target according to the depth image in the detection area; and fusing the size information with the normalized model to obtain a 3D model and applying a PnP algorithm according to the 3D model to obtain pose information of the target.
In an optional manner, performing 2D detection according to the RGB image and the depth image to obtain the detection area of the target includes: applying a pre-built first convolutional neural network to process the RGB image to obtain the detection area of the target in the RGB image; and acquiring, in the depth image, the detection area corresponding to the same target as in the RGB image.

In an optional manner, acquiring the normalized model of the target from the RGB image in the detection area includes: applying a first network structure to process the RGB image in the detection area to obtain the normalized model of the target.

In an optional manner, applying the first network structure to process the RGB image in the detection area to obtain the normalized model of the target includes: applying multiple groups of convolution + convolution + downsampling to downsample the RGB image in the detection area and performing a convolution operation on the lowest-resolution feature map; and applying multiple groups of upsampling + convolution + convolution to restore the resolution of the RGB image in the detection area to its original size, performing a preset number of convolution operations to obtain the normalized model of the target.

In an optional manner, acquiring the size information of the target according to the depth image in the detection area includes: converting the depth image in the detection area into a point cloud; and applying a second network structure to process the point cloud to obtain the size information of the target.
In an optional manner, fusing the size information with the normalized model to obtain the 3D model includes: calculating the 3D model from the size information and the normalized model using the following relations:

x′ = x × w,

y′ = y × l,

z′ = z × h,

where (x, y, z) are the coordinates of the normalized model, (x′, y′, z′) are the coordinates of the 3D model, and (w, l, h) is the size information of the target, with w, l, h denoting the width, length, and height of the target, respectively.

In an optional manner, applying the PnP algorithm according to the 3D model to obtain the pose information of the target includes: applying the PnP algorithm to match the coordinates of the 3D model with the 2D image to obtain the pose information of the target.
According to another aspect of the embodiments of the present disclosure, a target pose estimation device is provided, including: a 2D detection unit configured to perform 2D detection according to an RGB image and a depth image to obtain a detection area of a target; a normalization unit configured to acquire a normalized model of the target from the RGB image in the detection area; a size acquisition unit configured to acquire size information of the target according to the depth image in the detection area; and a pose estimation unit configured to fuse the size information with the normalized model to obtain a 3D model and apply a PnP algorithm according to the 3D model to obtain pose information of the target.

In an optional manner, the 2D detection unit is configured to: apply a pre-built first convolutional neural network to process the RGB image to obtain the detection area of the target in the RGB image; and acquire, in the depth image, the detection area corresponding to the same target as in the RGB image.

In an optional manner, the normalization unit is configured to: apply a first network structure to process the RGB image in the detection area to obtain the normalized model of the target.

In an optional manner, the normalization unit is configured to: apply multiple groups of convolution + convolution + downsampling to downsample the RGB image in the detection area and perform a convolution operation on the lowest-resolution feature map; and apply multiple groups of upsampling + convolution + convolution to restore the resolution of the RGB image in the detection area to its original size, performing a preset number of convolution operations to obtain the normalized model of the target.

In an optional manner, the size acquisition unit is configured to: convert the depth image in the detection area into a point cloud; and apply a second network structure to process the point cloud to obtain the size information of the target.
In an optional manner, the pose estimation unit is configured to: calculate the 3D model from the size information and the normalized model using the following relations:

x′ = x × w,

y′ = y × l,

z′ = z × h,

where (x, y, z) are the coordinates of the normalized model, (x′, y′, z′) are the coordinates of the 3D model, and (w, l, h) is the size information of the target, with w, l, h denoting the width, length, and height of the target, respectively.

In an optional manner, the pose estimation unit is configured to: apply a PnP algorithm to match the coordinates of the 3D model with the 2D image to obtain the pose information of the target.

According to another aspect of the embodiments of the present disclosure, a computing device is provided, including: a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with each other through the communication bus; the memory is used to store at least one executable instruction, and the executable instruction causes the processor to execute the steps of the above target pose estimation method.

According to yet another aspect of the embodiments of the present disclosure, a computer storage medium is provided, storing at least one executable instruction that causes a processor to execute the steps of the above target pose estimation method.

According to yet another aspect of the embodiments of the present disclosure, a computer program is provided, including instructions which, when run on a computer, cause the computer to execute the above target pose estimation method.
The target pose estimation method in the embodiments of the present disclosure includes: performing 2D detection according to the RGB image and the depth image to obtain the detection area of the target; acquiring the normalized model of the target from the RGB image in the detection area; acquiring the size information of the target according to the depth image in the detection area; and fusing the size information with the normalized model to obtain a 3D model and applying a PnP algorithm according to the 3D model to obtain the pose information of the target. In this way, the pose information of the target object can be accurately obtained, which facilitates grasping the target object and improves the user experience.

The above description is only an overview of the technical solutions of the embodiments of the present disclosure. In order that the technical means of the embodiments of the present disclosure may be understood more clearly and implemented according to the contents of the specification, and in order to make the above and other objects, features, and advantages of the embodiments of the present disclosure more apparent, specific implementations of the present disclosure are set forth below.

Description of Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered as limiting the present disclosure. Throughout the drawings, the same reference numerals designate the same components. In the drawings:
Fig. 1 shows a schematic flowchart of a target pose estimation method provided by an embodiment of the present disclosure;

Fig. 2 shows a schematic diagram of the first convolutional neural network in the target pose estimation method provided by an embodiment of the present disclosure;

Fig. 3 shows a schematic diagram of the first network structure in the target pose estimation method provided by an embodiment of the present disclosure;

Fig. 4 shows a schematic diagram of acquiring size information in the target pose estimation method provided by an embodiment of the present disclosure;

Fig. 5 shows a schematic structural diagram of a target pose estimation device provided by an embodiment of the present disclosure;

Fig. 6 shows a schematic structural diagram of a computing device provided by an embodiment of the present disclosure.
Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided for a more thorough understanding of the present disclosure and to fully convey its scope to those skilled in the art.

Fig. 1 shows a schematic flowchart of a target pose estimation method provided by an embodiment of the present disclosure. As shown in Fig. 1, the target pose estimation method includes:

Step S11: Perform 2D detection according to the RGB image and the depth image to obtain the detection area of the target.
In the embodiments of the present disclosure, optionally, a pre-built first convolutional neural network is applied to process the RGB image to obtain the detection area of the target in the RGB image, and the detection area corresponding to the same target as in the RGB image is acquired in the depth image. The first convolutional neural network is not limited to a specific detection or segmentation method; the goal is to obtain the specific area of the target (the object to be grasped) in the image and reduce background interference for subsequent pose estimation.

Before the first convolutional neural network is applied, it needs to be constructed. First, a dataset is built: RGB images of the object to be grasped are collected under different environmental backgrounds, and each RGB image is annotated with the best-fitting bounding box (x, y, w, h) and object category id. Then the large set of RGB image data is used to train a convolutional neural network (CNN), yielding the first convolutional neural network model. The network structure of the first convolutional neural network is shown in Fig. 2; the network has 31 layers, and the image block is scaled to a 448x448-pixel block as the network input.
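As a concrete illustration, the following is a minimal sketch of this detection step, assuming a trained detector `first_cnn` that maps a 448x448 RGB input to one normalized bounding box (x, y, w, h) and a class id; this interface is an assumption for illustration, not the patent's exact network.

```python
import cv2
import numpy as np

def detect_target(first_cnn, rgb: np.ndarray, depth: np.ndarray):
    """Run 2D detection on the RGB image and crop the same area in the depth image."""
    h, w = rgb.shape[:2]
    inp = cv2.resize(rgb, (448, 448)).astype(np.float32) / 255.0
    # Assumed output: a normalized (x, y, w, h) box and an object class id.
    (bx, by, bw, bh), class_id = first_cnn(inp)
    x0, y0 = int(bx * w), int(by * h)
    x1, y1 = int((bx + bw) * w), int((by + bh) * h)
    rgb_roi = rgb[y0:y1, x0:x1]
    depth_roi = depth[y0:y1, x0:x1]  # same detection area in the depth image
    return rgb_roi, depth_roi, class_id
```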
Step S12: Acquire the normalized model of the target from the RGB image in the detection area.

In the embodiments of the present disclosure, optionally, a first network structure is applied to process the RGB image in the detection area to obtain the normalized model of the target. The specific structure of the first network is shown in Fig. 3: multiple groups of convolution + convolution + downsampling downsample the RGB image in the detection area, a convolution operation is performed on the lowest-resolution feature map, multiple groups of upsampling + convolution + convolution restore the resolution of the RGB image in the detection area to its original size, and a preset number of convolution operations yields the normalized model of the target. Preferably, the RGB image in the detection area is processed by four groups of convolution + convolution + downsampling, one convolution operation on the lowest-resolution feature map, and four groups of upsampling + convolution + convolution, after which the normalized model of the target is output following two consecutive convolution operations. On the basis of 2D detection, the embodiments of the present disclosure use a U-net network structure to regress the normalized map, which can greatly improve the accuracy of the algorithm.
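A minimal PyTorch sketch of the structure just described follows: four convolution + convolution + downsampling groups, a convolution on the lowest-resolution feature map, four upsampling + convolution + convolution groups, and two final convolutions. The channel widths and the 3-channel output map are illustrative assumptions, and the skip connections of a full U-net are omitted for brevity.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # "convolution + convolution" group used on both the down and up paths
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class NormalizedModelNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.down = nn.ModuleList(
            [conv_block(3, 64), conv_block(64, 128),
             conv_block(128, 256), conv_block(256, 512)])
        self.pool = nn.MaxPool2d(2)
        self.mid = nn.Conv2d(512, 512, 3, padding=1)  # conv on lowest-resolution map
        self.up = nn.ModuleList(
            [conv_block(512, 256), conv_block(256, 128),
             conv_block(128, 64), conv_block(64, 64)])
        self.upsample = nn.Upsample(scale_factor=2, mode="bilinear",
                                    align_corners=False)
        self.head = nn.Sequential(  # two consecutive final convolutions
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 3, padding=1))

    def forward(self, x):
        for block in self.down:            # conv + conv + downsample, four groups
            x = self.pool(block(x))
        x = torch.relu(self.mid(x))
        for block in self.up:              # upsample + conv + conv, four groups
            x = block(self.upsample(x))
        return self.head(x)                # per-pixel normalized coordinates
```

For a 448x448 input, the four pooling stages reduce the feature map to 28x28, and the four upsampling stages restore the original resolution.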
Step S13: Acquire the size information of the target according to the depth image in the detection area.

Optionally, as shown in Fig. 4, the depth image in the detection area is converted into a point cloud. The conversion follows the pinhole camera model:

X = (x′ - c_x) × D / f_x,

Y = (y′ - c_y) × D / f_y,

Z = D,

where (X, Y, Z) are the point cloud coordinates, (x′, y′) are the image coordinates, D is the depth value, f_x and f_y are the focal lengths, and c_x and c_y are the principal point offsets.
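A short sketch of this conversion, assuming a depth image in metric units and known camera intrinsics (f_x, f_y, c_x, c_y):

```python
import numpy as np

def depth_to_point_cloud(depth: np.ndarray, fx: float, fy: float,
                         cx: float, cy: float) -> np.ndarray:
    """Back-project every depth pixel into a 3D point (N, 3)."""
    h, w = depth.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))  # image coordinates (x', y')
    Z = depth
    X = (xs - cx) * Z / fx
    Y = (ys - cy) * Z / fy
    points = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # keep only pixels with valid depth
```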
Then a second network structure is applied to process the point cloud to obtain the size information of the target. The second network structure preferably consists of a PointNet++ network followed by a convolutional layer and a fully connected layer. Through the PointNet++ network, the size of the object can be regressed, represented by S(w, l, h); the subsequent convolutional layer and fully connected layer recover the size information of the object.
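A minimal sketch of such a size-regression head is shown below, assuming a PointNet++-style backbone (not implemented here) that encodes the point cloud into a global feature vector; the feature dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SizeHead(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int = 1024):
        super().__init__()
        self.backbone = backbone            # e.g. a PointNet++ encoder, (B, feat_dim)
        self.conv = nn.Conv1d(feat_dim, 256, kernel_size=1)
        self.fc = nn.Linear(256, 3)         # regresses S(w, l, h)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(points)                    # global point-cloud feature
        feat = self.conv(feat.unsqueeze(-1)).squeeze(-1)  # convolutional layer
        return self.fc(torch.relu(feat))                # fully connected layer -> (w, l, h)
```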
Step S14: Fuse the size information with the normalized model to obtain a 3D model, and apply a PnP algorithm according to the 3D model to obtain the pose information of the target.

In the embodiments of the present disclosure, the complete 3D information of the target (the object to be grasped) is obtained by fusing the normalized model and the object size information. Optionally, the 3D model is calculated from the size information and the normalized model using the following relations:

x′ = x × w,

y′ = y × l,

z′ = z × h,

where (x, y, z) are the coordinates of the normalized model, (x′, y′, z′) are the coordinates of the 3D model, and (w, l, h) is the size information of the target, with w, l, h denoting the width, length, and height of the target, respectively.
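The fusion step itself reduces to elementwise scaling of the normalized model by the regressed size, as in the relations above; a minimal sketch:

```python
import numpy as np

def fuse_size(normalized_model: np.ndarray, size_whl: np.ndarray) -> np.ndarray:
    """normalized_model: (N, 3) array of (x, y, z); size_whl: (w, l, h)."""
    return normalized_model * size_whl  # broadcasts to (x*w, y*l, z*h)
```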
A PnP algorithm is then applied to match the coordinates of the 3D model with the 2D image to obtain the pose information of the target. The pose information of the target includes a rotation matrix R and a translation T. The PnP algorithm may be any existing PnP algorithm capable of realizing the above function, which will not be repeated here. By recovering size information from the depth map through PointNet++, the embodiments of the present disclosure add prior information for the post-processing algorithm and can obtain higher-precision results.
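A sketch of this final step using OpenCV's generic PnP solver; the patent does not mandate a particular PnP variant, and the 3D-2D correspondences passed in are assumed to be established as described above.

```python
import cv2
import numpy as np

def estimate_pose(model_points: np.ndarray, image_points: np.ndarray,
                  K: np.ndarray):
    """Solve for the pose (R, T) from 3D model points and their 2D pixels."""
    ok, rvec, tvec = cv2.solvePnP(
        model_points.astype(np.float64),
        image_points.astype(np.float64),
        K, None)  # no lens distortion assumed
    if not ok:
        raise RuntimeError("PnP failed")
    R, _ = cv2.Rodrigues(rvec)  # rotation matrix R
    return R, tvec              # pose (R, T) of the target
```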
The embodiments of the present disclosure use the RGB image to obtain the object category, segmentation result, and normalized model through convolutional neural networks; the depth image and the segmentation result yield the object size information; the size information and the normalized model are fused into a 3D model; and finally the pose information is obtained through PnP. T(x, y, z) represents the position in three-dimensional space, and a rotation matrix R represents the three-axis rotation in three-dimensional space. Using the normalized model solves the problems that objects of the same category vary in size and that camera scaling prevents recovering the exact size of the object; combined with the size recovered from the depth map, it addresses the instance-level limitation of current deep learning methods.
The following illustrates the steps of applying the target pose estimation method of the embodiments of the present disclosure to a robot (a sketch chaining these steps follows the list):

1) Prepare the robot equipment, including the robot base, robotic arm, depth camera, etc.;

2) Place the object on the table in front of the robotic arm, and capture the RGB image and depth image of the current position;

3) For the RGB image of the target object, use the target detection method to obtain the region of the object to be grasped under the current grasping viewpoint;

4) Use the normalized-model generation network to generate the standard normalized model of the object to be grasped;

5) Use the size estimation network to calculate the size information of the object to be grasped;

6) Fuse the size information with the normalized model, and use the PnP algorithm to calculate the pose information of the object to be grasped;

7) According to the pose, make the robotic arm perform the grasp.
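Under the same assumptions as the earlier sketches, steps 3) to 6) can be chained as follows; `roi_origin` is a hypothetical parameter that shifts ROI-local pixels into full-image coordinates so they agree with the intrinsics K, and the three networks are assumed to return NumPy arrays as annotated.

```python
import numpy as np

def estimate_object_pose(first_cnn, norm_net, size_net, rgb, depth, K,
                         roi_origin=(0, 0)):
    rgb_roi, depth_roi, _ = detect_target(first_cnn, rgb, depth)   # step 3)
    coord_map = norm_net(rgb_roi)        # (H, W, 3) normalized model, step 4)
    h, w = coord_map.shape[:2]
    points = depth_to_point_cloud(depth_roi, K[0, 0], K[1, 1], K[0, 2], K[1, 2])
    size_whl = size_net(points)          # regressed (w, l, h), step 5)
    model_3d = fuse_size(coord_map.reshape(-1, 3), size_whl)       # step 6)
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    x0, y0 = roi_origin                  # shift ROI pixels to full-image coordinates
    image_points = np.stack([xs + x0, ys + y0], -1).reshape(-1, 2)
    return estimate_pose(model_3d, image_points, K)                # pose (R, T)
```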
The target pose estimation method in the embodiments of the present disclosure includes: performing 2D detection according to the RGB image and the depth image to obtain the detection area of the target; acquiring the normalized model of the target from the RGB image in the detection area; acquiring the size information of the target according to the depth image in the detection area; and fusing the size information with the normalized model to obtain a 3D model and applying a PnP algorithm according to the 3D model to obtain the pose information of the target. In this way, the pose information of the target object can be accurately obtained, which facilitates grasping the target object and improves the user experience.
图5示出了本公开实施例的目标位姿估计装置的结构示意图,如图5所示,该目标位姿估计装置包括:2D检测单元501、归一化单元502、尺寸获取单元503以及位姿估计单元504。Fig. 5 shows a schematic structural diagram of an object pose estimation device according to an embodiment of the present disclosure. As shown in Fig. Pose Estimation Unit 504.
2D检测单元501用于根据RGB图像和深度图像进行2D检测,获取目标的检测区域;归一化单元502用于将所述检测区域内的所述RGB图像获取所述目标的归一化模型;尺寸获取单元503用于根据所述检测区域内的所述深度图像获取所述目标的尺寸信息;位姿估计单元504用于将所述尺寸信息与所述归一化模型融合获取3D模型,并根据所述3D模型应用PnP算法获取所述目标的位姿信息。The 2D detection unit 501 is used to perform 2D detection according to the RGB image and the depth image to obtain the detection area of the target; the normalization unit 502 is used to obtain the normalized model of the target from the RGB image in the detection area; The size acquisition unit 503 is configured to acquire size information of the target according to the depth image in the detection area; the pose estimation unit 504 is configured to fuse the size information with the normalized model to obtain a 3D model, and Applying a PnP algorithm according to the 3D model to obtain pose information of the target.
在一种可选的方式中,2D检测单元501用于:应用预构建的第一卷积神经网络对所述RGB图像进行处理,获取所述RGB图像中所述目标的所述检测区域;在所述深度图像中获取与所述RGB图像对应相同的所述目标的所述检测区域。In an optional manner, the 2D detection unit 501 is configured to: apply a pre-built first convolutional neural network to process the RGB image, and obtain the detection area of the target in the RGB image; The detection area corresponding to the same target as the RGB image is acquired in the depth image.
在一种可选的方式中,归一化单元502用于:应用第一网络结构对所述检测区域内的所述RGB图像进行处理,获取归一化的所述目标的归一化模型。In an optional manner, the normalization unit 502 is configured to: apply a first network structure to process the RGB image in the detection area, and obtain a normalized normalized model of the target.
在一种可选的方式中,归一化单元502用于:应用多组卷积+卷积+下采样组合对所述检测区域内的所述RGB图像进行下采样后在最低分辨率特征图上进行卷积操作;应用多组上采样+卷积+卷积组合将操作后的所述检测区域内的所述RGB图像的分辨率恢复到原始大小,并进行预设数量个卷积操作后得到所述目标的归一化模型。In an optional manner, the normalization unit 502 is configured to: apply multiple sets of convolution+convolution+downsampling combinations to downsample the RGB image in the detection area and generate the lowest resolution feature map Perform a convolution operation on the above; apply multiple groups of upsampling + convolution + convolution combination to restore the resolution of the RGB image in the detection area after the operation to the original size, and perform a preset number of convolution operations A normalized model of the target is obtained.
In an optional manner, the size acquisition unit 503 is configured to: convert the depth image in the detection area into a point cloud; and apply a second network structure to process the point cloud to obtain the size information of the target.
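A minimal sketch of the depth-to-point-cloud conversion using standard pinhole back-projection is given below; fx, fy, cx, cy are camera intrinsics, and the pixel coordinates are assumed to be in the same frame as the principal point (for a crop, offset them by the crop origin first). The second network structure itself, e.g. a point-cloud network that regresses (w, l, h), is not shown.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy, depth_scale=0.001):
    """Back-project a depth image to an N x 3 point cloud:
    Z = depth * depth_scale, X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    v, u = np.nonzero(depth)                       # keep pixels with valid depth
    z = depth[v, u].astype(np.float32) * depth_scale
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# Toy usage with assumed intrinsics and a flat 0.8 m depth map
depth = np.full((480, 640), 800, dtype=np.uint16)
cloud = depth_to_point_cloud(depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```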
In an optional manner, the pose estimation unit 504 is configured to: calculate the 3D model from the size information and the normalized model by applying the following relations:
x' = x × w,
y' = y × l,
z' = z × h,
where (x, y, z) are the coordinates of the normalized model, (x', y', z') are the coordinates of the 3D model, and (w, l, h) is the size information of the target, with w, l, and h denoting the width, length, and height of the target, respectively.
In an optional manner, the pose estimation unit 504 is configured to: apply a PnP algorithm to match the coordinates of the 3D model with the 2D image to obtain the pose information of the target.
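The two steps performed by the pose estimation unit — scaling the normalized model by (w, l, h) and solving PnP against the 2D matches — can be sketched with OpenCV as below. The unit-cube corners, intrinsics, and synthetic ground-truth pose exist only to make the demo self-contained and verifiable; in the method, the 3D points and their 2D matches come from the network outputs.

```python
import numpy as np
import cv2

# Hypothetical normalized model: the 8 corners of a unit cube centred at the origin.
corners = np.array([[x, y, z] for x in (-0.5, 0.5)
                              for y in (-0.5, 0.5)
                              for z in (-0.5, 0.5)], dtype=np.float32)
size_wlh = np.array([0.06, 0.12, 0.20], dtype=np.float32)  # assumed (w, l, h), metres
model_3d = corners * size_wlh              # x' = x*w, y' = y*l, z' = z*h

K = np.array([[600., 0., 320.],            # assumed camera intrinsics
              [0., 600., 240.],
              [0., 0., 1.]], dtype=np.float32)
rvec_gt = np.array([0.1, -0.2, 0.05], dtype=np.float32)    # synthetic pose
tvec_gt = np.array([0.02, -0.01, 0.60], dtype=np.float32)
points_2d, _ = cv2.projectPoints(model_3d, rvec_gt, tvec_gt, K, None)

ok, rvec, tvec = cv2.solvePnP(model_3d, points_2d, K, None)
R, _ = cv2.Rodrigues(rvec)                 # 3x3 rotation; (R, tvec) is the pose
```

cv2.Rodrigues converts the recovered rotation vector into a rotation matrix, so (R, tvec) is the 6-DoF pose that the grasping step then consumes.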
An embodiment of the present disclosure provides a non-volatile computer storage medium. The computer storage medium stores at least one executable instruction, and the computer-executable instruction can execute the target pose estimation method in any of the above method embodiments.
Specifically, the executable instruction may be used to cause a processor to perform the following operations:
performing 2D detection according to the RGB image and the depth image to obtain the detection area of the target;
obtaining a normalized model of the target from the RGB image in the detection area;
obtaining size information of the target according to the depth image in the detection area;
fusing the size information with the normalized model to obtain a 3D model, and applying a PnP algorithm according to the 3D model to obtain the pose information of the target.
In an optional manner, the executable instruction causes the processor to perform the following operations:
applying the pre-built first convolutional neural network to process the RGB image to obtain the detection area of the target in the RGB image;
obtaining, in the depth image, the detection area of the same target corresponding to the RGB image.
In an optional manner, the executable instruction causes the processor to perform the following operation:
applying the first network structure to process the RGB image in the detection area to obtain a normalized model of the target.
In an optional manner, the executable instruction causes the processor to perform the following operations:
applying multiple convolution + convolution + downsampling groups to downsample the RGB image in the detection area and then performing convolution operations on the lowest-resolution feature map;
applying multiple upsampling + convolution + convolution groups to restore the resolution of the processed feature map to the original size, and performing a preset number of convolution operations to obtain the normalized model of the target.
In an optional manner, the executable instruction causes the processor to perform the following operations:
converting the depth image in the detection area into a point cloud;
applying the second network structure to process the point cloud to obtain the size information of the target.
In an optional manner, the executable instruction causes the processor to perform the following operation:
calculating the 3D model from the size information and the normalized model by applying the following relations:
x' = x × w,
y' = y × l,
z' = z × h,
where (x, y, z) are the coordinates of the normalized model, (x', y', z') are the coordinates of the 3D model, and (w, l, h) is the size information of the target, with w, l, and h denoting the width, length, and height of the target, respectively.
In an optional manner, the executable instruction causes the processor to perform the following operation:
applying a PnP algorithm to match the coordinates of the 3D model with the 2D image to obtain the pose information of the target.
Fig. 6 shows a schematic structural diagram of a computing device according to an embodiment of the present disclosure; the specific embodiments of the present disclosure do not limit the specific implementation of the device.
As shown in Fig. 6, the device may include: a processor 602, a communications interface 604, a memory 606, and a communication bus 608.
The processor 602, the communications interface 604, and the memory 606 communicate with one another through the communication bus 608. The communications interface 604 is used to communicate with network elements of other devices, such as clients or other servers. The processor 602 is configured to execute a program 610, and may specifically execute the relevant steps in the above embodiments of the target pose estimation method.
Specifically, the program 610 may include program code, and the program code includes computer operation instructions.
The processor 602 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present disclosure. The one or more processors included in the device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
The memory 606 is used to store the program 610. The memory 606 may include a high-speed RAM memory, and may also include a non-volatile memory, such as at least one disk memory.
Specifically, the program 610 may be used to cause the processor 602 to perform the following operations:
performing 2D detection according to the RGB image and the depth image to obtain the detection area of the target;
obtaining a normalized model of the target from the RGB image in the detection area;
obtaining size information of the target according to the depth image in the detection area;
fusing the size information with the normalized model to obtain a 3D model, and applying a PnP algorithm according to the 3D model to obtain the pose information of the target.
In an optional manner, the program 610 causes the processor to perform the following operations:
applying the pre-built first convolutional neural network to process the RGB image to obtain the detection area of the target in the RGB image;
obtaining, in the depth image, the detection area of the same target corresponding to the RGB image.
In an optional manner, the program 610 causes the processor to perform the following operation:
applying the first network structure to process the RGB image in the detection area to obtain a normalized model of the target.
In an optional manner, the program 610 causes the processor to perform the following operations:
applying multiple convolution + convolution + downsampling groups to downsample the RGB image in the detection area and then performing convolution operations on the lowest-resolution feature map;
applying multiple upsampling + convolution + convolution groups to restore the resolution of the processed feature map to the original size, and performing a preset number of convolution operations to obtain the normalized model of the target.
In an optional manner, the program 610 causes the processor to perform the following operations:
converting the depth image in the detection area into a point cloud;
applying the second network structure to process the point cloud to obtain the size information of the target.
In an optional manner, the program 610 causes the processor to perform the following operation:
calculating the 3D model from the size information and the normalized model by applying the following relations:
x' = x × w,
y' = y × l,
z' = z × h,
where (x, y, z) are the coordinates of the normalized model, (x', y', z') are the coordinates of the 3D model, and (w, l, h) is the size information of the target, with w, l, and h denoting the width, length, and height of the target, respectively.
In an optional manner, the program 610 causes the processor to perform the following operation:
applying a PnP algorithm to match the coordinates of the 3D model with the 2D image to obtain the pose information of the target.
The algorithms or displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. The structure required to construct such a system is apparent from the above description. Furthermore, the embodiments of the present disclosure are not directed to any particular programming language. It should be understood that various programming languages may be used to implement the content of the present disclosure described herein, and the above description of a specific language is intended to disclose the best mode of the present disclosure.
Numerous specific details are set forth in the specification provided herein. However, it should be understood that the embodiments of the present disclosure may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this specification.
Similarly, it should be appreciated that, in the above description of the exemplary embodiments of the present disclosure, in order to streamline the disclosure and facilitate an understanding of one or more of the various inventive aspects, various features of the embodiments of the present disclosure are sometimes grouped together into a single embodiment, figure, or description thereof. However, this method of disclosure is not to be interpreted as reflecting an intention that the claimed disclosure requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the present disclosure.
In addition, those skilled in the art will appreciate that, although some embodiments herein include certain features included in other embodiments but not other features, combinations of features of different embodiments are meant to be within the scope of the present disclosure and form different embodiments. For example, in the following claims, any one of the claimed embodiments may be used in any combination.
It should be noted that the above embodiments illustrate rather than limit the present disclosure, and those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present disclosure may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and the like does not denote any order; these words may be interpreted as names. Unless otherwise specified, the steps in the above embodiments should not be construed as limiting the order of execution.

Claims (17)

  1. A target pose estimation method, characterized in that the method comprises:
    performing 2D detection according to an RGB image and a depth image to obtain a detection area of a target;
    obtaining a normalized model of the target from the RGB image in the detection area;
    obtaining size information of the target according to the depth image in the detection area;
    fusing the size information with the normalized model to obtain a 3D model, and applying a PnP algorithm according to the 3D model to obtain pose information of the target.
  2. The target pose estimation method according to claim 1, characterized in that the performing 2D detection according to the RGB image and the depth image to obtain the detection area of the target comprises:
    applying a pre-built first convolutional neural network to process the RGB image to obtain the detection area of the target in the RGB image;
    obtaining, in the depth image, the detection area of the same target corresponding to the RGB image.
  3. The target pose estimation method according to claim 1, characterized in that the obtaining a normalized model of the target from the RGB image in the detection area comprises:
    applying a first network structure to process the RGB image in the detection area to obtain a normalized model of the target.
  4. The target pose estimation method according to claim 3, characterized in that the applying a first network structure to process the RGB image in the detection area to obtain the normalized model of the target comprises:
    applying multiple convolution + convolution + downsampling groups to downsample the RGB image in the detection area and then performing convolution operations on the lowest-resolution feature map;
    applying multiple upsampling + convolution + convolution groups to restore the resolution of the processed feature map to the original size, and performing a preset number of convolution operations to obtain the normalized model of the target.
  5. The target pose estimation method according to claim 1, characterized in that the obtaining size information of the target according to the depth image in the detection area comprises:
    converting the depth image in the detection area into a point cloud;
    applying a second network structure to process the point cloud to obtain the size information of the target.
  6. The target pose estimation method according to claim 1, characterized in that the fusing the size information with the normalized model to obtain a 3D model comprises:
    calculating the 3D model from the size information and the normalized model by applying the following relations:
    x' = x × w,
    y' = y × l,
    z' = z × h,
    where (x, y, z) are the coordinates of the normalized model, (x', y', z') are the coordinates of the 3D model, and (w, l, h) is the size information of the target, with w, l, and h denoting the width, length, and height of the target, respectively.
  7. The target pose estimation method according to claim 6, characterized in that the applying a PnP algorithm according to the 3D model to obtain the pose information of the target comprises:
    applying the PnP algorithm to match the coordinates of the 3D model with a 2D image to obtain the pose information of the target.
  8. A target pose estimation apparatus, characterized in that the apparatus comprises:
    a 2D detection unit, configured to perform 2D detection according to an RGB image and a depth image to obtain a detection area of a target;
    a normalization unit, configured to obtain a normalized model of the target from the RGB image in the detection area;
    a size acquisition unit, configured to obtain size information of the target according to the depth image in the detection area;
    a pose estimation unit, configured to fuse the size information with the normalized model to obtain a 3D model, and to apply a PnP algorithm according to the 3D model to obtain pose information of the target.
  9. The target pose estimation apparatus according to claim 8, characterized in that the 2D detection unit is configured to:
    apply a pre-built first convolutional neural network to process the RGB image to obtain the detection area of the target in the RGB image;
    obtain, in the depth image, the detection area of the same target corresponding to the RGB image.
  10. The target pose estimation apparatus according to claim 8, characterized in that the normalization unit is configured to:
    apply a first network structure to process the RGB image in the detection area to obtain a normalized model of the target.
  11. The target pose estimation apparatus according to claim 10, characterized in that the normalization unit is configured to:
    apply multiple convolution + convolution + downsampling groups to downsample the RGB image in the detection area and then perform convolution operations on the lowest-resolution feature map;
    apply multiple upsampling + convolution + convolution groups to restore the resolution of the processed feature map to the original size, and perform a preset number of convolution operations to obtain the normalized model of the target.
  12. The target pose estimation apparatus according to claim 8, characterized in that the size acquisition unit is configured to:
    convert the depth image in the detection area into a point cloud;
    apply a second network structure to process the point cloud to obtain the size information of the target.
  13. The target pose estimation apparatus according to claim 8, characterized in that the pose estimation unit is configured to:
    calculate the 3D model from the size information and the normalized model by applying the following relations:
    x' = x × w,
    y' = y × l,
    z' = z × h,
    where (x, y, z) are the coordinates of the normalized model, (x', y', z') are the coordinates of the 3D model, and (w, l, h) is the size information of the target, with w, l, and h denoting the width, length, and height of the target, respectively.
  14. The target pose estimation apparatus according to claim 13, characterized in that the pose estimation unit is configured to:
    apply a PnP algorithm to match the coordinates of the 3D model with a 2D image to obtain the pose information of the target.
  15. A computing device, comprising: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with one another through the communication bus;
    the memory is used to store at least one executable instruction, and the executable instruction causes the processor to execute the steps of the target pose estimation method according to any one of claims 1-7.
  16. A computer storage medium, wherein at least one executable instruction is stored in the storage medium, and the executable instruction causes a processor to execute the steps of the target pose estimation method according to any one of claims 1-7.
  17. A computer program, comprising instructions which, when run on a computer, cause the computer to execute the target pose estimation method according to any one of claims 1-7.
PCT/CN2021/143442 2021-06-30 2021-12-30 Target pose estimation method and apparatus, computing device, storage medium, and computer program WO2023273272A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110743454.8 2021-06-30
CN202110743454.8A CN115222809B (en) 2021-06-30 2021-06-30 Target pose estimation method, device, computing equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023273272A1

Family

ID=83606059

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/143442 WO2023273272A1 (en) 2021-06-30 2021-12-30 Target pose estimation method and apparatus, computing device, storage medium, and computer program

Country Status (2)

Country Link
CN (1) CN115222809B (en)
WO (1) WO2023273272A1 (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108171748A (en) * 2018-01-23 2018-06-15 哈工大机器人(合肥)国际创新研究院 A kind of visual identity of object manipulator intelligent grabbing application and localization method
CN108555908A (en) * 2018-04-12 2018-09-21 同济大学 A kind of identification of stacking workpiece posture and pick-up method based on RGBD cameras
US20190012806A1 (en) * 2017-07-06 2019-01-10 Siemens Healthcare Gmbh Mobile Device Localization In Complex, Three-Dimensional Scenes
CN109255813A (en) * 2018-09-06 2019-01-22 大连理工大学 A kind of hand-held object pose real-time detection method towards man-machine collaboration
CN110322512A (en) * 2019-06-28 2019-10-11 中国科学院自动化研究所 In conjunction with the segmentation of small sample example and three-dimensional matched object pose estimation method
CN110793441A (en) * 2019-11-05 2020-02-14 北京华捷艾米科技有限公司 High-precision object geometric dimension measuring method and device
CN111968235A (en) * 2020-07-08 2020-11-20 杭州易现先进科技有限公司 Object attitude estimation method, device and system and computer equipment
CN112270249A (en) * 2020-10-26 2021-01-26 湖南大学 Target pose estimation method fusing RGB-D visual features
CN112562001A (en) * 2020-12-28 2021-03-26 中山大学 Object 6D pose estimation method, device, equipment and medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111055281B (en) * 2019-12-19 2021-05-07 杭州电子科技大学 ROS-based autonomous mobile grabbing system and method
CN112233181A (en) * 2020-10-29 2021-01-15 深圳市广宁股份有限公司 6D pose recognition method and device and computer storage medium

Also Published As

Publication number Publication date
CN115222809B (en) 2023-04-25
CN115222809A (en) 2022-10-21

Similar Documents

Publication Publication Date Title
CN110619676B (en) End-to-end three-dimensional face reconstruction method based on neural network
US11436745B1 (en) Reconstruction method of three-dimensional (3D) human body model, storage device and control device
WO2019174377A1 (en) Monocular camera-based three-dimensional scene dense reconstruction method
US9715761B2 (en) Real-time 3D computer vision processing engine for object recognition, reconstruction, and analysis
CN111862201B (en) Deep learning-based spatial non-cooperative target relative pose estimation method
CN113409384B (en) Pose estimation method and system of target object and robot
CN107953329B (en) Object recognition and attitude estimation method and device and mechanical arm grabbing system
WO2019011249A1 (en) Method, apparatus, and device for determining pose of object in image, and storage medium
CN110176032B (en) Three-dimensional reconstruction method and device
WO2022178952A1 (en) Target pose estimation method and system based on attention mechanism and hough voting
CN110097599B (en) Workpiece pose estimation method based on component model expression
CN111582220B (en) Bone point behavior recognition system based on shift map convolution neural network and recognition method thereof
CN111798373A (en) Rapid unmanned aerial vehicle image stitching method based on local plane hypothesis and six-degree-of-freedom pose optimization
CN112053441A (en) Full-automatic layout recovery method for indoor fisheye image
CN114898313A (en) Bird's-eye view image generation method, device, equipment and storage medium of driving scene
CN114494150A (en) Design method of monocular vision odometer based on semi-direct method
CN115471748A (en) Monocular vision SLAM method oriented to dynamic environment
CN115008454A (en) Robot online hand-eye calibration method based on multi-frame pseudo label data enhancement
WO2024037562A1 (en) Three-dimensional reconstruction method and apparatus, and computer-readable storage medium
CN111709269B (en) Human hand segmentation method and device based on two-dimensional joint information in depth image
CN117351078A (en) Target size and 6D gesture estimation method based on shape priori
WO2023273272A1 (en) Target pose estimation method and apparatus, computing device, storage medium, and computer program
CN115937002B (en) Method, apparatus, electronic device and storage medium for estimating video rotation
CN111198563A (en) Terrain recognition method and system for dynamic motion of foot type robot
CN113436251B (en) Pose estimation system and method based on improved YOLO6D algorithm

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21948194

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE