CN114972525B - Robot grabbing and augmented reality-oriented space target attitude estimation method - Google Patents

Robot grabbing and augmented reality-oriented space target attitude estimation method

Info

Publication number
CN114972525B
CN114972525B (Application CN202210422447.2A)
Authority
CN
China
Prior art keywords
points
feature
camera
scale
steps
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210422447.2A
Other languages
Chinese (zh)
Other versions
CN114972525A (en)
Inventor
吴鹏
王俊骁
王晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Sci Tech University ZSTU
Original Assignee
Zhejiang Sci Tech University ZSTU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Sci Tech University ZSTU filed Critical Zhejiang Sci Tech University ZSTU
Priority to CN202210422447.2A priority Critical patent/CN114972525B/en
Publication of CN114972525A publication Critical patent/CN114972525A/en
Application granted granted Critical
Publication of CN114972525B publication Critical patent/CN114972525B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T 7/00: Image analysis
                    • G06T 7/80: Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
                • G06T 2207/00: Indexing scheme for image analysis or image enhancement
                    • G06T 2207/10: Image acquisition modality
                        • G06T 2207/10024: Color image
                    • G06T 2207/20: Special algorithmic details
                        • G06T 2207/20081: Training; Learning
                        • G06T 2207/20084: Artificial neural networks [ANN]
            • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00: Computing arrangements based on biological models
                    • G06N 3/02: Neural networks
                        • G06N 3/04: Architecture, e.g. interconnection topology
                            • G06N 3/045: Combinations of networks
                            • G06N 3/048: Activation functions
                        • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a six-degree-of-freedom pose estimation method for spatial targets, oriented to robot grasping and augmented reality. Motivated by the need of augmented reality and collaborative-robot technology for six-degree-of-freedom pose information of spatial targets, the method is proposed and realized on the basis of a deep fully convolutional network combined with the multi-scale bounding box of the target's 3D model.

Description

Robot grabbing and augmented reality-oriented space target attitude estimation method
Technical Field
The invention relates to the fields of augmented reality, robot grasping and the like, and in particular to a six-degree-of-freedom pose estimation method based on the multi-scale bounding box of a spatial target.
Background
In fields such as augmented reality and collaborative robot grasping, the spatial pose of an object is indispensable information. In the traditional approach, point clouds of the scene and the object are acquired with a laser camera, and the relative pose of the object is obtained by point cloud registration. However, point cloud data are bulky and redundant and place demands on the acquisition equipment, so this approach struggles to meet lightweight, fast task requirements.
Disclosure of Invention
Therefore, the invention provides an RGB-image-based six-degree-of-freedom pose estimation method that requires no additional acquisition equipment and delivers stable and fast six-degree-of-freedom pose estimation.
A spatial target pose estimation method oriented to robot grasping and augmented reality comprises the following steps:
Step 1, calibrating the camera to obtain its intrinsic parameters, computing the 3D model of the object to obtain its multi-scale bounding boxes, and mapping the multi-scale bounding boxes onto the 2D image through the camera intrinsic matrix;
Step 2, taking the corner points that make up the multi-scale bounding boxes as feature points, and training a fully convolutional neural network to detect and locate these feature points; the network takes an RGB image as input and outputs Gaussian heat maps of the feature points;
Step 3, performing non-maximum suppression on the Gaussian heat maps output by the neural network to obtain concrete two-dimensional feature point coordinates;
Step 4, recovering the six-degree-of-freedom pose of the spatial target from the 2D-3D feature point correspondences through an improved EPnP algorithm, thereby providing a basis for subsequent grasping work;
The specific implementation of step 1 comprises the following sub-steps:
Step 1.1, obtaining the RGB camera intrinsic parameters through chessboard calibration;
Step 1.2, computing the maximum and minimum values of the object's 3D model along the x, y and z axes in the object coordinate system, thereby obtaining the bounding box at the object's original scale; computing the mid-point between the maximum and minimum on each axis, and multiplying the length on each axis (maximum minus minimum) by a coefficient to obtain bounding boxes of the 3D model at different scales;
Step 1.3, projecting the 3D multi-scale bounding boxes of the object onto pictures of different scenes through the camera intrinsic matrix and the pose information of the object, the calculation being:
(u′, v′, z′)ᵀ = R·(u, v, z)ᵀ + T
x=u′×fx÷z′+Cx
y=v′×fy÷z′+Cy
wherein u, v, z denote the x-, y- and z-axis coordinates in the 3D object frame, R denotes the rotation matrix, T denotes the translation vector, fx and fy denote the camera focal lengths along the x and y axes, Cx and Cy denote the camera principal point, all camera parameters being in pixel units, and x, y denote the coordinates in the 2D image;
The specific implementation of step 2 comprises the following sub-steps:
Step 2.1, extracting image features through modular convolutions to obtain feature maps, laying the foundation for subsequent feature point detection;
Step 2.2, performing further feature extraction with an attention mechanism module while keeping the feature map size and channel count unchanged, then applying three consecutive modular convolutions, and extracting features once more with a second attention mechanism module;
Step 2.3, reducing the channel dimension of the resulting feature map through semantic embedding modules and mapping the feature point probability distribution information onto n heat maps, where n is the number of scales multiplied by 8;
The specific implementation of step 3 comprises the following sub-steps:
Step 3.1, applying a 3x3 convolution to the Gaussian heat maps so that each pixel carries information from its neighbouring points;
Step 3.2, performing non-maximum suppression on the resulting Gaussian heat maps, converting the feature point probability distributions into concrete coordinate points;
The specific implementation of step 4 comprises the following sub-steps:
Step 4.1, adjusting the positions of the feature points using the parallel relations among the multi-scale bounding box edges, reducing the error of the neural network;
Step 4.2, exploiting the fact that line segments in the 3D model and the corresponding line segments in the 2D image share the same proportional relationship, and expanding the feature points through this equal-ratio relation between 2D and 3D segments, thereby reducing the influence of precision loss;
Step 4.3, randomly sampling the expanded feature point set, solving the pose with a PnP algorithm each time n feature points are sampled, repeating this process m times, computing the Euclidean distances between each of the m pose results and the others, and selecting the result with the smallest distance as the final result.
Aiming at the requirements of augmented reality and collaborative-robot technology for six-degree-of-freedom pose information of spatial targets, the invention proposes and realizes a six-degree-of-freedom pose estimation method for spatial targets based on a deep fully convolutional network combined with the multi-scale bounding box of the target's 3D model. The method is robust, accurate and fast enough for practical processing, and can be applied in scenarios such as collaborative robot grasping and augmented reality.
Drawings
FIG. 1 is a flow chart of the attitude estimation of the present invention.
Fig. 2 is a schematic view of camera calibration and bounding box calculation and projection.
Fig. 3 is a diagram of a neural network.
Fig. 4 is a block diagram of a convolution.
Fig. 5 is a block diagram of an attention mechanism module.
Fig. 6 is a structural diagram of a semantic embedding module.
Fig. 7 is a schematic diagram of a modified EPnP algorithm.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention; all other embodiments obtained by those skilled in the art from the described embodiments without inventive effort fall within the scope of the invention.
The pose estimation steps of this embodiment are shown in FIG. 1. The embodiment can be applied to six-degree-of-freedom pose estimation of spatial targets in augmented reality and in robot-arm grasping, and proceeds as follows:
Step 1, calibrating the camera to obtain its intrinsic parameters, computing the 3D model of the object to obtain its multi-scale bounding boxes, and mapping the multi-scale bounding boxes onto the 2D image through the camera intrinsic matrix 101;
In this example, accurate camera intrinsic parameters are obtained by chessboard calibration, the multi-scale bounding boxes of the 3D model are computed, and the bounding boxes are mapped onto the 2D images using the intrinsic parameters and the pose information, thereby completing the annotation of the dataset. The steps are as follows:
Step 1.1, a standard 9x7 calibration chessboard 202 is printed and used to calibrate the camera 201, yielding the intrinsic matrix 203;
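As an illustration of step 1.1, the following sketch calibrates a camera from chessboard images with OpenCV. The image folder, the square size and the interpretation of 9x7 as inner corners are assumptions for illustration, not details taken from the patent.

```python
# A minimal calibration sketch, assuming a 9x7 inner-corner chessboard and a
# set of captured images; square_size and the image folder are illustrative.
import glob
import cv2
import numpy as np

pattern = (9, 7)                       # inner corners per row/column (assumed)
square_size = 0.025                    # chessboard square edge in metres (assumed)

# 3D corner template on the Z = 0 plane of the board
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square_size

obj_points, img_points = [], []
for path in glob.glob("calib/*.jpg"):  # hypothetical image folder
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

# K is the 3x3 intrinsic matrix containing fx, fy, Cx, Cy (in pixels)
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
```

K then plays the role of the intrinsic matrix 203 used in the projections below.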
Step 1.2, the multi-scale bounding boxes 205 of the 3D model 204 are computed. The point cloud of the 3D model is first read and its maximum and minimum values along the x, y and z axes are computed; the corners of the bounding box at the original scale are:
(xmax,ymax,zmax), (xmax,ymax,zmin), (xmax,ymin,zmax), (xmax,ymin,zmin), (xmin,ymax,zmax), (xmin,ymax,zmin), (xmin,ymin,zmax), (xmin,ymin,zmin). The max-min differences and mid-points of the three axes are then computed, and the corners of a bounding box at a given scale are calculated as follows (a code sketch is given after the formulas):
x=(xmid±lengthx×scale_factor)
y=(ymid±lengthy×scale_factor)
z=(zmid±lengthz×scale_factor)
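The scale formulas above can be written as a short sketch. The concrete scale factors are illustrative, and the factor of 1/2 is folded into the code so that a scale factor of 1.0 reproduces the original-scale box; both choices are assumptions rather than values fixed by the patent.

```python
# A sketch of the multi-scale bounding box of step 1.2, assuming the model is
# available as an N x 3 point array and an illustrative list of scale factors.
import numpy as np
from itertools import product

def multi_scale_bbox(points, scale_factors=(1.0, 1.25, 1.5)):
    """Return a dict scale_factor -> (8, 3) array of box corner points."""
    p_min, p_max = points.min(axis=0), points.max(axis=0)
    mid = (p_min + p_max) / 2.0
    length = p_max - p_min                      # per-axis extent (max - min)
    boxes = {}
    for s in scale_factors:
        half = length * s / 2.0                 # scaled half-extent per axis
        corners = np.array([mid + half * np.array(sign)
                            for sign in product((-1, 1), repeat=3)])
        boxes[s] = corners                      # 8 corners per scale
    return boxes
```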
Step 1.3, the multi-scale bounding boxes 205 of the 3D model are mapped into the 2D image 207 at the corresponding pose 206 using the camera imaging model, completing the annotation of the dataset.
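The projection of step 1.3 corresponds to the following minimal sketch, matching the projection formulas given in step 1.3 above and assuming the pose is given as a rotation matrix R and translation vector T:

```python
# A sketch of the projection in step 1.3: the 3D corners are transformed by the
# object pose (R, T) and projected with the intrinsics fx, fy, Cx, Cy.
import numpy as np

def project_points(pts_3d, R, T, fx, fy, cx, cy):
    """pts_3d: (N, 3) model-frame points; R: (3, 3); T: (3,). Returns (N, 2) pixels."""
    cam = pts_3d @ R.T + T              # (u', v', z') in the camera frame
    u, v, z = cam[:, 0], cam[:, 1], cam[:, 2]
    x = u * fx / z + cx                 # x = u' * fx / z' + Cx
    y = v * fy / z + cy                 # y = v' * fy / z' + Cy
    return np.stack([x, y], axis=1)
```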
Step 2, taking the corner points that make up the multi-scale bounding boxes as feature points, a fully convolutional neural network is trained to detect and locate these feature points; the network takes an RGB image as input and outputs Gaussian heat maps 102 of the feature points;
Step 2 is implemented as follows:
Step 2.1, an RGB image 301 of size 640x480 is fed into the first modular convolution 302, which outputs an 8-channel feature map of size 320x240 and a 16-channel feature map of size 160x120. As shown in FIG. 4, inside the modular convolution the input feature map 401 passes through an ordinary convolution 402 and a 1x1 convolution 405; the output of 402 is passed through a dilated convolution 403 and added to the output of 405; a pooling operation 404 and a down-sampling convolution then yield one feature map whose size is halved and channel count doubled, and one feature map whose size is reduced by a factor of 4.
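One possible PyTorch reading of the modular convolution of FIG. 4 is sketched below; the kernel sizes, the dilation rate and the exact placement of the pooling and down-sampling branch are assumptions, since the text leaves them open.

```python
# A sketch of the modular convolution block; layer choices marked as assumed.
import torch
import torch.nn as nn

class ModularConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)                   # 402: ordinary conv (assumed 3x3)
        self.dilated = nn.Conv2d(out_ch, out_ch, 3, padding=2, dilation=2)   # 403: dilated conv (assumed rate 2)
        self.conv1x1 = nn.Conv2d(in_ch, out_ch, 1)                           # 405: 1x1 conv
        self.pool = nn.MaxPool2d(2)                                          # 404: halves the spatial size
        self.down = nn.Conv2d(out_ch, out_ch * 2, 3, stride=2, padding=1)    # down-sampling conv (assumed)

    def forward(self, x):
        fused = self.dilated(self.conv(x)) + self.conv1x1(x)  # add the two branches
        half = self.pool(fused)            # feature map at 1/2 the input size
        quarter = self.down(half)          # feature map at 1/4 the input size
        return half, quarter
```

With in_ch=3 and out_ch=8 this reproduces the stated 8-channel 320x240 and 16-channel 160x120 outputs for a 640x480 input.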
Step 2.2, the first attention mechanism module 303 performs feature extraction on the 8-channel feature map with the size and channel count unchanged, and three modular convolution operations 304, 305 and 306 are applied to its output; the final feature map is 40x30x78, on which the second attention mechanism module 307 again performs feature extraction with size and channels unchanged. The attention mechanism module is shown in FIG. 5: the input feature 501 is passed through average pooling 502 and max pooling 503 and then through the shared fully connected layers 504, 505 and 506 to obtain two sets of features 507 and 508; their sum is passed through a Sigmoid 509 to obtain the channel attention weights 510; further average pooling 511 and max pooling 512 followed by concatenation and a Sigmoid 513 yield the spatial attention weights 514; multiplying the input feature 501 by the channel attention weights 510 and the spatial attention weights 514 gives the final result 515.
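The description matches a CBAM-style block, which can be sketched as follows. The channel-reduction ratio, the 7x7 spatial convolution and applying the spatial pooling after the channel weighting are assumptions not fixed by the text.

```python
# A CBAM-style sketch of FIG. 5: channel attention from shared FC layers over
# average- and max-pooled features, then spatial attention from concatenated
# channel-wise mean/max maps.
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(                        # shared FC layers 504-506
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # assumed kernel size

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))               # 502: average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))                # 503: max pooling branch
        ch_w = torch.sigmoid(avg + mx).view(b, c, 1, 1)  # 509/510: channel attention weights
        x_ch = x * ch_w
        sp = torch.cat([x_ch.mean(dim=1, keepdim=True),
                        x_ch.amax(dim=1, keepdim=True)], dim=1)  # 511/512 + concatenation
        sp_w = torch.sigmoid(self.spatial(sp))           # 513/514: spatial attention weights
        return x_ch * sp_w                               # 515: input * channel * spatial weights
```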
Step 2.3, the feature map output by 307 is restored to the 640x480 size by three consecutive semantic embedding modules 308, 309 and 310 and output as the Gaussian heat maps 311. The semantic embedding module is shown in FIG. 6: 601 and 602 are a high-resolution feature and a low-resolution feature respectively; the low-resolution feature 602 is brought to the same resolution as 601 by a 3x3 convolution 604 followed by bilinear interpolation 605, and the output of the 1x1 convolution 603 applied to the high-resolution feature 601 is multiplied with the output of 605 to obtain the final output feature 606.
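A minimal sketch of the semantic embedding module of FIG. 6, with channel counts left as parameters since the actual values are not specified here:

```python
# A sketch of the semantic embedding module: the low-resolution feature is
# upsampled to the high-resolution grid and used to gate the high-resolution
# feature by element-wise multiplication.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticEmbedding(nn.Module):
    def __init__(self, high_ch, low_ch, out_ch):
        super().__init__()
        self.conv1x1 = nn.Conv2d(high_ch, out_ch, 1)             # 603: 1x1 conv on high-res feature
        self.conv3x3 = nn.Conv2d(low_ch, out_ch, 3, padding=1)   # 604: 3x3 conv on low-res feature

    def forward(self, high, low):
        low = F.interpolate(self.conv3x3(low), size=high.shape[2:],
                            mode="bilinear", align_corners=False)  # 605: bilinear upsampling
        return self.conv1x1(high) * low                           # 606: element-wise product
```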
Step 3, non-maximum suppression is performed on the Gaussian heat maps output by the neural network to obtain concrete two-dimensional feature point coordinates 103;
Step 3 is implemented as follows:
Step 3.1, a 3x3 Gaussian-kernel convolution is applied to the Gaussian heat maps so that each pixel carries information from its neighbouring points (see the sketch after step 3.2).
Step 3.2, non-maximum suppression is performed on the resulting Gaussian heat maps: after applying 3x3 max pooling to each heat map, the coordinate of the maximum point in the heat map is selected as the feature point coordinate.
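Steps 3.1 and 3.2 together amount to the following heat map decoding sketch; the Gaussian sigma is an assumption.

```python
# A sketch of steps 3.1-3.2: smooth each heat map with a small Gaussian kernel,
# keep only 3x3 local maxima via max pooling, and read off the peak coordinate.
import torch
import torch.nn.functional as F

def decode_heatmaps(heatmaps, sigma=1.0):
    """heatmaps: (N, H, W) tensor -> (N, 2) integer (x, y) peak coordinates."""
    n, h, w = heatmaps.shape
    # 3x3 Gaussian kernel so each pixel also carries neighbour information
    coords = torch.arange(3, dtype=torch.float32) - 1.0
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    kernel = (g[:, None] * g[None, :]) / (g.sum() ** 2)
    smoothed = F.conv2d(heatmaps[:, None], kernel[None, None], padding=1)

    # non-maximum suppression: a pixel survives only if it equals the 3x3 max
    pooled = F.max_pool2d(smoothed, 3, stride=1, padding=1)
    peaks = smoothed * (smoothed == pooled)

    idx = peaks.view(n, -1).argmax(dim=1)
    return torch.stack([idx % w, idx // w], dim=1)   # (x, y) per heat map
```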
Step 4, the 2D-3D feature point correspondences are recovered into the six-degree-of-freedom pose of the spatial target through an improved EPnP algorithm, providing a basis 104 for subsequent grasping work.
Step 4 is implemented as follows:
Step 4.1, each scale's bounding box consists of 8 corner points and 12 edges, and edges that are parallel to the image plane in 3D space theoretically remain parallel after projection onto the 2D image. Therefore the inner product of each pair of edges in the same parallel group is computed at each scale (the inner product of two parallel edges is 0), and the slope of the edge in the pair with the larger value is corrected to reduce the error of the neural network; the adjustment effect is shown at 701.
Step 4.2, line segments in the 3D model and the corresponding line segments in the 2D image share the same proportional relationship, and feature points are expanded through this equal-ratio relation between 2D and 3D segments, reducing the influence of precision loss. 702 shows the two endpoints of a segment before expansion, and 703 the set of points on the segment after expansion.
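A sketch of this expansion along one bounding box edge, taking the patent's equal-ratio assumption at face value (under full perspective projection the ratios hold only approximately); the number of extra points is illustrative.

```python
# Step 4.2 sketch: generate extra 2D-3D correspondences along an edge at
# matching fractional positions t on the 3D edge and its detected 2D segment.
import numpy as np

def expand_edge(p3d_a, p3d_b, p2d_a, p2d_b, n_extra=4):
    """Return (n_extra, 3) 3D points and (n_extra, 2) 2D points."""
    p3d_a, p3d_b = np.asarray(p3d_a, float), np.asarray(p3d_b, float)
    p2d_a, p2d_b = np.asarray(p2d_a, float), np.asarray(p2d_b, float)
    t = np.linspace(0.0, 1.0, n_extra + 2)[1:-1, None]   # interior ratios only
    pts_3d = p3d_a + t * (p3d_b - p3d_a)                  # points on the 3D edge
    pts_2d = p2d_a + t * (p2d_b - p2d_a)                  # same ratios on the 2D segment
    return pts_3d, pts_2d
```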
Step 4.3, the expanded feature point set is randomly sampled; each time n feature points are sampled, the pose is solved with a PnP algorithm, and this process is executed m times. The Euclidean distance between each of the m pose results and the others is computed, and the result with the smallest distance is selected as the final result. All randomly sampled poses are shown at 704, and minimizing the Euclidean distance yields the result 705.
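A sketch of this sampling and selection strategy using OpenCV's EPnP solver; measuring the distance between poses on the concatenated rotation and translation vectors, and the values of n and m, are assumptions.

```python
# Step 4.3 sketch: sample n correspondences m times, solve PnP for each sample,
# and keep the pose whose summed Euclidean distance to the other poses is smallest.
import numpy as np
import cv2

def select_pose(pts_3d, pts_2d, K, n=8, m=50, seed=0):
    """pts_3d: (N, 3) array, pts_2d: (N, 2) array, K: 3x3 intrinsic matrix."""
    rng = np.random.default_rng(seed)
    poses = []
    for _ in range(m):
        idx = rng.choice(len(pts_3d), size=n, replace=False)
        ok, rvec, tvec = cv2.solvePnP(pts_3d[idx].astype(np.float64),
                                      pts_2d[idx].astype(np.float64),
                                      K, None, flags=cv2.SOLVEPNP_EPNP)
        if ok:
            poses.append(np.concatenate([rvec.ravel(), tvec.ravel()]))
    poses = np.stack(poses)
    # pairwise Euclidean distances between pose vectors
    dists = np.linalg.norm(poses[:, None, :] - poses[None, :, :], axis=-1)
    best = dists.sum(axis=1).argmin()            # pose closest to all the others
    rvec, tvec = poses[best, :3], poses[best, 3:]
    return cv2.Rodrigues(rvec.reshape(3, 1))[0], tvec  # rotation matrix R and translation T
```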
This completes one full six-degree-of-freedom pose estimation pass for the spatial target.
The experiments were run on an RTX 3090 GPU under Windows 10, with a software environment of Python 3.7, CUDA 11.1, PyTorch 1.9.1 and cuDNN 8.0.0. The LINEMOD dataset was used: all image samples are stored in jpg format, depth images in png format, model information in ply format and bounding box information in npy format; the network learning rate is 0.0003, and one sample is trained per iteration. The experimental results are shown in Table 1:
Table 1 lists the detection metrics and results. 2D Projection denotes the projection error of the predicted pose on the 2D image; ADD(-S) denotes the proportion of samples whose minimum weighted average point distance is below 10% of the model radius; 5cm5° denotes the proportion of predictions whose error with respect to the ground-truth pose is below 5 cm and 5°:
TABLE 1
Judging from the detection metrics, whether measured by 2D Projection, ADD(-S) or 5cm5°, the method estimates different targets well. In particular, it improves markedly on smaller models (cat, duck, etc.), raising the accuracy on the 5cm5° metric from 40.6 to 76.0, while processing RGB images at up to 15 fps. The method therefore achieves high detection accuracy, breaks the limitation of traditional methods on pose estimation of small models, and maintains the required processing speed.
In summary, aiming at the requirements of augmented reality and collaborative-robot technology for six-degree-of-freedom pose information of spatial targets, the method is proposed and realized based on a deep fully convolutional network combined with the multi-scale bounding box of the target's 3D model. The method is robust, accurate and fast enough for practical processing, and can be applied in scenarios such as collaborative robot grasping and augmented reality.

Claims (1)

1. A spatial target pose estimation method oriented to robot grasping and augmented reality, characterized by comprising the following steps:
Step 1, calibrating the camera to obtain its intrinsic parameters, computing the 3D model of the object to obtain its multi-scale bounding boxes, and mapping the multi-scale bounding boxes onto the 2D image through the camera intrinsic matrix;
Step 2, taking the corner points that make up the multi-scale bounding boxes as feature points, and training a fully convolutional neural network to detect and locate these feature points; the network takes an RGB image as input and outputs Gaussian heat maps of the feature points;
Step 3, performing non-maximum suppression on the Gaussian heat maps output by the neural network to obtain concrete two-dimensional feature point coordinates;
Step 4, recovering the six-degree-of-freedom pose of the spatial target from the 2D-3D feature point correspondences through an improved EPnP algorithm, thereby providing a basis for subsequent grasping work;
Step 1 comprises the following sub-steps:
Step 1.1, obtaining the RGB camera intrinsic parameters through chessboard calibration;
Step 1.2, computing the maximum and minimum values of the object's 3D model along the x, y and z axes in the object coordinate system, thereby obtaining the bounding box at the object's original scale; computing the mid-point between the maximum and minimum on each axis, and multiplying the length on each axis by a coefficient to obtain bounding boxes of the 3D model at different scales;
Step 1.3, projecting the 3D multi-scale bounding boxes of the object onto pictures of different scenes through the camera intrinsic matrix and the pose information of the object, the calculation being:
(u′, v′, z′)ᵀ = R·(u, v, z)ᵀ + T
x=u′×fx÷z′+Cx
y=v′×fy÷z′+Cy
wherein u, v, z denote the x-, y- and z-axis coordinates in the 3D object frame, R denotes the rotation matrix, T denotes the translation vector, fx and fy denote the camera focal lengths along the x and y axes, Cx and Cy denote the camera principal point, all camera parameters being in pixel units, and x, y denote the coordinates in the 2D image;
Step 2 comprises the following sub-steps:
Step 2.1, extracting image features through modular convolutions to obtain feature maps, laying the foundation for subsequent feature point detection;
Step 2.2, performing further feature extraction with an attention mechanism module while keeping the feature map size and channel count unchanged, then applying three consecutive modular convolutions, and extracting features once more with a second attention mechanism module;
Step 2.3, reducing the channel dimension of the resulting feature map through semantic embedding modules and mapping the feature point probability distribution information onto n heat maps, where n is the number of scales multiplied by 8;
Step 3 comprises the following sub-steps:
Step 3.1, applying a 3x3 convolution to the Gaussian heat maps so that each pixel carries information from its neighbouring points;
Step 3.2, performing non-maximum suppression on the resulting Gaussian heat maps, converting the feature point probability distributions into concrete coordinate points;
Step 4 comprises the following sub-steps:
Step 4.1, adjusting the positions of the feature points using the parallel relations among the multi-scale bounding box edges, reducing the error of the neural network;
Step 4.2, exploiting the fact that line segments in the 3D model and the corresponding line segments in the 2D image share the same proportional relationship, and expanding the feature points through this equal-ratio relation between 2D and 3D segments, thereby reducing the influence of precision loss;
Step 4.3, randomly sampling the expanded feature point set, solving the pose with a PnP algorithm each time n feature points are sampled, repeating this process m times, computing the Euclidean distances between each of the m pose results and the others, and selecting the result with the smallest distance as the final result.
CN202210422447.2A 2022-04-21 2022-04-21 Robot grabbing and augmented reality-oriented space target attitude estimation method Active CN114972525B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210422447.2A CN114972525B (en) 2022-04-21 2022-04-21 Robot grabbing and augmented reality-oriented space target attitude estimation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210422447.2A CN114972525B (en) 2022-04-21 2022-04-21 Robot grabbing and augmented reality-oriented space target attitude estimation method

Publications (2)

Publication Number Publication Date
CN114972525A CN114972525A (en) 2022-08-30
CN114972525B true CN114972525B (en) 2024-05-14

Family

ID=82979162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210422447.2A Active CN114972525B (en) 2022-04-21 2022-04-21 Robot grabbing and augmented reality-oriented space target attitude estimation method

Country Status (1)

Country Link
CN (1) CN114972525B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021244079A1 (en) * 2020-06-02 2021-12-09 苏州科技大学 Method for detecting image target in smart home environment
CN113858217A (en) * 2021-12-01 2021-12-31 常州唯实智能物联创新中心有限公司 Multi-robot interaction three-dimensional visual pose perception method and system
CN113927597A (en) * 2021-10-21 2022-01-14 燕山大学 Robot connecting piece six-degree-of-freedom pose estimation system based on deep learning
WO2022061673A1 (en) * 2020-09-24 2022-03-31 西门子(中国)有限公司 Calibration method and device for robot

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2017203904A1 (en) * 2016-06-15 2018-01-18 Dotty Digital Pty Ltd A system, device, or method for collaborative augmented reality

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021244079A1 (en) * 2020-06-02 2021-12-09 苏州科技大学 Method for detecting image target in smart home environment
WO2022061673A1 (en) * 2020-09-24 2022-03-31 西门子(中国)有限公司 Calibration method and device for robot
CN113927597A (en) * 2021-10-21 2022-01-14 燕山大学 Robot connecting piece six-degree-of-freedom pose estimation system based on deep learning
CN113858217A (en) * 2021-12-01 2021-12-31 常州唯实智能物联创新中心有限公司 Multi-robot interaction three-dimensional visual pose perception method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A deep-learning-based optimal grasping pose detection method for robots; 李秀智; 李家豪; 张祥银; 彭小彬; Chinese Journal of Scientific Instrument; 2020-12-31 (No. 05); 111-120 *
3D object recognition and pose estimation based on C-SHOT features in complex scenes; 张凯霖; 张良; Journal of Computer-Aided Design & Computer Graphics; 2017-05-15 (No. 05); 59-66 *

Also Published As

Publication number Publication date
CN114972525A (en) 2022-08-30

Similar Documents

Publication Publication Date Title
CN111563923B (en) Method for obtaining dense depth map and related device
CN111639663B (en) Multi-sensor data fusion method
CN110853075B (en) Visual tracking positioning method based on dense point cloud and synthetic view
TW202117611A (en) Computer vision training system and method for training computer vision system
CN112435223B (en) Target detection method, device and storage medium
CN112598735A (en) Single-image object pose estimation method fusing three-dimensional model information
CN113379815A (en) Three-dimensional reconstruction method and device based on RGB camera and laser sensor and server
CN112767486A (en) Monocular 6D attitude estimation method and device based on deep convolutional neural network
CN116092178A (en) Gesture recognition and tracking method and system for mobile terminal
KR100362171B1 (en) Apparatus, method and computer readable medium for computing a transform matrix using image feature point matching technique, and apparatus, method and computer readable medium for generating mosaic image using the transform matrix
CN113327295A (en) Robot rapid grabbing method based on cascade full convolution neural network
JP6016242B2 (en) Viewpoint estimation apparatus and classifier learning method thereof
CN114972525B (en) Robot grabbing and augmented reality-oriented space target attitude estimation method
Morales et al. Real-time adaptive obstacle detection based on an image database
Zhang et al. Data association between event streams and intensity frames under diverse baselines
KR101673144B1 (en) Stereoscopic image registration method based on a partial linear method
CN116342698A (en) Industrial part 6D pose estimation method based on incomplete geometric completion
CN116468731A (en) Point cloud semantic segmentation method based on cross-modal Transformer
CN114119999B (en) Iterative 6D pose estimation method and device based on deep learning
CN115984592A (en) Point-line fusion feature matching method based on SuperPoint + SuperGlue
CN115496859A (en) Three-dimensional scene motion trend estimation method based on scattered point cloud cross attention learning
CN116152334A (en) Image processing method and related equipment
CN114387351A (en) Monocular vision calibration method and computer readable storage medium
CN114972451A (en) Rotation-invariant SuperGlue matching-based remote sensing image registration method
CN113012298A (en) Curved MARK three-dimensional registration augmented reality method based on region detection

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant