CN115641322A - Robot grabbing method and system based on 6D pose estimation - Google Patents

Robot grabbing method and system based on 6D pose estimation

Info

Publication number
CN115641322A
Authority
CN
China
Prior art keywords
robot
grabbing
pose estimation
pose
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211376147.1A
Other languages
Chinese (zh)
Inventor
史金龙
顾健
钱强
钱萍
欧镇
田朝晖
於跃成
白素琴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University of Science and Technology
Original Assignee
Jiangsu University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Jiangsu University of Science and Technology filed Critical Jiangsu University of Science and Technology
Priority to CN202211376147.1A priority Critical patent/CN115641322A/en
Publication of CN115641322A publication Critical patent/CN115641322A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a robot grasping method and system based on 6D pose estimation. The method comprises constructing a robot grasping system based on 6D pose estimation, the system comprising a data set generation module, a pose estimation module, a robot control module and a corresponding network loss function. The method comprises the following steps: the data set generation module renders a synthetic data set from the three-dimensional model of the object and the background material information; the pose estimation module inputs the data set into a 6D pose estimation network for training to obtain a model file, and predicts the pose information of each object in an input RGB image according to the trained model file; and the robot control module completes the conversion from the camera coordinate system to the robot base coordinate system through robot hand-eye calibration, and sends a grasping instruction to the robot to realize grasping. The invention provides a robot grasping technology that can handle occlusion, multiple similar targets, stacking and other difficulties.

Description

Robot grabbing method and system based on 6D pose estimation
Technical Field
The invention belongs to the technical field of computer vision, and relates to a robot grabbing method and system based on 6D pose estimation.
Background
According to the grasp representation, the application scenario and other factors, robot grasping tasks can be divided into 2D planar grasping and 6-DoF spatial grasping.
In 2D planar grasping, the target object lies in a plane and the gripper approaches perpendicular to that plane. In this case, the pose information of the target object consists of a 2D position in the plane and a 1D rotation angle. Most 2D planar grasping methods use an oriented rectangle as the grasp configuration: a large number of 2D grasp candidate boxes are first generated and then further optimized to obtain the final grasp box. Planar grasping is limited to grasping from a single direction and is therefore not suitable for grasping from arbitrary angles; if the robot is required to grasp more flexibly, the 6-DoF grasp pose of the object must be acquired.
The core task of 6-DoF spatial grasping is to estimate the 6D pose of the target object. 6D pose estimation refers to estimating the 3D position [x, y, z] and the 3D orientation [Rx, Ry, Rz] of the object in the camera coordinate system, which together describe the spatial state of the object. In a real environment, object pose estimation encounters problems such as illumination changes, background interference and noise, which makes accurate and fast robotic grasping highly challenging.
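For illustration only, the following minimal Python sketch shows how such a 6D pose (a 3D position plus a 3D rotation) can be assembled into a 4×4 homogeneous transform. Treating [Rx, Ry, Rz] as an axis-angle rotation vector, and the function name pose_to_matrix, are assumptions made for this sketch; the patent does not state a rotation convention.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def pose_to_matrix(position_xyz, rotation_rxryrz):
    """Assemble a 4x4 homogeneous transform from a 6D pose.

    position_xyz:   3D position [x, y, z] in the camera frame.
    rotation_rxryrz: 3D rotation [Rx, Ry, Rz], treated here as an axis-angle
                     (rotation-vector) representation; this is an assumption.
    """
    T = np.eye(4)
    T[:3, :3] = Rotation.from_rotvec(rotation_rxryrz).as_matrix()
    T[:3, 3] = position_xyz
    return T

# Example: an object 0.5 m in front of the camera, rotated 90 degrees about Z.
T_cam_obj = pose_to_matrix([0.0, 0.0, 0.5], [0.0, 0.0, np.pi / 2])
print(T_cam_obj)
```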
A robot grasping method based on single-view keypoint voting (Computer Integrated Manufacturing Systems: 1-16 [2022-10-24], ISSN 1006-5911, CN 11-5946/TP) describes a method that obtains the pose information of an object by keypoint voting to realize robot grasping. That method has the following shortcomings: it trains and tests the 6D pose estimation network on a real data set, and collecting a real data set is tedious and requires a large investment of time; and it cannot grasp multiple similar objects, although most real environments, especially industrial environments, often contain many similar objects.
Disclosure of Invention
The invention aims to overcome the defects of the existing robot grabbing technology and provides a robot grabbing method and system based on 6D pose estimation.
In order to solve the technical problems, the following technical scheme is adopted.
The invention discloses a robot grasping method based on 6D pose estimation, which comprises the steps of constructing a robot grasping system based on 6D pose estimation, wherein the robot grasping system comprises a data set generation module, a pose estimation module, a robot control module and a corresponding network loss function; under the system, according to a single RGB image, the pose information of an object in the image is obtained through a 6D pose estimation network for a robot to grab, and the method comprises the following steps:
step 1, the data set generation module renders a synthetic data set from the three-dimensional model of the object and the background material information: using the BlenderProc tool, high-quality labeled images are rendered by randomly sampling the positions of the lights, the camera and the objects, and are used to train the 6D pose estimation network (a rendering sketch is given after step 3);
step 2, the pose estimation module inputs the data set into a 6D pose estimation network to be trained to obtain a model file, and predicts pose information of each object in the input RGB image according to the trained model file;
step 3, the robot control module completes the conversion from a camera coordinate system to a robot base coordinate system through robot hand-eye calibration, and sends a grabbing instruction to the robot to realize grabbing;
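The rendering in step 1 can be sketched roughly as follows, following the general shape of BlenderProc's BOP-style examples. The file paths, the object category id, the number of views and the sampling ranges are illustrative assumptions, and argument names of bproc.writer.write_bop may differ between BlenderProc versions.

```python
# Run with:  blenderproc run render_synthetic.py
import blenderproc as bproc   # must be imported before other modules
import numpy as np

bproc.init()

# Load an object model (path and category id are placeholders).
objs = bproc.loader.load_obj("models/obj_000001.obj")
for obj in objs:
    obj.set_cp("category_id", 1)

light = bproc.types.Light()
light.set_type("POINT")

bproc.renderer.enable_depth_output(activate_antialiasing=False)

for _ in range(25):                                   # number of random views (illustrative)
    # Randomly place and orient the object near the origin.
    for obj in objs:
        obj.set_location(np.random.uniform([-0.1, -0.1, 0.0], [0.1, 0.1, 0.0]))
        obj.set_rotation_euler(np.random.uniform([0.0, 0.0, 0.0], [2 * np.pi] * 3))

    # Random light position and intensity.
    light.set_location(np.random.uniform([-1.0, -1.0, 0.5], [1.0, 1.0, 1.5]))
    light.set_energy(np.random.uniform(100, 1000))

    # Sample a camera position on a shell around the scene and look at the objects.
    location = bproc.sampler.shell(center=[0, 0, 0], radius_min=0.4, radius_max=0.8,
                                   elevation_min=15, elevation_max=80)
    rotation = bproc.camera.rotation_from_forward_vec(bproc.object.compute_poi(objs) - location)
    bproc.camera.add_camera_pose(bproc.math.build_transformation_mat(location, rotation))

data = bproc.renderer.render()

# Write RGB images plus BOP-style json annotations (scene_camera.json, scene_gt.json, ...).
bproc.writer.write_bop("output/bop_data", target_objects=objs,
                       colors=data["colors"], depths=data["depth"],
                       color_file_format="JPEG")
```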
the network loss function is defined as geometric loss L G And pose loss L P
The synthetic data set comprises RGB images of each type of object, mask images of object outlines and three json files for respectively storing camera parameters, true pose annotation and object bounding box information;
the pose estimation network comprises: adding a ResNet module, an up-sampling module and a regression module of an ECA channel attention mechanism; the up-sampling module comprises a deconvolution block and two bilinear interpolation blocks; the deconvolution block comprises a deconvolution layer with convolution kernel size of 3x3 and step length of 2 and two convolution layers with convolution kernel size of 3x3 and step length of 1; the bilinear interpolation block comprises a bilinear upsampling layer with a proportionality coefficient of 2 and two convolution layers with convolution kernel sizes of 3 multiplied by 3 and step length of 1; the regression module comprises three convolution layers with convolution kernel size of 3 × 3 and step size of 2, two full-connected layers for flattening features, and two output units respectively representing three-dimensional rotation R 6d And three-dimensional translation t s The full connection layer of (3).
Specifically, the process of step 2 includes:
the pose estimation module firstly inputs an RGB image with the resolution of 640 x 480, cuts a target object according to a target detection result, enlarges the image to 256 x 256 by taking the object as a center, and then inputs the enlarged image into a pose estimation network according to the following steps:
(1) the amplified image is firstly downsampled by a ResNet module added with an ECA channel attention mechanism to obtain an 8 multiplied by 8 characteristic diagram;
(2) the feature map is up-sampled to 64 x 64 by an up-sampling module to obtain three geometric feature maps which are respectively object surface slice feature maps M SF Three-dimensional coordinate graph M 3D And a visible object mask map M mask (ii) a Wherein the three-dimensional coordinate map superimposes the three-dimensional space coordinate on the corresponding two-dimensional pixel coordinate to obtain a dense correspondence map M 2D-3D
(3) And inputting the surface fragment feature map and the dense corresponding map into a regression module, and directly regressing the 6D object pose.
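A minimal sketch of the cropping and resizing step described above, assuming a square crop around the detection box; the padding ratio and the zero-padding at image borders are assumptions, since the patent only states the input and output resolutions.

```python
import cv2
import numpy as np

def crop_and_resize(rgb, bbox, out_size=256, pad_ratio=1.5):
    """Crop the detected object from an RGB image (e.g. 640x480) and resize
    it to out_size x out_size, keeping the object centered.

    rgb:  HxWx3 uint8 image.
    bbox: (x, y, w, h) detection box in pixels.
    pad_ratio enlarges the box so some context is kept; the exact value is
    not stated in the patent, 1.5 is an assumption.
    """
    x, y, w, h = bbox
    cx, cy = x + w / 2.0, y + h / 2.0
    size = int(max(w, h) * pad_ratio)          # square crop around the center

    x0, y0 = int(cx - size / 2), int(cy - size / 2)
    x1, y1 = x0 + size, y0 + size

    # Pad with zeros if the square crop reaches outside the image.
    H, W = rgb.shape[:2]
    canvas = np.zeros((size, size, 3), dtype=rgb.dtype)
    sx0, sy0 = max(0, -x0), max(0, -y0)
    ix0, iy0 = max(0, x0), max(0, y0)
    ix1, iy1 = min(W, x1), min(H, y1)
    canvas[sy0:sy0 + (iy1 - iy0), sx0:sx0 + (ix1 - ix0)] = rgb[iy0:iy1, ix0:ix1]

    return cv2.resize(canvas, (out_size, out_size), interpolation=cv2.INTER_LINEAR)
```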
Specifically, the step 3 includes:
the robot control module first performs hand-eye calibration in the eye-to-hand (camera outside the hand) configuration to obtain the transformation between the robot base coordinate system and the camera coordinate system, i.e. the hand-eye calibration matrix; secondly, two points on the three-dimensional model of the object are selected as grasp points, and these are transformed by the rotation and translation matrix output by the pose estimation network and by the hand-eye calibration matrix into the actually required grasp points; then the rotation matrix is converted into a rotation vector, i.e. the grasp angle, through the Rodrigues formula; finally, after the grasp point and grasp angle are obtained, they are combined with the label information of the object and transmitted to the robot through network programming, and a grasp command is sent so that the robot grasps the objects on the desktop in sequence.
Specifically, the network loss function is defined as a pose loss L_P and a geometric loss L_G; the pose loss is defined as shown in equations (1) and (2):

L_P = L_R + L_C + L_z    (1)

L_R = avg_{x∈M} || \hat{R}x - \bar{R}x ||_1,   L_C = || (\hat{Δ}_x, \hat{Δ}_y) - (\bar{Δ}_x, \bar{Δ}_y) ||_1,   L_z = || \hat{Δ}_z - \bar{Δ}_z ||_1    (2)

where L_R denotes the loss of the rotation R, L_C denotes the loss of the scale-invariant 2D object center (Δ_x, Δ_y), L_z denotes the loss of the distance Δ_z, the hat and bar denote the predicted and ground-truth values respectively, x denotes a three-dimensional point on the model M, and the subscripts x, y, z denote the three axes of the camera coordinate system;

in the geometric loss, the three-dimensional coordinate map M_3D and the visible object mask map M_mask use the L1 loss and the object surface fragment feature map M_SF uses the cross-entropy loss CE, as shown in equation (3):

L_G = || \bar{M}_mask ⊙ (\hat{M}_3D - \bar{M}_3D) ||_1 + || \hat{M}_mask - \bar{M}_mask ||_1 + CE(\hat{M}_SF, \bar{M}_SF)    (3)

where the hat and bar denote the predicted and ground-truth values respectively and ⊙ denotes the Hadamard product.
The invention discloses a robot grasping system based on 6D pose estimation, which comprises:
the data set generation module is used to render a synthetic data set from information such as the three-dimensional model and the background materials of the object: using the BlenderProc tool and randomly sampling the positions of the lights, the camera and the objects, high-quality labeled images are rendered for training the 6D pose estimation network;
the pose estimation module inputs the data set into a 6D pose estimation network to be trained to obtain a model file, and predicts pose information of each object in the input RGB image according to the trained model file;
the robot control module completes the conversion from the camera coordinate system to the robot base coordinate system through robot hand-eye calibration, and sends a grasping instruction to the robot to realize grasping;
The network loss function is defined as a geometric loss L_G and a pose loss L_P.
The synthetic data set comprises RGB images of each type of object, mask images of object outlines and three json files for respectively storing camera parameters, true pose annotation and object bounding box information;
the pose estimation network comprises a ResNet module with an added ECA channel attention mechanism, an up-sampling module and a regression module; the up-sampling module comprises one deconvolution block and two bilinear interpolation blocks; the deconvolution block comprises one deconvolution layer with a 3×3 kernel and stride 2 and two convolution layers with 3×3 kernels and stride 1; each bilinear interpolation block comprises one bilinear up-sampling layer with scale factor 2 and two convolution layers with 3×3 kernels and stride 1; the regression module comprises three convolution layers with 3×3 kernels and stride 2, two fully connected layers used to flatten the features, and two separate fully connected output layers representing the three-dimensional rotation R_6d and the three-dimensional translation t_s respectively.
Further, the pose estimation module of the system specifically executes the following tasks:
First, an RGB image with a resolution of 640 × 480 is input; the target object is cropped according to the target detection result, the crop is resized to 256 × 256 centered on the object, and the enlarged image is then fed into the pose estimation network as follows:
(1) the enlarged image is first down-sampled by a ResNet module with an added ECA channel attention mechanism to obtain an 8 × 8 feature map;
(2) the feature map is up-sampled to 64 × 64 by the up-sampling module to obtain three geometric feature maps: an object surface fragment feature map M_SF, a three-dimensional coordinate map M_3D and a visible object mask map M_mask; the three-dimensional coordinate map superimposes the three-dimensional space coordinates on the corresponding two-dimensional pixel coordinates to obtain a dense 2D-3D correspondence map M_2D-3D;
(3) the surface fragment feature map and the dense correspondence map are input into the regression module, which directly regresses the 6D object pose.
Further, the operation process of the robot control module of the system is as follows:
First, hand-eye calibration is performed in the eye-to-hand (camera outside the hand) configuration to obtain the transformation between the robot base coordinate system and the camera coordinate system, i.e. the hand-eye calibration matrix; secondly, two points on the three-dimensional model of the object are selected as grasp points, and these are transformed by the rotation and translation matrix output by the pose estimation network and by the hand-eye calibration matrix into the actually required grasp points; then the rotation matrix is converted into a rotation vector, i.e. the grasp angle, through the Rodrigues formula; finally, after the grasp point and grasp angle are obtained, they are combined with the label information of the object and transmitted to the robot through network programming, and a grasp command is sent so that the robot grasps the objects on the desktop in sequence.
Further, the network loss function of the system is defined as a pose loss L_P and a geometric loss L_G; the definitions are shown in equations (1) and (2):

L_P = L_R + L_C + L_z    (1)

L_R = avg_{x∈M} || \hat{R}x - \bar{R}x ||_1,   L_C = || (\hat{Δ}_x, \hat{Δ}_y) - (\bar{Δ}_x, \bar{Δ}_y) ||_1,   L_z = || \hat{Δ}_z - \bar{Δ}_z ||_1    (2)

where L_R denotes the loss of the rotation R, L_C denotes the loss of the scale-invariant 2D object center (Δ_x, Δ_y), L_z denotes the loss of the distance Δ_z, the hat and bar denote the predicted and ground-truth values respectively, x denotes a three-dimensional point on the model M, and the subscripts x, y, z denote the three axes of the camera coordinate system;

in the geometric loss, the three-dimensional coordinate map M_3D and the visible object mask map M_mask use the L1 loss and the object surface fragment feature map M_SF uses the cross-entropy loss CE, as shown in equation (3):

L_G = || \bar{M}_mask ⊙ (\hat{M}_3D - \bar{M}_3D) ||_1 + || \hat{M}_mask - \bar{M}_mask ||_1 + CE(\hat{M}_SF, \bar{M}_SF)    (3)

where the hat and bar denote the predicted and ground-truth values respectively and ⊙ denotes the Hadamard product.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The method renders a synthetic data set from the object's three-dimensional model, background materials and other information, and uses this synthetic data set to train and test the 6D pose estimation network. Collecting and labeling real data sets often requires a large amount of time, and labeling errors can degrade the network training result; the synthetic data set avoids both problems.
2. The invention provides a novel 6D pose estimation network. The backbone is ResNet-101 with an added attention module, which can improve the performance of CNN architectures of various depths. The efficient channel attention (ECA) mechanism is introduced with a local cross-channel information interaction strategy, so that the network focuses on the features that are most important for the task. This improves the performance of the 6D pose estimation network, makes the pose estimation results more accurate, and makes the method better suited to robot grasping.
3. The robot grasping system can grasp not only a single target but also multiple targets and multiple targets of the same kind, can effectively solve grasping problems in complex situations such as occlusion and stacking, and meets the requirements of robot grasping in real scenes. Experiments show that the average grasp success rate of the invention is 97% in the single-target scene, 95% in the scattered multi-target scene and 92% in the occluded (stacked) multi-target scene.
4. Compared with a grabbing system using binocular vision and needing RGB-D images, the robot grabbing system disclosed by the invention does not need depth information and simplifies the system design.
Drawings
Fig. 1 is a structural block diagram of a robot grasping system based on 6D pose estimation according to the present invention.
Fig. 2 is a diagram of a 6D pose estimation network according to an embodiment of the present invention.
Fig. 3 is a visualization result of the object pose in the real scene according to an embodiment of the invention.
Fig. 4 is a schematic view of a robot gripping process according to an embodiment of the present invention.
Detailed Description
The invention relates to a robot grasping method and system based on 6D pose estimation. The system comprises a data set generation module, a pose estimation module, a robot control module and a corresponding network loss function. Within this system, the data set generation module renders a synthetic data set from the object's three-dimensional model, background materials and other information; the pose estimation module inputs the data set into the pose estimation network for training to obtain a model file, and can predict the pose information of each object in an input RGB image according to the trained model file; and the robot control module completes the conversion from the camera coordinate system to the robot base coordinate system through robot hand-eye calibration and sends a grasping instruction to the robot to realize grasping. The invention provides a robot grasping technology that can handle occlusion, multiple similar targets, stacking and other difficulties.
The present invention will be described in further detail with reference to the accompanying drawings.
Fig. 1 is a structural block diagram of a robot grasping system based on 6D pose estimation according to the present invention. The invention aims to obtain the pose information of an object in a picture through a 6D pose estimation network according to a single RGB image for a robot to grab. As shown in fig. 1, the present invention includes:
a dataset generation module: rendering according to the three-dimensional model of the object, the background material and other information to generate a synthetic data set;
the pose estimation module: inputting the data set into the 6D pose estimation network for training to obtain a model file, and predicting the pose information of each object in the input RGB image according to the trained model file;
a robot control module: completes the conversion from the camera coordinate system to the robot base coordinate system through robot hand-eye calibration, and sends a grasping instruction to the robot to realize grasping.
Network loss function: defined as a geometric loss (L_G) and a pose loss (L_P).
A dataset generation module: according to information such as a three-dimensional model and background materials of an object, a blenderProc tool is used, and through random sampling of light, a camera and the position of an object, a high-quality image with a label can be rendered and used for training and testing a 6D pose estimation network. The synthetic data set can replace the work of manually marking the data set, the efficiency of data set collection work is improved, and the accuracy is high. The data set comprises seven types of objects, each object comprises 1000 pieces of RGB images and mask images of object outlines, and three json files which respectively store camera parameters, true pose labels and object bounding box information. The camera parameters are saved in scene _ camera.json, where cam _ K represents the 3 × 3 camera intrinsic parameter matrix. A true pose annotation is saved in scene _ gt.json file, where cam _ R _ m2c represents a 3 × 3 rotation matrix; cam _ t m2c represents a translation vector of 3 × 1; obj _ id denotes a category. Json file, wherein bbox _ obj represents the bounding box of the object outline, and the form is (x, y, w, h), (x, y) is the coordinate value of the upper left corner of the bounding box, and w and h are the width and height of the bounding box respectively; bbox _ visib represents a bounding box of the visible part of the object outline; px _ count _ all represents the number of pixels in the object contour; px _ count _ visib represents the number of pixels of the visible part of the object outline; visib _ fract represents the percentage of the visible portion of the object outline.
A pose estimation module: first, an RGB image with a resolution of 640 × 480 is input; the target object is cropped according to the target detection result, the crop is resized to 256 × 256 centered on the object, and it is then fed into the 6D pose estimation network. The network comprises three parts, shown in detail in Fig. 2 and described below:
(1) The enlarged image is first down-sampled by a ResNet module with an added ECA channel attention mechanism to obtain an 8 × 8 feature map. The ECA module first applies global average pooling to each channel independently and then generates the channel weights through a one-dimensional convolution with kernel size 5 followed by a Sigmoid function. This attention module does not reduce the channel dimension, and it only considers the neighbors of each channel when performing cross-channel information interaction, which reduces its computational cost, improves the overall running speed of the network, and preserves accuracy.
(2) The up-sampling module contains one deconvolution block and two bilinear interpolation blocks. The deconvolution block includes one deconvolution layer with a 3×3 kernel and stride 2 and two convolution layers with 3×3 kernels and stride 1. Each bilinear interpolation block comprises one bilinear up-sampling layer with scale factor 2 and two convolution layers with 3×3 kernels and stride 1. The module up-samples the feature map to 64 × 64 to obtain three geometric feature maps: an object surface fragment feature map (M_SF), a three-dimensional coordinate map (M_3D) and a visible object mask map (M_mask). The three-dimensional coordinate map superimposes the three-dimensional space coordinates on the corresponding two-dimensional pixel coordinates to obtain a dense 2D-3D correspondence map (M_2D-3D).
(3) The surface fragment feature map and the dense correspondence map are input into the regression module, which directly regresses the 6D object pose. The regression module comprises three convolution layers with 3×3 kernels and stride 2, two fully connected layers used to flatten the features, and two further fully connected layers that output the three-dimensional rotation (R_6d) and the three-dimensional translation (t_s) respectively. A sketch of these blocks is given after this list.
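The following PyTorch sketch illustrates the blocks just described: the ECA channel attention, the deconvolution and bilinear up-sampling blocks, and the regression head with separate R_6d and t_s outputs. Channel widths, the hidden size of the fully connected layers and the padding choices are assumptions not specified in the patent.

```python
import torch
import torch.nn as nn

def conv3x3(c_in, c_out, stride=1):
    """3x3 convolution followed by ReLU."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride, 1), nn.ReLU(inplace=True))

class ECA(nn.Module):
    """Efficient Channel Attention: global average pooling per channel, a 1-D
    convolution with kernel size 5 across channels, then a Sigmoid."""
    def __init__(self, kernel_size=5):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                              # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                         # (B, C) channel descriptor
        w = torch.sigmoid(self.conv(w.unsqueeze(1)).squeeze(1))
        return x * w[:, :, None, None]                 # reweight the feature maps

class DeconvBlock(nn.Module):
    """One 3x3 stride-2 deconvolution plus two 3x3 stride-1 convolutions."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.up = nn.ConvTranspose2d(c_in, c_out, 3, stride=2, padding=1, output_padding=1)
        self.refine = nn.Sequential(conv3x3(c_out, c_out), conv3x3(c_out, c_out))

    def forward(self, x):
        return self.refine(self.up(x))

class BilinearBlock(nn.Module):
    """Bilinear x2 up-sampling plus two 3x3 stride-1 convolutions."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.refine = nn.Sequential(conv3x3(c_in, c_out), conv3x3(c_out, c_out))

    def forward(self, x):
        return self.refine(self.up(x))

class RegressionHead(nn.Module):
    """Three 3x3 stride-2 convolutions, two flattening FC layers, and two
    separate FC outputs for the rotation R_6d and the translation t_s."""
    def __init__(self, c_in, spatial=64, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(conv3x3(c_in, 128, 2), conv3x3(128, 128, 2), conv3x3(128, 128, 2))
        flat = 128 * (spatial // 8) * (spatial // 8)
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(flat, hidden), nn.ReLU(inplace=True),
                                nn.Linear(hidden, hidden), nn.ReLU(inplace=True))
        self.fc_r = nn.Linear(hidden, 6)               # 6-D rotation representation
        self.fc_t = nn.Linear(hidden, 3)               # scale-invariant translation

    def forward(self, x):
        f = self.fc(self.conv(x))
        return self.fc_r(f), self.fc_t(f)
```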
A robot control module: the vision system obtains the grasp pose of the target object in the camera coordinate system using the pose estimation algorithm; however, to realize the robot grasping task, the transformation between the robot base coordinate system and the camera coordinate system must be known. Depending on the positional relationship between the robot and the camera, hand-eye calibration takes one of two forms: eye-in-hand or eye-to-hand. The invention calibrates in the eye-to-hand configuration (camera outside the hand) and obtains the transformation between the robot base coordinate system and the camera coordinate system, i.e. the hand-eye calibration matrix. Next, two points on the three-dimensional model of the object are selected as grasp points; these are transformed by the rotation and translation matrix output by the 6D pose estimation network and by the hand-eye calibration matrix into the actually required grasp points. The rotation matrix is then converted into a rotation vector, i.e. the grasp angle, through the Rodrigues formula. Finally, after the grasp point and grasp angle are obtained, they are combined with the label information of the object and transmitted to the robot through network programming; a grasp command is sent so that the robot grasps the objects on the desktop in sequence.
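A minimal sketch of the coordinate transformations in this module: a grasp point defined on the object model is moved into the camera frame by the estimated pose, then into the robot base frame by the hand-eye calibration matrix, and the grasp angle is obtained with OpenCV's Rodrigues conversion. Variable names and the absence of any gripper-specific offset are assumptions of this sketch.

```python
import cv2
import numpy as np

def grasp_in_base(R_cam_obj, t_cam_obj, T_base_cam, p_obj):
    """Transform a grasp point defined on the object model into the robot
    base frame and compute the grasp angle as a rotation vector.

    R_cam_obj, t_cam_obj : 3x3 rotation and 3-vector translation from the
                           pose estimation network (object -> camera).
    T_base_cam           : 4x4 hand-eye calibration matrix (camera -> base),
                           obtained from eye-to-hand calibration.
    p_obj                : 3-D grasp point selected on the object model.
    """
    # Object frame -> camera frame.
    p_cam = R_cam_obj @ np.asarray(p_obj, dtype=float) + np.asarray(t_cam_obj, dtype=float)

    # Camera frame -> robot base frame via the hand-eye matrix.
    p_base = (T_base_cam @ np.append(p_cam, 1.0))[:3]

    # Grasp orientation: rotate the object orientation into the base frame and
    # convert the rotation matrix to a rotation vector with the Rodrigues formula.
    R_base_obj = T_base_cam[:3, :3] @ R_cam_obj
    rvec, _ = cv2.Rodrigues(R_base_obj)

    return p_base, rvec.ravel()
```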
Network loss function: defined as a geometric loss (L_G) and a pose loss (L_P). The pose loss is defined as shown in equations (1) and (2):

L_P = L_R + L_C + L_z    (1)

L_R = avg_{x∈M} || \hat{R}x - \bar{R}x ||_1,   L_C = || (\hat{Δ}_x, \hat{Δ}_y) - (\bar{Δ}_x, \bar{Δ}_y) ||_1,   L_z = || \hat{Δ}_z - \bar{Δ}_z ||_1    (2)

where L_R denotes the loss of the rotation R, L_C denotes the loss of the scale-invariant 2D object center (Δ_x, Δ_y), L_z denotes the loss of the distance Δ_z, the hat and bar denote the predicted and ground-truth values respectively, x denotes a three-dimensional point on the model M, and the subscripts x, y, z denote the three axes of the camera coordinate system.

In the geometric loss, the three-dimensional coordinate map (M_3D) and the visible object mask map (M_mask) use the L1 loss and the object surface fragment feature map (M_SF) uses the cross-entropy loss (CE), as shown in equation (3):

L_G = || \bar{M}_mask ⊙ (\hat{M}_3D - \bar{M}_3D) ||_1 + || \hat{M}_mask - \bar{M}_mask ||_1 + CE(\hat{M}_SF, \bar{M}_SF)    (3)

where the hat and bar denote the predicted and ground-truth values respectively and ⊙ denotes the Hadamard product.
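A sketch of how these loss terms could be written in PyTorch, consistent with the stated definitions: an L1 rotation loss averaged over model points, L1 losses on the 2D center and distance offsets, L1 losses on the coordinate and mask maps, and cross-entropy on the surface fragments. Equal weighting of the terms and masking the coordinate loss by the ground-truth visible mask are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def pose_loss(R_pred, R_gt, center_pred, center_gt, z_pred, z_gt, model_points):
    """Pose loss L_P = L_R + L_C + L_z (equation (1)).

    model_points: (N, 3) points x on the 3-D model M. L_R averages the L1
    distance between the points rotated by the predicted and ground-truth
    rotations; L_C and L_z are L1 losses on the scale-invariant 2-D center
    offset and on the distance offset.
    """
    L_R = (model_points @ R_pred.T - model_points @ R_gt.T).abs().mean()
    L_C = F.l1_loss(center_pred, center_gt)
    L_z = F.l1_loss(z_pred, z_gt)
    return L_R + L_C + L_z

def geometric_loss(m3d_pred, m3d_gt, mask_pred, mask_gt, msf_logits, msf_gt):
    """Geometric loss L_G (equation (3)): L1 on the visible-mask-weighted
    3-D coordinate map and on the mask, cross-entropy on the surface
    fragment map."""
    L_3d   = F.l1_loss(mask_gt * m3d_pred, mask_gt * m3d_gt)   # Hadamard masking
    L_mask = F.l1_loss(mask_pred, mask_gt)
    L_sf   = F.cross_entropy(msf_logits, msf_gt)               # fragment class labels
    return L_3d + L_mask + L_sf
```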
In summary, the present invention provides a new framework for robot grasping that comprises three modules: a data set generation module, a pose estimation module and a robot control module. The data set generation module renders a synthetic data set from the object's three-dimensional model, background materials and other information; the pose estimation module inputs the data set into the 6D pose estimation network for training to obtain a model file, and can predict the pose information of each object in an input RGB image according to the trained model file; and the robot control module completes the conversion from the camera coordinate system to the robot base coordinate system through robot hand-eye calibration and sends a grasping instruction to the robot. The technology is applicable to production, service, logistics, medical and other fields, can accurately grasp target objects from cluttered scenes, provides technical support for the popularization and industrialization of robot grasping technology, and has broad market prospects.
Experimental verification of the invention shows that:
the 6D pose estimation network can realize accurate pose estimation from a single RGB image. To compare the performance of this method, one chooses to compare it with other 6D pose estimation networks using only a single RGB image on the Linemod dataset, the Linemod occupancy dataset, and the YCB-Video dataset.
On the Linemod data set, the batch size is set to 24, the learning rate to 1e-4 and the number of iterations to 160, and a GTX-3090 graphics card is used to train the 6D pose estimation network. The method is compared with Tekin et al., Pix2Pose, Self6D and GDR-Net using the ADD(-S) metric, as shown in Table 1.
TABLE 1 comparison of Linemod with respect to ADD (-S) index
[The values of Table 1 are available only as an image in the original document.]
On the Linemod Occlusion data set, the batch size is set to 24, the learning rate to 1e-4 and the number of iterations to 40, and a GTX-3090 graphics card is used to train the 6D pose estimation network. The method of the invention is compared with PoseCNN, Pix2Pose, Self6D, GDR-Net and others using the ADD(-S) metric, as shown in Table 2.
TABLE 2 comparison of Linemod Occlusion on ADD (-S) index
[The values of Table 2 are available only as an image in the original document.]
On the YCB-Video data set, the batch size is set to 24, the learning rate to 1e-4 and the number of iterations to 10, and a GTX-3090 graphics card is used to train the 6D pose estimation network. YCB-Video is a challenging data set containing multiple symmetric objects, occlusion and clutter. The method is compared with the PoseCNN and GDR-Net networks using the ADD-S metric and the AUC of ADD-S/ADD(-S) metric. The AUC of ADD-S/ADD(-S) is the area under the accuracy-threshold curve, obtained by varying the distance threshold with a maximum threshold of 10 cm. Table 3 gives a detailed evaluation of all 21 objects, where * denotes a symmetric object.
TABLE 3 comparison on YCB-Video
[The values of Table 3 are available only as an image in the original document.]
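For reference, the ADD(-S) and AUC metrics used in Tables 1-3 can be computed roughly as follows. This is a generic sketch of the common definitions rather than code from the patent; the 10 cm maximum threshold follows the text above, and model points are assumed to be given in metres.

```python
import numpy as np

def add_metric(R_pred, t_pred, R_gt, t_gt, model_points):
    """ADD: mean distance between model points under the predicted and
    ground-truth poses."""
    p1 = model_points @ R_pred.T + t_pred
    p2 = model_points @ R_gt.T + t_gt
    return np.linalg.norm(p1 - p2, axis=1).mean()

def adds_metric(R_pred, t_pred, R_gt, t_gt, model_points):
    """ADD-S (for symmetric objects): mean closest-point distance."""
    p1 = model_points @ R_pred.T + t_pred
    p2 = model_points @ R_gt.T + t_gt
    # For each ground-truth point, distance to the nearest predicted point
    # (O(N^2) memory; fine for a sketch with a few thousand model points).
    d = np.linalg.norm(p2[:, None, :] - p1[None, :, :], axis=2)
    return d.min(axis=1).mean()

def auc_of_add(errors, max_threshold=0.10, steps=100):
    """Area under the accuracy/threshold curve, thresholds up to 10 cm."""
    errors = np.asarray(errors)
    thresholds = np.linspace(0.0, max_threshold, steps)
    accuracy = [(errors < th).mean() for th in thresholds]
    return np.trapz(accuracy, thresholds) / max_threshold
```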
From the pose estimation experimental results, the 6D pose estimation network is superior to other methods on most objects, and the average index also performs best.
To verify the grasping feasibility of the invention, multiple sets of experiments were performed in real scenes; Fig. 3 shows the visualized object poses in a multi-target scene, a similar-multi-target scene and an occlusion (stacking) scene. Under sufficient illumination, grasping experiments were carried out on seven objects: a ball, a chinchilla ("dragon cat") toy, a duck, a rabbit, a square part, a triangular part and a round part. The grasping process is shown in Fig. 4, and the grasp success rates in the single-target, scattered multi-target and occluded (stacked) multi-target scenes are shown in Table 4.
TABLE 4 actual scene Capture success Rate
[The values of Table 4 are available only as an image in the original document.]
The robot grasping system achieves an average grasp success rate of 97% in the single-target scene, 95% in the scattered multi-target scene and 92% in the occluded (stacked) multi-target scene.

Claims (8)

1. A robot grabbing method based on 6D pose estimation is characterized in that a robot grabbing system based on 6D pose estimation is constructed, and the robot grabbing system comprises a data set generation module, a pose estimation module, a robot control module and a corresponding network loss function; under the system, according to a single RGB image, the pose information of an object in the image is obtained through a 6D pose estimation network for the robot to grab, and the method comprises the following steps:
step 1, the data set generating module generates a synthetic data set according to the three-dimensional model of the object and the background material information by rendering: using a BlenderProc tool to render a high-quality image with labels through randomly sampling the positions of lamplight, a camera and an object for training a 6D pose estimation network;
step 2, the pose estimation module inputs the data set into a 6D pose estimation network to be trained to obtain a model file, and predicts pose information of each object in the input RGB image according to the trained model file;
step 3, the robot control module completes the conversion from the camera coordinate system to the robot base coordinate system through robot hand-eye calibration, and sends a grabbing instruction to the robot to realize grabbing;
the network loss function is defined as geometric loss L G And pose loss L P
The synthetic data set comprises an RGB image of each type of object, a mask image of the object outline and three json files for respectively storing camera parameters, true pose annotation and object bounding box information;
the pose estimation network comprises: adding a ResNet module, an up-sampling module and a regression module of an ECA channel attention mechanism; the up-sampling module comprises an anti-convolution block and two bilinear interpolation blocks; the deconvolution block comprises a deconvolution layer with convolution kernel size of 3 × 3 and step size of 2 and two convolution layers with convolution kernel size of 3 × 3 and step size of 1; the bilinear interpolation block comprises a bilinear up-sampling layer with a proportionality coefficient of 2 and two convolution layers with convolution kernel size of 3 multiplied by 3 and step length of 1; the regression module comprises three convolution layers with convolution kernel size of 3 × 3 and step size of 2, two full-connected layers for flattening features, and two output units respectively representing three-dimensional rotation R 6d And three-dimensional translation t s The full interconnect layer of (1).
2. The robot grasping method based on the 6D pose estimation network according to claim 1, wherein the process of the step 2 comprises:
the pose estimation module first inputs an RGB image with a resolution of 640 × 480, crops the target object according to the target detection result, resizes the crop to 256 × 256 centered on the object, and then feeds the enlarged image into the pose estimation network as follows:
(1) the enlarged image is first down-sampled by a ResNet module with an added ECA channel attention mechanism to obtain an 8 × 8 feature map;
(2) the feature map is up-sampled to 64 × 64 by the up-sampling module to obtain three geometric feature maps: an object surface fragment feature map M_SF, a three-dimensional coordinate map M_3D and a visible object mask map M_mask; the three-dimensional coordinate map superimposes the three-dimensional space coordinates on the corresponding two-dimensional pixel coordinates to obtain a dense 2D-3D correspondence map M_2D-3D;
(3) the surface fragment feature map and the dense correspondence map are input into the regression module, which directly regresses the 6D object pose.
3. The robot grasping method based on the 6D pose estimation network according to claim 1, wherein the step 3 specifically includes:
the robot control module first performs hand-eye calibration in the eye-to-hand (camera outside the hand) configuration to obtain the transformation between the robot base coordinate system and the camera coordinate system, i.e. the hand-eye calibration matrix; secondly, two points on the three-dimensional model of the object are selected as grasp points, and these are transformed by the rotation and translation matrix output by the pose estimation network and by the hand-eye calibration matrix into the actually required grasp points; then the rotation matrix is converted into a rotation vector, i.e. the grasp angle, through the Rodrigues formula; finally, after the grasp point and grasp angle are obtained, they are combined with the label information of the object and transmitted to the robot through network programming, and a grasp command is sent so that the robot grasps the objects on the desktop in sequence.
4. The robot grasping method based on the 6D pose estimation network according to claim 1, wherein the network loss function is defined as a pose loss L_P and a geometric loss L_G; the pose loss is defined as shown in equations (1) and (2):

L_P = L_R + L_C + L_z    (1)

L_R = avg_{x∈M} || \hat{R}x - \bar{R}x ||_1,   L_C = || (\hat{Δ}_x, \hat{Δ}_y) - (\bar{Δ}_x, \bar{Δ}_y) ||_1,   L_z = || \hat{Δ}_z - \bar{Δ}_z ||_1    (2)

where L_R denotes the loss of the rotation R, L_C denotes the loss of the scale-invariant 2D object center (Δ_x, Δ_y), L_z denotes the loss of the distance Δ_z, the hat and bar denote the predicted and ground-truth values respectively, x denotes a three-dimensional point on the model M, and the subscripts x, y, z denote the three axes of the camera coordinate system;

in the geometric loss, the three-dimensional coordinate map M_3D and the visible object mask map M_mask use the L1 loss and the object surface fragment feature map M_SF uses the cross-entropy loss CE, as shown in equation (3):

L_G = || \bar{M}_mask ⊙ (\hat{M}_3D - \bar{M}_3D) ||_1 + || \hat{M}_mask - \bar{M}_mask ||_1 + CE(\hat{M}_SF, \bar{M}_SF)    (3)

where the hat and bar denote the predicted and ground-truth values respectively and ⊙ denotes the Hadamard product.
5. A robot gripper system based on 6D pose estimation is characterized by comprising:
the data set generating module generates a synthetic data set according to the three-dimensional model of the object and the background material information by rendering: using a BlenderProc tool to render a high-quality image with labels through randomly sampling the positions of lamplight, a camera and an object for training a 6D pose estimation network;
the pose estimation module inputs the data set into a 6D pose estimation network to be trained to obtain a model file, and predicts pose information of each object in the input RGB image according to the trained model file;
the robot control module completes the conversion from the camera coordinate system to the robot base coordinate system through robot hand-eye calibration, and sends a grabbing instruction to the robot to realize grabbing;
the network loss function is defined as geometric loss L G And pose loss L P
The synthetic data set comprises RGB images of each type of object, mask images of object outlines and three json files for respectively storing camera parameters, true pose annotation and object bounding box information;
the pose estimation network comprises a ResNet module with an added ECA channel attention mechanism, an up-sampling module and a regression module; the up-sampling module comprises one deconvolution block and two bilinear interpolation blocks; the deconvolution block comprises one deconvolution layer with a 3×3 kernel and stride 2 and two convolution layers with 3×3 kernels and stride 1; each bilinear interpolation block comprises one bilinear up-sampling layer with scale factor 2 and two convolution layers with 3×3 kernels and stride 1; the regression module comprises three convolution layers with 3×3 kernels and stride 2, two fully connected layers used to flatten the features, and two separate fully connected output layers representing the three-dimensional rotation R_6d and the three-dimensional translation t_s respectively.
6. The robot gripping system based on 6D pose estimation according to claim 5, wherein the pose estimation module specifically performs the following tasks:
First, an RGB image with a resolution of 640 × 480 is input; the target object is cropped according to the target detection result, the crop is resized to 256 × 256 centered on the object, and the enlarged image is then fed into the pose estimation network as follows:
(1) the enlarged image is first down-sampled by a ResNet module with an added ECA channel attention mechanism to obtain an 8 × 8 feature map;
(2) the feature map is up-sampled to 64 × 64 by the up-sampling module to obtain three geometric feature maps: an object surface fragment feature map M_SF, a three-dimensional coordinate map M_3D and a visible object mask map M_mask; the three-dimensional coordinate map superimposes the three-dimensional space coordinates on the corresponding two-dimensional pixel coordinates to obtain a dense 2D-3D correspondence map M_2D-3D;
(3) the surface fragment feature map and the dense correspondence map are input into the regression module, which directly regresses the 6D object pose.
7. The robot grabbing system based on 6D pose estimation according to claim 5, wherein the robot control module operates as follows:
First, hand-eye calibration is performed in the eye-to-hand (camera outside the hand) configuration to obtain the transformation between the robot base coordinate system and the camera coordinate system, i.e. the hand-eye calibration matrix; secondly, two points on the three-dimensional model of the object are selected as grasp points, and these are transformed by the rotation and translation matrix output by the pose estimation network and by the hand-eye calibration matrix into the actually required grasp points; then the rotation matrix is converted into a rotation vector, i.e. the grasp angle, through the Rodrigues formula; finally, after the grasp point and grasp angle are obtained, they are combined with the label information of the object and transmitted to the robot through network programming, and a grasp command is sent so that the robot grasps the objects on the desktop in sequence.
8. The robot grabbing system based on 6D pose estimation according to claim 5, wherein the network loss function is defined as a pose loss L_P and a geometric loss L_G; the pose loss is defined as shown in equations (1) and (2):

L_P = L_R + L_C + L_z    (1)

L_R = avg_{x∈M} || \hat{R}x - \bar{R}x ||_1,   L_C = || (\hat{Δ}_x, \hat{Δ}_y) - (\bar{Δ}_x, \bar{Δ}_y) ||_1,   L_z = || \hat{Δ}_z - \bar{Δ}_z ||_1    (2)

where L_R denotes the loss of the rotation R, L_C denotes the loss of the scale-invariant 2D object center (Δ_x, Δ_y), L_z denotes the loss of the distance Δ_z, the hat and bar denote the predicted and ground-truth values respectively, x denotes a three-dimensional point on the model M, and the subscripts x, y, z denote the three axes of the camera coordinate system;

in the geometric loss, the three-dimensional coordinate map M_3D and the visible object mask map M_mask use the L1 loss and the object surface fragment feature map M_SF uses the cross-entropy loss CE, as shown in equation (3):

L_G = || \bar{M}_mask ⊙ (\hat{M}_3D - \bar{M}_3D) ||_1 + || \hat{M}_mask - \bar{M}_mask ||_1 + CE(\hat{M}_SF, \bar{M}_SF)    (3)

where the hat and bar denote the predicted and ground-truth values respectively and ⊙ denotes the Hadamard product.
CN202211376147.1A 2022-11-04 2022-11-04 Robot grabbing method and system based on 6D pose estimation Pending CN115641322A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211376147.1A CN115641322A (en) 2022-11-04 2022-11-04 Robot grabbing method and system based on 6D pose estimation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211376147.1A CN115641322A (en) 2022-11-04 2022-11-04 Robot grabbing method and system based on 6D pose estimation

Publications (1)

Publication Number Publication Date
CN115641322A true CN115641322A (en) 2023-01-24

Family

ID=84948875

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211376147.1A Pending CN115641322A (en) 2022-11-04 2022-11-04 Robot grabbing method and system based on 6D pose estimation

Country Status (1)

Country Link
CN (1) CN115641322A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116664843A (en) * 2023-06-05 2023-08-29 北京信息科技大学 Residual fitting grabbing detection network based on RGBD image and semantic segmentation
CN116664843B (en) * 2023-06-05 2024-02-20 北京信息科技大学 Residual fitting grabbing detection network based on RGBD image and semantic segmentation
CN117237451A (en) * 2023-09-15 2023-12-15 南京航空航天大学 Industrial part 6D pose estimation method based on contour reconstruction and geometric guidance
CN117237451B (en) * 2023-09-15 2024-04-02 南京航空航天大学 Industrial part 6D pose estimation method based on contour reconstruction and geometric guidance

Similar Documents

Publication Publication Date Title
CN109870983B (en) Method and device for processing tray stack image and system for warehousing goods picking
CN111179324B (en) Object six-degree-of-freedom pose estimation method based on color and depth information fusion
CN110674829B (en) Three-dimensional target detection method based on graph convolution attention network
CN115641322A (en) Robot grabbing method and system based on 6D pose estimation
CN111553949B (en) Positioning and grabbing method for irregular workpiece based on single-frame RGB-D image deep learning
JP2017120648A (en) Reconstructing 3d modeled object
CN111368852A (en) Article identification and pre-sorting system and method based on deep learning and robot
CN112836734A (en) Heterogeneous data fusion method and device and storage medium
CN114952809B (en) Workpiece identification and pose detection method, system and mechanical arm grabbing control method
CN114724120B (en) Vehicle target detection method and system based on radar vision semantic segmentation adaptive fusion
US20220415030A1 (en) AR-Assisted Synthetic Data Generation for Training Machine Learning Models
CN113159232A (en) Three-dimensional target classification and segmentation method
CN112712589A (en) Plant 3D modeling method and system based on laser radar and deep learning
CN114882109A (en) Robot grabbing detection method and system for sheltering and disordered scenes
CN115238758A (en) Multi-task three-dimensional target detection method based on point cloud feature enhancement
CN115578460A (en) Robot grabbing method and system based on multi-modal feature extraction and dense prediction
WO2021167586A1 (en) Systems and methods for object detection including pose and size estimation
CN114998578A (en) Dynamic multi-article positioning, grabbing and packaging method and system
CN114119753A (en) Transparent object 6D attitude estimation method facing mechanical arm grabbing
CN108898679A (en) A kind of method of component serial number automatic marking
EP3905130A1 (en) Computer-implemented method for 3d localization of an object based on image data and depth data
EP3905107A1 (en) Computer-implemented method for 3d localization of an object based on image data and depth data
JP6016242B2 (en) Viewpoint estimation apparatus and classifier learning method thereof
CN112975957A (en) Target extraction method, system, robot and storage medium
CN117011380A (en) 6D pose estimation method of target object

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination