CN115641322A - Robot grabbing method and system based on 6D pose estimation - Google Patents

Robot grabbing method and system based on 6D pose estimation

Info

Publication number
CN115641322A
Authority
CN
China
Prior art keywords
robot
grabbing
pose estimation
pose
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211376147.1A
Other languages
Chinese (zh)
Inventor
史金龙
顾健
钱强
钱萍
欧镇
田朝晖
於跃成
白素琴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University of Science and Technology
Original Assignee
Jiangsu University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Jiangsu University of Science and Technology filed Critical Jiangsu University of Science and Technology
Priority to CN202211376147.1A priority Critical patent/CN115641322A/en
Publication of CN115641322A publication Critical patent/CN115641322A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a robot grasping method and system based on 6D pose estimation. The method comprises constructing a robot grasping system based on 6D pose estimation, the system comprising a data set generation module, a pose estimation module, a robot control module and a corresponding network loss function. The method comprises the following steps: the data set generation module renders a synthetic data set from the three-dimensional model of the object and the background material information; the pose estimation module inputs the data set into a 6D pose estimation network for training to obtain a model file, and predicts the pose information of each object in an input RGB image according to the trained model file; and the robot control module completes the conversion from the camera coordinate system to the robot base coordinate system through robot hand-eye calibration, and sends a grasping instruction to the robot to realize grasping. The invention provides a robot grasping technology that can handle occlusion, multiple similar targets, stacking and other difficulties.

Description

Robot grabbing method and system based on 6D pose estimation
Technical Field
The invention belongs to the technical field of computer vision, and relates to a robot grabbing method and system based on 6D pose estimation.
Background
According to the grasp representation, the application scenario and other factors, robot grasping tasks can be divided into 2D planar grasping and 6-DoF spatial grasping.
In 2D planar grasping, the target object lies in a plane and the gripper approaches perpendicular to that plane. In this case, the pose information of the target object consists of a 2D position in the plane and a 1D rotation angle. Most 2D planar grasping methods use an oriented rectangle as the grasp configuration: a large number of 2D grasp candidate boxes are first generated and then further optimized to obtain the final grasp box. Planar grasping is limited to grasping from a single direction and is therefore not suitable for grasping from arbitrary angles; if the robot is required to grasp more flexibly, the 6-DoF grasp pose of the object must be acquired.
The core task of 6-DoF spatial grasping is to estimate the 6D pose of the target object. 6D pose estimation refers to estimating the 3D position [x, y, z] and the 3D orientation [Rx, Ry, Rz] of the object in the camera coordinate system, which together describe the spatial state of the object. In a real environment, object pose estimation encounters problems such as illumination changes, background interference and noise, which makes accurate and fast robotic grasping highly challenging.
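For illustration only, the following minimal Python sketch shows how such a 6D pose (a 3D position plus a 3D rotation) can be assembled into a 4×4 homogeneous transform. Treating [Rx, Ry, Rz] as an axis-angle rotation vector, and the function name pose_to_matrix, are assumptions made for this sketch; the patent does not state a rotation convention.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def pose_to_matrix(position_xyz, rotation_rxryrz):
    """Assemble a 4x4 homogeneous transform from a 6D pose.

    position_xyz:   3D position [x, y, z] in the camera frame.
    rotation_rxryrz: 3D rotation [Rx, Ry, Rz], treated here as an axis-angle
                     (rotation-vector) representation; this is an assumption.
    """
    T = np.eye(4)
    T[:3, :3] = Rotation.from_rotvec(rotation_rxryrz).as_matrix()
    T[:3, 3] = position_xyz
    return T

# Example: an object 0.5 m in front of the camera, rotated 90 degrees about Z.
T_cam_obj = pose_to_matrix([0.0, 0.0, 0.5], [0.0, 0.0, np.pi / 2])
print(T_cam_obj)
```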
A robot grasping method based on single-view keypoint voting (Computer Integrated Manufacturing Systems: 1-16 [2022-10-24], ISSN 1006-5911, CN 11-5946/TP) describes a method that obtains the pose information of an object by keypoint voting to realize robot grasping. That method has the following shortcomings: it trains and tests the 6D pose estimation network on a real data set, and collecting a real data set is tedious and requires a large investment of time; and it cannot grasp multiple similar objects, although most real environments, especially industrial environments, often contain many similar objects.
Disclosure of Invention
The invention aims to overcome the defects of the existing robot grabbing technology and provides a robot grabbing method and system based on 6D pose estimation.
In order to solve the technical problems, the following technical scheme is adopted.
The invention discloses a robot grasping method based on 6D pose estimation, which comprises the steps of constructing a robot grasping system based on 6D pose estimation, wherein the robot grasping system comprises a data set generation module, a pose estimation module, a robot control module and a corresponding network loss function; under the system, according to a single RGB image, the pose information of an object in the image is obtained through a 6D pose estimation network for a robot to grab, and the method comprises the following steps:
step 1, the data set generation module renders a synthetic data set from the three-dimensional model of the object and the background material information: using the BlenderProc tool, high-quality labeled images are rendered by randomly sampling the positions of the lights, the camera and the objects, and are used to train the 6D pose estimation network (a rendering sketch is given after step 3);
step 2, the pose estimation module inputs the data set into a 6D pose estimation network to be trained to obtain a model file, and predicts pose information of each object in the input RGB image according to the trained model file;
step 3, the robot control module completes the conversion from a camera coordinate system to a robot base coordinate system through robot hand-eye calibration, and sends a grabbing instruction to the robot to realize grabbing;
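The rendering in step 1 can be sketched roughly as follows, following the general shape of BlenderProc's BOP-style examples. The file paths, the object category id, the number of views and the sampling ranges are illustrative assumptions, and argument names of bproc.writer.write_bop may differ between BlenderProc versions.

```python
# Run with:  blenderproc run render_synthetic.py
import blenderproc as bproc   # must be imported before other modules
import numpy as np

bproc.init()

# Load an object model (path and category id are placeholders).
objs = bproc.loader.load_obj("models/obj_000001.obj")
for obj in objs:
    obj.set_cp("category_id", 1)

light = bproc.types.Light()
light.set_type("POINT")

bproc.renderer.enable_depth_output(activate_antialiasing=False)

for _ in range(25):                                   # number of random views (illustrative)
    # Randomly place and orient the object near the origin.
    for obj in objs:
        obj.set_location(np.random.uniform([-0.1, -0.1, 0.0], [0.1, 0.1, 0.0]))
        obj.set_rotation_euler(np.random.uniform([0.0, 0.0, 0.0], [2 * np.pi] * 3))

    # Random light position and intensity.
    light.set_location(np.random.uniform([-1.0, -1.0, 0.5], [1.0, 1.0, 1.5]))
    light.set_energy(np.random.uniform(100, 1000))

    # Sample a camera position on a shell around the scene and look at the objects.
    location = bproc.sampler.shell(center=[0, 0, 0], radius_min=0.4, radius_max=0.8,
                                   elevation_min=15, elevation_max=80)
    rotation = bproc.camera.rotation_from_forward_vec(bproc.object.compute_poi(objs) - location)
    bproc.camera.add_camera_pose(bproc.math.build_transformation_mat(location, rotation))

data = bproc.renderer.render()

# Write RGB images plus BOP-style json annotations (scene_camera.json, scene_gt.json, ...).
bproc.writer.write_bop("output/bop_data", target_objects=objs,
                       colors=data["colors"], depths=data["depth"],
                       color_file_format="JPEG")
```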
the network loss function is defined as geometric loss L G And pose loss L P
The synthetic data set comprises RGB images of each type of object, mask images of object outlines and three json files for respectively storing camera parameters, true pose annotation and object bounding box information;
the pose estimation network comprises: adding a ResNet module, an up-sampling module and a regression module of an ECA channel attention mechanism; the up-sampling module comprises a deconvolution block and two bilinear interpolation blocks; the deconvolution block comprises a deconvolution layer with convolution kernel size of 3x3 and step length of 2 and two convolution layers with convolution kernel size of 3x3 and step length of 1; the bilinear interpolation block comprises a bilinear upsampling layer with a proportionality coefficient of 2 and two convolution layers with convolution kernel sizes of 3 multiplied by 3 and step length of 1; the regression module comprises three convolution layers with convolution kernel size of 3 × 3 and step size of 2, two full-connected layers for flattening features, and two output units respectively representing three-dimensional rotation R 6d And three-dimensional translation t s The full connection layer of (3).
Specifically, the process of step 2 includes:
the pose estimation module firstly inputs an RGB image with the resolution of 640 x 480, cuts a target object according to a target detection result, enlarges the image to 256 x 256 by taking the object as a center, and then inputs the enlarged image into a pose estimation network according to the following steps:
(1) the amplified image is firstly downsampled by a ResNet module added with an ECA channel attention mechanism to obtain an 8 multiplied by 8 characteristic diagram;
(2) the feature map is up-sampled to 64 x 64 by an up-sampling module to obtain three geometric feature maps which are respectively object surface slice feature maps M SF Three-dimensional coordinate graph M 3D And a visible object mask map M mask (ii) a Wherein the three-dimensional coordinate map superimposes the three-dimensional space coordinate on the corresponding two-dimensional pixel coordinate to obtain a dense correspondence map M 2D-3D
(3) And inputting the surface fragment feature map and the dense corresponding map into a regression module, and directly regressing the 6D object pose.
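A minimal sketch of the cropping and resizing step described above, assuming a square crop around the detection box; the padding ratio and the zero-padding at image borders are assumptions, since the patent only states the input and output resolutions.

```python
import cv2
import numpy as np

def crop_and_resize(rgb, bbox, out_size=256, pad_ratio=1.5):
    """Crop the detected object from an RGB image (e.g. 640x480) and resize
    it to out_size x out_size, keeping the object centered.

    rgb:  HxWx3 uint8 image.
    bbox: (x, y, w, h) detection box in pixels.
    pad_ratio enlarges the box so some context is kept; the exact value is
    not stated in the patent, 1.5 is an assumption.
    """
    x, y, w, h = bbox
    cx, cy = x + w / 2.0, y + h / 2.0
    size = int(max(w, h) * pad_ratio)          # square crop around the center

    x0, y0 = int(cx - size / 2), int(cy - size / 2)
    x1, y1 = x0 + size, y0 + size

    # Pad with zeros if the square crop reaches outside the image.
    H, W = rgb.shape[:2]
    canvas = np.zeros((size, size, 3), dtype=rgb.dtype)
    sx0, sy0 = max(0, -x0), max(0, -y0)
    ix0, iy0 = max(0, x0), max(0, y0)
    ix1, iy1 = min(W, x1), min(H, y1)
    canvas[sy0:sy0 + (iy1 - iy0), sx0:sx0 + (ix1 - ix0)] = rgb[iy0:iy1, ix0:ix1]

    return cv2.resize(canvas, (out_size, out_size), interpolation=cv2.INTER_LINEAR)
```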
Specifically, the step 3 includes:
the robot control module first performs hand-eye calibration in the eye-to-hand (camera outside the hand) configuration to obtain the transformation between the robot base coordinate system and the camera coordinate system, i.e. the hand-eye calibration matrix; secondly, two points on the three-dimensional model of the object are selected as grasp points, and these are transformed by the rotation and translation matrix output by the pose estimation network and by the hand-eye calibration matrix into the actually required grasp points; then the rotation matrix is converted into a rotation vector, i.e. the grasp angle, through the Rodrigues formula; finally, after the grasp point and grasp angle are obtained, they are combined with the label information of the object and transmitted to the robot through network programming, and a grasp command is sent so that the robot grasps the objects on the desktop in sequence.
Specifically, the network loss function is defined as a pose loss L_P and a geometric loss L_G; the pose loss is defined as shown in equations (1) and (2):

L_P = L_R + L_C + L_z    (1)

L_R = avg_{x∈M} || \hat{R}x - \bar{R}x ||_1,   L_C = || (\hat{Δ}_x, \hat{Δ}_y) - (\bar{Δ}_x, \bar{Δ}_y) ||_1,   L_z = || \hat{Δ}_z - \bar{Δ}_z ||_1    (2)

where L_R denotes the loss of the rotation R, L_C denotes the loss of the scale-invariant 2D object center (Δ_x, Δ_y), L_z denotes the loss of the distance Δ_z, the hat and bar denote the predicted and ground-truth values respectively, x denotes a three-dimensional point on the model M, and the subscripts x, y, z denote the three axes of the camera coordinate system;

in the geometric loss, the three-dimensional coordinate map M_3D and the visible object mask map M_mask use the L1 loss and the object surface fragment feature map M_SF uses the cross-entropy loss CE, as shown in equation (3):

L_G = || \bar{M}_mask ⊙ (\hat{M}_3D - \bar{M}_3D) ||_1 + || \hat{M}_mask - \bar{M}_mask ||_1 + CE(\hat{M}_SF, \bar{M}_SF)    (3)

where the hat and bar denote the predicted and ground-truth values respectively and ⊙ denotes the Hadamard product.
The invention discloses a robot grasping system based on 6D pose estimation, which comprises:
the data set generation module is used to render a synthetic data set from information such as the three-dimensional model and the background materials of the object: using the BlenderProc tool and randomly sampling the positions of the lights, the camera and the objects, high-quality labeled images are rendered for training the 6D pose estimation network;
the pose estimation module inputs the data set into a 6D pose estimation network to be trained to obtain a model file, and predicts pose information of each object in the input RGB image according to the trained model file;
the robot control module completes the conversion from the camera coordinate system to the robot base coordinate system through robot hand-eye calibration, and sends a grasping instruction to the robot to realize grasping;
The network loss function is defined as a geometric loss L_G and a pose loss L_P.
The synthetic data set comprises RGB images of each type of object, mask images of object outlines and three json files for respectively storing camera parameters, true pose annotation and object bounding box information;
the pose estimation network comprises a ResNet module with an added ECA channel attention mechanism, an up-sampling module and a regression module; the up-sampling module comprises one deconvolution block and two bilinear interpolation blocks; the deconvolution block comprises one deconvolution layer with a 3×3 kernel and stride 2 and two convolution layers with 3×3 kernels and stride 1; each bilinear interpolation block comprises one bilinear up-sampling layer with scale factor 2 and two convolution layers with 3×3 kernels and stride 1; the regression module comprises three convolution layers with 3×3 kernels and stride 2, two fully connected layers used to flatten the features, and two separate fully connected output layers representing the three-dimensional rotation R_6d and the three-dimensional translation t_s respectively.
Further, the pose estimation module of the system specifically executes the following tasks:
First, an RGB image with a resolution of 640 × 480 is input; the target object is cropped according to the target detection result, the crop is resized to 256 × 256 centered on the object, and the enlarged image is then fed into the pose estimation network as follows:
(1) the enlarged image is first down-sampled by a ResNet module with an added ECA channel attention mechanism to obtain an 8 × 8 feature map;
(2) the feature map is up-sampled to 64 × 64 by the up-sampling module to obtain three geometric feature maps: an object surface fragment feature map M_SF, a three-dimensional coordinate map M_3D and a visible object mask map M_mask; the three-dimensional coordinate map superimposes the three-dimensional space coordinates on the corresponding two-dimensional pixel coordinates to obtain a dense 2D-3D correspondence map M_2D-3D;
(3) the surface fragment feature map and the dense correspondence map are input into the regression module, which directly regresses the 6D object pose.
Further, the operation process of the robot control module of the system is as follows:
First, hand-eye calibration is performed in the eye-to-hand (camera outside the hand) configuration to obtain the transformation between the robot base coordinate system and the camera coordinate system, i.e. the hand-eye calibration matrix; secondly, two points on the three-dimensional model of the object are selected as grasp points, and these are transformed by the rotation and translation matrix output by the pose estimation network and by the hand-eye calibration matrix into the actually required grasp points; then the rotation matrix is converted into a rotation vector, i.e. the grasp angle, through the Rodrigues formula; finally, after the grasp point and grasp angle are obtained, they are combined with the label information of the object and transmitted to the robot through network programming, and a grasp command is sent so that the robot grasps the objects on the desktop in sequence.
Further, the network loss function of the system is defined as a pose loss L_P and a geometric loss L_G; the definitions are shown in equations (1) and (2):

L_P = L_R + L_C + L_z    (1)

L_R = avg_{x∈M} || \hat{R}x - \bar{R}x ||_1,   L_C = || (\hat{Δ}_x, \hat{Δ}_y) - (\bar{Δ}_x, \bar{Δ}_y) ||_1,   L_z = || \hat{Δ}_z - \bar{Δ}_z ||_1    (2)

where L_R denotes the loss of the rotation R, L_C denotes the loss of the scale-invariant 2D object center (Δ_x, Δ_y), L_z denotes the loss of the distance Δ_z, the hat and bar denote the predicted and ground-truth values respectively, x denotes a three-dimensional point on the model M, and the subscripts x, y, z denote the three axes of the camera coordinate system;

in the geometric loss, the three-dimensional coordinate map M_3D and the visible object mask map M_mask use the L1 loss and the object surface fragment feature map M_SF uses the cross-entropy loss CE, as shown in equation (3):

L_G = || \bar{M}_mask ⊙ (\hat{M}_3D - \bar{M}_3D) ||_1 + || \hat{M}_mask - \bar{M}_mask ||_1 + CE(\hat{M}_SF, \bar{M}_SF)    (3)

where the hat and bar denote the predicted and ground-truth values respectively and ⊙ denotes the Hadamard product.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The method renders a synthetic data set from the object's three-dimensional model, background materials and other information, and uses this synthetic data set to train and test the 6D pose estimation network. Collecting and labeling real data sets often requires a large amount of time, and labeling errors can degrade the network training result; the synthetic data set avoids both problems.
2. The invention provides a novel 6D pose estimation network. The backbone is ResNet-101 with an added attention module, which can improve the performance of CNN architectures of various depths. The efficient channel attention (ECA) mechanism is introduced with a local cross-channel information interaction strategy, so that the network focuses on the features that are most important for the task. This improves the performance of the 6D pose estimation network, makes the pose estimation results more accurate, and makes the method better suited to robot grasping.
3. The robot grasping system can grasp not only a single target but also multiple targets and multiple targets of the same kind, can effectively solve grasping problems in complex situations such as occlusion and stacking, and meets the requirements of robot grasping in real scenes. Experiments show that the average grasp success rate of the invention is 97% in the single-target scene, 95% in the scattered multi-target scene and 92% in the occluded (stacked) multi-target scene.
4. Compared with a grabbing system using binocular vision and needing RGB-D images, the robot grabbing system disclosed by the invention does not need depth information and simplifies the system design.
Drawings
Fig. 1 is a structural block diagram of a robot grasping system based on 6D pose estimation according to the present invention.
Fig. 2 is a diagram of a 6D pose estimation network according to an embodiment of the present invention.
Fig. 3 is a visualization result of the object pose in the real scene according to an embodiment of the invention.
Fig. 4 is a schematic view of a robot gripping process according to an embodiment of the present invention.
Detailed Description
The invention relates to a robot grasping method and system based on 6D pose estimation. The system comprises a data set generation module, a pose estimation module, a robot control module and a corresponding network loss function. Within this system, the data set generation module renders a synthetic data set from the object's three-dimensional model, background materials and other information; the pose estimation module inputs the data set into the pose estimation network for training to obtain a model file, and can predict the pose information of each object in an input RGB image according to the trained model file; and the robot control module completes the conversion from the camera coordinate system to the robot base coordinate system through robot hand-eye calibration and sends a grasping instruction to the robot to realize grasping. The invention provides a robot grasping technology that can handle occlusion, multiple similar targets, stacking and other difficulties.
The present invention will be described in further detail with reference to the accompanying drawings.
Fig. 1 is a structural block diagram of a robot grasping system based on 6D pose estimation according to the present invention. The invention aims to obtain the pose information of an object in a picture through a 6D pose estimation network according to a single RGB image for a robot to grab. As shown in fig. 1, the present invention includes:
a dataset generation module: rendering according to the three-dimensional model of the object, the background material and other information to generate a synthetic data set;
the pose estimation module: inputting the data set into the 6D pose estimation network for training to obtain a model file, and predicting the pose information of each object in the input RGB image according to the trained model file;
a robot control module: completes the conversion from the camera coordinate system to the robot base coordinate system through robot hand-eye calibration, and sends a grasping instruction to the robot to realize grasping.
Network loss function: defined as a geometric loss (L_G) and a pose loss (L_P).
A dataset generation module: according to information such as a three-dimensional model and background materials of an object, a blenderProc tool is used, and through random sampling of light, a camera and the position of an object, a high-quality image with a label can be rendered and used for training and testing a 6D pose estimation network. The synthetic data set can replace the work of manually marking the data set, the efficiency of data set collection work is improved, and the accuracy is high. The data set comprises seven types of objects, each object comprises 1000 pieces of RGB images and mask images of object outlines, and three json files which respectively store camera parameters, true pose labels and object bounding box information. The camera parameters are saved in scene _ camera.json, where cam _ K represents the 3 × 3 camera intrinsic parameter matrix. A true pose annotation is saved in scene _ gt.json file, where cam _ R _ m2c represents a 3 × 3 rotation matrix; cam _ t m2c represents a translation vector of 3 × 1; obj _ id denotes a category. Json file, wherein bbox _ obj represents the bounding box of the object outline, and the form is (x, y, w, h), (x, y) is the coordinate value of the upper left corner of the bounding box, and w and h are the width and height of the bounding box respectively; bbox _ visib represents a bounding box of the visible part of the object outline; px _ count _ all represents the number of pixels in the object contour; px _ count _ visib represents the number of pixels of the visible part of the object outline; visib _ fract represents the percentage of the visible portion of the object outline.
A pose estimation module: first, an RGB image with a resolution of 640 × 480 is input; the target object is cropped according to the target detection result, the crop is resized to 256 × 256 centered on the object, and it is then fed into the 6D pose estimation network. The network comprises three parts, shown in detail in Fig. 2 and described below:
(1) The enlarged image is first down-sampled by a ResNet module with an added ECA channel attention mechanism to obtain an 8 × 8 feature map. The ECA module first applies global average pooling to each channel independently and then generates the channel weights through a one-dimensional convolution with kernel size 5 followed by a Sigmoid function. This attention module does not reduce the channel dimension, and it only considers the neighbors of each channel when performing cross-channel information interaction, which reduces its computational cost, improves the overall running speed of the network, and preserves accuracy.
(2) The up-sampling module contains one deconvolution block and two bilinear interpolation blocks. The deconvolution block includes one deconvolution layer with a 3×3 kernel and stride 2 and two convolution layers with 3×3 kernels and stride 1. Each bilinear interpolation block comprises one bilinear up-sampling layer with scale factor 2 and two convolution layers with 3×3 kernels and stride 1. The module up-samples the feature map to 64 × 64 to obtain three geometric feature maps: an object surface fragment feature map (M_SF), a three-dimensional coordinate map (M_3D) and a visible object mask map (M_mask). The three-dimensional coordinate map superimposes the three-dimensional space coordinates on the corresponding two-dimensional pixel coordinates to obtain a dense 2D-3D correspondence map (M_2D-3D).
(3) The surface fragment feature map and the dense correspondence map are input into the regression module, which directly regresses the 6D object pose. The regression module comprises three convolution layers with 3×3 kernels and stride 2, two fully connected layers used to flatten the features, and two further fully connected layers that output the three-dimensional rotation (R_6d) and the three-dimensional translation (t_s) respectively. A sketch of these blocks is given after this list.
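The following PyTorch sketch illustrates the blocks just described: the ECA channel attention, the deconvolution and bilinear up-sampling blocks, and the regression head with separate R_6d and t_s outputs. Channel widths, the hidden size of the fully connected layers and the padding choices are assumptions not specified in the patent.

```python
import torch
import torch.nn as nn

def conv3x3(c_in, c_out, stride=1):
    """3x3 convolution followed by ReLU."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride, 1), nn.ReLU(inplace=True))

class ECA(nn.Module):
    """Efficient Channel Attention: global average pooling per channel, a 1-D
    convolution with kernel size 5 across channels, then a Sigmoid."""
    def __init__(self, kernel_size=5):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                              # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                         # (B, C) channel descriptor
        w = torch.sigmoid(self.conv(w.unsqueeze(1)).squeeze(1))
        return x * w[:, :, None, None]                 # reweight the feature maps

class DeconvBlock(nn.Module):
    """One 3x3 stride-2 deconvolution plus two 3x3 stride-1 convolutions."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.up = nn.ConvTranspose2d(c_in, c_out, 3, stride=2, padding=1, output_padding=1)
        self.refine = nn.Sequential(conv3x3(c_out, c_out), conv3x3(c_out, c_out))

    def forward(self, x):
        return self.refine(self.up(x))

class BilinearBlock(nn.Module):
    """Bilinear x2 up-sampling plus two 3x3 stride-1 convolutions."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.refine = nn.Sequential(conv3x3(c_in, c_out), conv3x3(c_out, c_out))

    def forward(self, x):
        return self.refine(self.up(x))

class RegressionHead(nn.Module):
    """Three 3x3 stride-2 convolutions, two flattening FC layers, and two
    separate FC outputs for the rotation R_6d and the translation t_s."""
    def __init__(self, c_in, spatial=64, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(conv3x3(c_in, 128, 2), conv3x3(128, 128, 2), conv3x3(128, 128, 2))
        flat = 128 * (spatial // 8) * (spatial // 8)
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(flat, hidden), nn.ReLU(inplace=True),
                                nn.Linear(hidden, hidden), nn.ReLU(inplace=True))
        self.fc_r = nn.Linear(hidden, 6)               # 6-D rotation representation
        self.fc_t = nn.Linear(hidden, 3)               # scale-invariant translation

    def forward(self, x):
        f = self.fc(self.conv(x))
        return self.fc_r(f), self.fc_t(f)
```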
A robot control module: the vision system obtains the grasp pose of the target object in the camera coordinate system using the pose estimation algorithm; however, to realize the robot grasping task, the transformation between the robot base coordinate system and the camera coordinate system must be known. Depending on the positional relationship between the robot and the camera, hand-eye calibration takes one of two forms: eye-in-hand or eye-to-hand. The invention calibrates in the eye-to-hand configuration (camera outside the hand) and obtains the transformation between the robot base coordinate system and the camera coordinate system, i.e. the hand-eye calibration matrix. Next, two points on the three-dimensional model of the object are selected as grasp points; these are transformed by the rotation and translation matrix output by the 6D pose estimation network and by the hand-eye calibration matrix into the actually required grasp points. The rotation matrix is then converted into a rotation vector, i.e. the grasp angle, through the Rodrigues formula. Finally, after the grasp point and grasp angle are obtained, they are combined with the label information of the object and transmitted to the robot through network programming; a grasp command is sent so that the robot grasps the objects on the desktop in sequence.
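A minimal sketch of the coordinate transformations in this module: a grasp point defined on the object model is moved into the camera frame by the estimated pose, then into the robot base frame by the hand-eye calibration matrix, and the grasp angle is obtained with OpenCV's Rodrigues conversion. Variable names and the absence of any gripper-specific offset are assumptions of this sketch.

```python
import cv2
import numpy as np

def grasp_in_base(R_cam_obj, t_cam_obj, T_base_cam, p_obj):
    """Transform a grasp point defined on the object model into the robot
    base frame and compute the grasp angle as a rotation vector.

    R_cam_obj, t_cam_obj : 3x3 rotation and 3-vector translation from the
                           pose estimation network (object -> camera).
    T_base_cam           : 4x4 hand-eye calibration matrix (camera -> base),
                           obtained from eye-to-hand calibration.
    p_obj                : 3-D grasp point selected on the object model.
    """
    # Object frame -> camera frame.
    p_cam = R_cam_obj @ np.asarray(p_obj, dtype=float) + np.asarray(t_cam_obj, dtype=float)

    # Camera frame -> robot base frame via the hand-eye matrix.
    p_base = (T_base_cam @ np.append(p_cam, 1.0))[:3]

    # Grasp orientation: rotate the object orientation into the base frame and
    # convert the rotation matrix to a rotation vector with the Rodrigues formula.
    R_base_obj = T_base_cam[:3, :3] @ R_cam_obj
    rvec, _ = cv2.Rodrigues(R_base_obj)

    return p_base, rvec.ravel()
```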
Network loss function: defined as a geometric loss (L_G) and a pose loss (L_P). The pose loss is defined as shown in equations (1) and (2):

L_P = L_R + L_C + L_z    (1)

L_R = avg_{x∈M} || \hat{R}x - \bar{R}x ||_1,   L_C = || (\hat{Δ}_x, \hat{Δ}_y) - (\bar{Δ}_x, \bar{Δ}_y) ||_1,   L_z = || \hat{Δ}_z - \bar{Δ}_z ||_1    (2)

where L_R denotes the loss of the rotation R, L_C denotes the loss of the scale-invariant 2D object center (Δ_x, Δ_y), L_z denotes the loss of the distance Δ_z, the hat and bar denote the predicted and ground-truth values respectively, x denotes a three-dimensional point on the model M, and the subscripts x, y, z denote the three axes of the camera coordinate system.

In the geometric loss, the three-dimensional coordinate map (M_3D) and the visible object mask map (M_mask) use the L1 loss and the object surface fragment feature map (M_SF) uses the cross-entropy loss (CE), as shown in equation (3):

L_G = || \bar{M}_mask ⊙ (\hat{M}_3D - \bar{M}_3D) ||_1 + || \hat{M}_mask - \bar{M}_mask ||_1 + CE(\hat{M}_SF, \bar{M}_SF)    (3)

where the hat and bar denote the predicted and ground-truth values respectively and ⊙ denotes the Hadamard product.
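A sketch of how these loss terms could be written in PyTorch, consistent with the stated definitions: an L1 rotation loss averaged over model points, L1 losses on the 2D center and distance offsets, L1 losses on the coordinate and mask maps, and cross-entropy on the surface fragments. Equal weighting of the terms and masking the coordinate loss by the ground-truth visible mask are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def pose_loss(R_pred, R_gt, center_pred, center_gt, z_pred, z_gt, model_points):
    """Pose loss L_P = L_R + L_C + L_z (equation (1)).

    model_points: (N, 3) points x on the 3-D model M. L_R averages the L1
    distance between the points rotated by the predicted and ground-truth
    rotations; L_C and L_z are L1 losses on the scale-invariant 2-D center
    offset and on the distance offset.
    """
    L_R = (model_points @ R_pred.T - model_points @ R_gt.T).abs().mean()
    L_C = F.l1_loss(center_pred, center_gt)
    L_z = F.l1_loss(z_pred, z_gt)
    return L_R + L_C + L_z

def geometric_loss(m3d_pred, m3d_gt, mask_pred, mask_gt, msf_logits, msf_gt):
    """Geometric loss L_G (equation (3)): L1 on the visible-mask-weighted
    3-D coordinate map and on the mask, cross-entropy on the surface
    fragment map."""
    L_3d   = F.l1_loss(mask_gt * m3d_pred, mask_gt * m3d_gt)   # Hadamard masking
    L_mask = F.l1_loss(mask_pred, mask_gt)
    L_sf   = F.cross_entropy(msf_logits, msf_gt)               # fragment class labels
    return L_3d + L_mask + L_sf
```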
In summary, the present invention provides a new framework for robot grasping that comprises three modules: a data set generation module, a pose estimation module and a robot control module. The data set generation module renders a synthetic data set from the object's three-dimensional model, background materials and other information; the pose estimation module inputs the data set into the 6D pose estimation network for training to obtain a model file, and can predict the pose information of each object in an input RGB image according to the trained model file; and the robot control module completes the conversion from the camera coordinate system to the robot base coordinate system through robot hand-eye calibration and sends a grasping instruction to the robot. The technology is applicable to production, service, logistics, medical and other fields, can accurately grasp target objects from cluttered scenes, provides technical support for the popularization and industrialization of robot grasping technology, and has broad market prospects.
Experimental verification of the invention shows that:
the 6D pose estimation network can realize accurate pose estimation from a single RGB image. To compare the performance of this method, one chooses to compare it with other 6D pose estimation networks using only a single RGB image on the Linemod dataset, the Linemod occupancy dataset, and the YCB-Video dataset.
On the Linemod data set, the batch size is set to 24, the learning rate to 1e-4 and the number of iterations to 160, and a GTX-3090 graphics card is used to train the 6D pose estimation network. The method is compared with Tekin et al., Pix2Pose, Self6D and GDR-Net using the ADD(-S) metric, as shown in Table 1.
TABLE 1 comparison of Linemod with respect to ADD (-S) index
[The values of Table 1 are available only as an image in the original document.]
On the Linemod Occlusion data set, the batch size is set to 24, the learning rate to 1e-4 and the number of iterations to 40, and a GTX-3090 graphics card is used to train the 6D pose estimation network. The method of the invention is compared with PoseCNN, Pix2Pose, Self6D, GDR-Net and others using the ADD(-S) metric, as shown in Table 2.
TABLE 2 comparison of Linemod Occlusion on ADD (-S) index
[The values of Table 2 are available only as an image in the original document.]
On the YCB-Video data set, the batch size is set to 24, the learning rate to 1e-4 and the number of iterations to 10, and a GTX-3090 graphics card is used to train the 6D pose estimation network. YCB-Video is a challenging data set containing multiple symmetric objects, occlusion and clutter. The method is compared with the PoseCNN and GDR-Net networks using the ADD-S metric and the AUC of ADD-S/ADD(-S) metric. The AUC of ADD-S/ADD(-S) is the area under the accuracy-threshold curve, obtained by varying the distance threshold with a maximum threshold of 10 cm. Table 3 gives a detailed evaluation of all 21 objects, where * denotes a symmetric object.
TABLE 3 comparison on YCB-Video
[The values of Table 3 are available only as an image in the original document.]
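For reference, the ADD(-S) and AUC metrics used in Tables 1-3 can be computed roughly as follows. This is a generic sketch of the common definitions rather than code from the patent; the 10 cm maximum threshold follows the text above, and model points are assumed to be given in metres.

```python
import numpy as np

def add_metric(R_pred, t_pred, R_gt, t_gt, model_points):
    """ADD: mean distance between model points under the predicted and
    ground-truth poses."""
    p1 = model_points @ R_pred.T + t_pred
    p2 = model_points @ R_gt.T + t_gt
    return np.linalg.norm(p1 - p2, axis=1).mean()

def adds_metric(R_pred, t_pred, R_gt, t_gt, model_points):
    """ADD-S (for symmetric objects): mean closest-point distance."""
    p1 = model_points @ R_pred.T + t_pred
    p2 = model_points @ R_gt.T + t_gt
    # For each ground-truth point, distance to the nearest predicted point
    # (O(N^2) memory; fine for a sketch with a few thousand model points).
    d = np.linalg.norm(p2[:, None, :] - p1[None, :, :], axis=2)
    return d.min(axis=1).mean()

def auc_of_add(errors, max_threshold=0.10, steps=100):
    """Area under the accuracy/threshold curve, thresholds up to 10 cm."""
    errors = np.asarray(errors)
    thresholds = np.linspace(0.0, max_threshold, steps)
    accuracy = [(errors < th).mean() for th in thresholds]
    return np.trapz(accuracy, thresholds) / max_threshold
```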
From the pose estimation experimental results, the 6D pose estimation network is superior to other methods on most objects, and the average index also performs best.
To verify the grasping feasibility of the invention, multiple sets of experiments were performed in real scenes; Fig. 3 shows the visualized object poses in a multi-target scene, a similar-multi-target scene and an occlusion (stacking) scene. Under sufficient illumination, grasping experiments were carried out on seven objects: a ball, a chinchilla ("dragon cat") toy, a duck, a rabbit, a square part, a triangular part and a round part. The grasping process is shown in Fig. 4, and the grasp success rates in the single-target, scattered multi-target and occluded (stacked) multi-target scenes are shown in Table 4.
TABLE 4 actual scene Capture success Rate
[The values of Table 4 are available only as an image in the original document.]
The robot grasping system achieves an average grasp success rate of 97% in the single-target scene, 95% in the scattered multi-target scene and 92% in the occluded (stacked) multi-target scene.

Claims (8)

1. A robot grabbing method based on 6D pose estimation is characterized in that a robot grabbing system based on 6D pose estimation is constructed, and the robot grabbing system comprises a data set generation module, a pose estimation module, a robot control module and a corresponding network loss function; under the system, according to a single RGB image, the pose information of an object in the image is obtained through a 6D pose estimation network for the robot to grab, and the method comprises the following steps:
step 1, the data set generating module generates a synthetic data set according to the three-dimensional model of the object and the background material information by rendering: using a BlenderProc tool to render a high-quality image with labels through randomly sampling the positions of lamplight, a camera and an object for training a 6D pose estimation network;
step 2, the pose estimation module inputs the data set into a 6D pose estimation network to be trained to obtain a model file, and predicts pose information of each object in the input RGB image according to the trained model file;
step 3, the robot control module completes the conversion from the camera coordinate system to the robot base coordinate system through robot hand-eye calibration, and sends a grabbing instruction to the robot to realize grabbing;
the network loss function is defined as geometric loss L G And pose loss L P
The synthetic data set comprises an RGB image of each type of object, a mask image of the object outline and three json files for respectively storing camera parameters, true pose annotation and object bounding box information;
the pose estimation network comprises: adding a ResNet module, an up-sampling module and a regression module of an ECA channel attention mechanism; the up-sampling module comprises an anti-convolution block and two bilinear interpolation blocks; the deconvolution block comprises a deconvolution layer with convolution kernel size of 3 × 3 and step size of 2 and two convolution layers with convolution kernel size of 3 × 3 and step size of 1; the bilinear interpolation block comprises a bilinear up-sampling layer with a proportionality coefficient of 2 and two convolution layers with convolution kernel size of 3 multiplied by 3 and step length of 1; the regression module comprises three convolution layers with convolution kernel size of 3 × 3 and step size of 2, two full-connected layers for flattening features, and two output units respectively representing three-dimensional rotation R 6d And three-dimensional translation t s The full interconnect layer of (1).
2. The robot grasping method based on the 6D pose estimation network according to claim 1, wherein the process of the step 2 comprises:
the pose estimation module first inputs an RGB image with a resolution of 640 × 480, crops the target object according to the target detection result, resizes the crop to 256 × 256 centered on the object, and then feeds the enlarged image into the pose estimation network as follows:
(1) the enlarged image is first down-sampled by a ResNet module with an added ECA channel attention mechanism to obtain an 8 × 8 feature map;
(2) the feature map is up-sampled to 64 × 64 by the up-sampling module to obtain three geometric feature maps: an object surface fragment feature map M_SF, a three-dimensional coordinate map M_3D and a visible object mask map M_mask; the three-dimensional coordinate map superimposes the three-dimensional space coordinates on the corresponding two-dimensional pixel coordinates to obtain a dense 2D-3D correspondence map M_2D-3D;
(3) the surface fragment feature map and the dense correspondence map are input into the regression module, which directly regresses the 6D object pose.
3. The robot grasping method based on the 6D pose estimation network according to claim 1, wherein the step 3 specifically includes:
the robot control module first performs hand-eye calibration in the eye-to-hand (camera outside the hand) configuration to obtain the transformation between the robot base coordinate system and the camera coordinate system, i.e. the hand-eye calibration matrix; secondly, two points on the three-dimensional model of the object are selected as grasp points, and these are transformed by the rotation and translation matrix output by the pose estimation network and by the hand-eye calibration matrix into the actually required grasp points; then the rotation matrix is converted into a rotation vector, i.e. the grasp angle, through the Rodrigues formula; finally, after the grasp point and grasp angle are obtained, they are combined with the label information of the object and transmitted to the robot through network programming, and a grasp command is sent so that the robot grasps the objects on the desktop in sequence.
4. The robot grasping method based on the 6D pose estimation network according to claim 1, wherein the network loss function is defined as a pose loss L_P and a geometric loss L_G; the pose loss is defined as shown in equations (1) and (2):

L_P = L_R + L_C + L_z    (1)

L_R = avg_{x∈M} || \hat{R}x - \bar{R}x ||_1,   L_C = || (\hat{Δ}_x, \hat{Δ}_y) - (\bar{Δ}_x, \bar{Δ}_y) ||_1,   L_z = || \hat{Δ}_z - \bar{Δ}_z ||_1    (2)

where L_R denotes the loss of the rotation R, L_C denotes the loss of the scale-invariant 2D object center (Δ_x, Δ_y), L_z denotes the loss of the distance Δ_z, the hat and bar denote the predicted and ground-truth values respectively, x denotes a three-dimensional point on the model M, and the subscripts x, y, z denote the three axes of the camera coordinate system;

in the geometric loss, the three-dimensional coordinate map M_3D and the visible object mask map M_mask use the L1 loss and the object surface fragment feature map M_SF uses the cross-entropy loss CE, as shown in equation (3):

L_G = || \bar{M}_mask ⊙ (\hat{M}_3D - \bar{M}_3D) ||_1 + || \hat{M}_mask - \bar{M}_mask ||_1 + CE(\hat{M}_SF, \bar{M}_SF)    (3)

where the hat and bar denote the predicted and ground-truth values respectively and ⊙ denotes the Hadamard product.
5. A robot gripper system based on 6D pose estimation is characterized by comprising:
the data set generating module generates a synthetic data set according to the three-dimensional model of the object and the background material information by rendering: using a BlenderProc tool to render a high-quality image with labels through randomly sampling the positions of lamplight, a camera and an object for training a 6D pose estimation network;
the pose estimation module inputs the data set into a 6D pose estimation network to be trained to obtain a model file, and predicts pose information of each object in the input RGB image according to the trained model file;
the robot control module completes the conversion from the camera coordinate system to the robot base coordinate system through robot hand-eye calibration, and sends a grabbing instruction to the robot to realize grabbing;
the network loss function is defined as geometric loss L G And pose loss L P
The synthetic data set comprises RGB images of each type of object, mask images of object outlines and three json files for respectively storing camera parameters, true pose annotation and object bounding box information;
the pose estimation network comprises a ResNet module with an added ECA channel attention mechanism, an up-sampling module and a regression module; the up-sampling module comprises one deconvolution block and two bilinear interpolation blocks; the deconvolution block comprises one deconvolution layer with a 3×3 kernel and stride 2 and two convolution layers with 3×3 kernels and stride 1; each bilinear interpolation block comprises one bilinear up-sampling layer with scale factor 2 and two convolution layers with 3×3 kernels and stride 1; the regression module comprises three convolution layers with 3×3 kernels and stride 2, two fully connected layers used to flatten the features, and two separate fully connected output layers representing the three-dimensional rotation R_6d and the three-dimensional translation t_s respectively.
6. The robot gripping system based on 6D pose estimation according to claim 5, wherein the pose estimation module specifically performs the following tasks:
First, an RGB image with a resolution of 640 × 480 is input; the target object is cropped according to the target detection result, the crop is resized to 256 × 256 centered on the object, and the enlarged image is then fed into the pose estimation network as follows:
(1) the enlarged image is first down-sampled by a ResNet module with an added ECA channel attention mechanism to obtain an 8 × 8 feature map;
(2) the feature map is up-sampled to 64 × 64 by the up-sampling module to obtain three geometric feature maps: an object surface fragment feature map M_SF, a three-dimensional coordinate map M_3D and a visible object mask map M_mask; the three-dimensional coordinate map superimposes the three-dimensional space coordinates on the corresponding two-dimensional pixel coordinates to obtain a dense 2D-3D correspondence map M_2D-3D;
(3) the surface fragment feature map and the dense correspondence map are input into the regression module, which directly regresses the 6D object pose.
7. The robot grabbing system based on 6D pose estimation according to claim 5, wherein the robot control module operates as follows:
First, hand-eye calibration is performed in the eye-to-hand (camera outside the hand) configuration to obtain the transformation between the robot base coordinate system and the camera coordinate system, i.e. the hand-eye calibration matrix; secondly, two points on the three-dimensional model of the object are selected as grasp points, and these are transformed by the rotation and translation matrix output by the pose estimation network and by the hand-eye calibration matrix into the actually required grasp points; then the rotation matrix is converted into a rotation vector, i.e. the grasp angle, through the Rodrigues formula; finally, after the grasp point and grasp angle are obtained, they are combined with the label information of the object and transmitted to the robot through network programming, and a grasp command is sent so that the robot grasps the objects on the desktop in sequence.
8. The robot grabbing system based on 6D pose estimation according to claim 5, wherein the network loss function is defined as a pose loss L_P and a geometric loss L_G; the pose loss is defined as shown in equations (1) and (2):

L_P = L_R + L_C + L_z    (1)

L_R = avg_{x∈M} || \hat{R}x - \bar{R}x ||_1,   L_C = || (\hat{Δ}_x, \hat{Δ}_y) - (\bar{Δ}_x, \bar{Δ}_y) ||_1,   L_z = || \hat{Δ}_z - \bar{Δ}_z ||_1    (2)

where L_R denotes the loss of the rotation R, L_C denotes the loss of the scale-invariant 2D object center (Δ_x, Δ_y), L_z denotes the loss of the distance Δ_z, the hat and bar denote the predicted and ground-truth values respectively, x denotes a three-dimensional point on the model M, and the subscripts x, y, z denote the three axes of the camera coordinate system;

in the geometric loss, the three-dimensional coordinate map M_3D and the visible object mask map M_mask use the L1 loss and the object surface fragment feature map M_SF uses the cross-entropy loss CE, as shown in equation (3):

L_G = || \bar{M}_mask ⊙ (\hat{M}_3D - \bar{M}_3D) ||_1 + || \hat{M}_mask - \bar{M}_mask ||_1 + CE(\hat{M}_SF, \bar{M}_SF)    (3)

where the hat and bar denote the predicted and ground-truth values respectively and ⊙ denotes the Hadamard product.
CN202211376147.1A 2022-11-04 2022-11-04 Robot grabbing method and system based on 6D pose estimation Pending CN115641322A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211376147.1A CN115641322A (en) 2022-11-04 2022-11-04 Robot grabbing method and system based on 6D pose estimation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211376147.1A CN115641322A (en) 2022-11-04 2022-11-04 Robot grabbing method and system based on 6D pose estimation

Publications (1)

Publication Number Publication Date
CN115641322A true CN115641322A (en) 2023-01-24

Family

ID=84948875

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211376147.1A Pending CN115641322A (en) 2022-11-04 2022-11-04 Robot grabbing method and system based on 6D pose estimation

Country Status (1)

Country Link
CN (1) CN115641322A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116664843A (en) * 2023-06-05 2023-08-29 北京信息科技大学 Residual fitting grabbing detection network based on RGBD image and semantic segmentation
CN116664843B (en) * 2023-06-05 2024-02-20 北京信息科技大学 Residual fitting grabbing detection network based on RGBD image and semantic segmentation
CN117237451A (en) * 2023-09-15 2023-12-15 南京航空航天大学 Industrial part 6D pose estimation method based on contour reconstruction and geometric guidance
CN117237451B (en) * 2023-09-15 2024-04-02 南京航空航天大学 Industrial part 6D pose estimation method based on contour reconstruction and geometric guidance

Similar Documents

Publication Publication Date Title
CN109870983B (en) Method and device for processing tray stack image and system for warehousing goods picking
CN111179324B (en) Object six-degree-of-freedom pose estimation method based on color and depth information fusion
CN110674829B (en) Three-dimensional target detection method based on graph convolution attention network
CN115641322A (en) Robot grabbing method and system based on 6D pose estimation
CN111553949B (en) Positioning and grabbing method for irregular workpiece based on single-frame RGB-D image deep learning
JP2017120648A (en) Reconstructing 3d modeled object
CN111368852A (en) Article identification and pre-sorting system and method based on deep learning and robot
CN112836734A (en) Heterogeneous data fusion method and device and storage medium
CN114952809B (en) Workpiece identification and pose detection method, system and mechanical arm grabbing control method
CN114724120B (en) Vehicle target detection method and system based on radar vision semantic segmentation adaptive fusion
US20220415030A1 (en) AR-Assisted Synthetic Data Generation for Training Machine Learning Models
CN113159232A (en) Three-dimensional target classification and segmentation method
CN112712589A (en) Plant 3D modeling method and system based on laser radar and deep learning
CN114882109A (en) Robot grabbing detection method and system for sheltering and disordered scenes
CN115238758A (en) Multi-task three-dimensional target detection method based on point cloud feature enhancement
CN115578460A (en) Robot grabbing method and system based on multi-modal feature extraction and dense prediction
WO2021167586A1 (en) Systems and methods for object detection including pose and size estimation
CN114998578A (en) Dynamic multi-article positioning, grabbing and packaging method and system
CN114119753A (en) Transparent object 6D attitude estimation method facing mechanical arm grabbing
CN108898679A (en) A kind of method of component serial number automatic marking
EP3905130A1 (en) Computer-implemented method for 3d localization of an object based on image data and depth data
EP3905107A1 (en) Computer-implemented method for 3d localization of an object based on image data and depth data
JP6016242B2 (en) Viewpoint estimation apparatus and classifier learning method thereof
CN112975957A (en) Target extraction method, system, robot and storage medium
CN117011380A (en) 6D pose estimation method of target object

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination