Disclosure of Invention
In view of the above, there is a need to provide a method for a robot to grasp a transparent object based on deep learning, in which the three-dimensional geometry of the transparent object is accurately estimated from an RGB-D image by a deep learning method so that the robot can operate on it, thereby solving the task of a robot grasping transparent objects in a home scene.
In order to achieve the above purpose, the invention is realized by the following technical scheme:
a method for grabbing a transparent object by a robot based on deep learning comprises the following steps:
step S1: completing the establishment of a hardware environment of a system for grabbing the transparent object by the robot;
step S2: completing the calibration of a camera of a system for grabbing the transparent object by the robot;
step S3: completing the training of a grasp planning model based on a convolutional neural network and the grasping of the object by the robot in a real environment.
Further, the hardware environment of the system for grabbing the transparent object by the robot comprises a depth camera, at least one computer running ROS, at least one robot with a gripper, and at least one object to be grabbed;
the depth camera is used for acquiring 3D visual data and is installed on the robot;
the computer is used for completing the training of the grasping network model;
the robot is used for grabbing an object to be grabbed.
Further, when the camera shoots an object, it captures a depth image and a color image at the same time; when the camera is calibrated, the color image and the depth image need to be calibrated so that each pixel of the depth image corresponds to a pixel of the color image, where step S2 specifically includes the following steps:
step S21: determining internal parameters and external parameters of a binocular camera through camera calibration, and completing the transformation from a world coordinate system to a camera coordinate system;
step S22: determining the relative position between the camera and the end effector through hand-eye calibration, and completing the transformation from the camera coordinate system to the robot end effector coordinate system.
Further, the specific implementation method of step S21 includes:
the transformation from the world coordinate system to the camera coordinate system is described by a rotation matrix R and a translation vector T, as shown in equation (1):

$$\begin{bmatrix} X_1 \\ Y_1 \\ Z_1 \end{bmatrix} = R_1 \begin{bmatrix} X_W \\ Y_W \\ Z_W \end{bmatrix} + T_1, \qquad \begin{bmatrix} X_2 \\ Y_2 \\ Z_2 \end{bmatrix} = R_2 \begin{bmatrix} X_W \\ Y_W \\ Z_W \end{bmatrix} + T_2 \qquad (1)$$

in equation (1), $R_1$, $T_1$ are the extrinsic parameters of the left-eye camera and $R_2$, $T_2$ are the extrinsic parameters of the right-eye camera, both obtained through camera calibration; $(X_W, Y_W, Z_W)$ are the coordinates of a point in space in the world coordinate system, $(X_1, Y_1, Z_1)$ are its coordinates in the left-eye camera coordinate system, and $(X_2, Y_2, Z_2)$ are its coordinates in the right-eye camera coordinate system;
taking the left-eye camera coordinate system as the reference, and letting the rotation matrix from the right-eye camera coordinate system to the left-eye camera coordinate system be R' and the translation vector be T', then:

$$\begin{bmatrix} X_1 \\ Y_1 \\ Z_1 \end{bmatrix} = R' \begin{bmatrix} X_2 \\ Y_2 \\ Z_2 \end{bmatrix} + T' \qquad (2)$$

from equations (1) and (2):

$$R' = R_1 R_2^{-1}, \qquad T' = T_1 - R_1 R_2^{-1} T_2 \qquad (3)$$
when shooting with the binocular camera, the position of the calibration plate is kept unchanged and the left-eye and right-eye cameras capture images of the calibration plate simultaneously; after several groups of image pairs are collected, they are imported into the calibration toolbox, which automatically calculates the rotation matrix and translation vector between the two cameras, and these are used to complete the transformation from the world coordinate system to the camera coordinate system.
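For illustration only, the binocular calibration above can also be reproduced with OpenCV instead of the calibration toolbox referred to in the text; the sketch below is a minimal example in which the chessboard size, square size and image paths are assumptions, not values from the invention.

import cv2
import numpy as np
import glob

BOARD = (9, 6)          # assumed inner-corner count of the chessboard calibration plate
SQUARE = 0.025          # assumed square size in metres

# 3D coordinates of the board corners in the world (calibration-plate) frame
objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2) * SQUARE

obj_pts, left_pts, right_pts = [], [], []
for lf, rf in zip(sorted(glob.glob("left/*.png")), sorted(glob.glob("right/*.png"))):
    gl = cv2.imread(lf, cv2.IMREAD_GRAYSCALE)
    gr = cv2.imread(rf, cv2.IMREAD_GRAYSCALE)
    okl, cl = cv2.findChessboardCorners(gl, BOARD)
    okr, cr = cv2.findChessboardCorners(gr, BOARD)
    if okl and okr:
        obj_pts.append(objp)
        left_pts.append(cl)
        right_pts.append(cr)

# Intrinsic (internal) parameters of each camera
_, K1, D1, _, _ = cv2.calibrateCamera(obj_pts, left_pts, gl.shape[::-1], None, None)
_, K2, D2, _, _ = cv2.calibrateCamera(obj_pts, right_pts, gr.shape[::-1], None, None)

# Stereo calibration: OpenCV returns R, T mapping points from the first (left) camera
# frame to the second (right) camera frame; inverting gives the right-to-left
# transform R', T' of equations (2)-(3).
_, K1, D1, K2, D2, R, T, _, _ = cv2.stereoCalibrate(
    obj_pts, left_pts, right_pts, K1, D1, K2, D2, gl.shape[::-1],
    flags=cv2.CALIB_FIX_INTRINSIC)
R_prime = R.T
T_prime = -R.T @ T
print("R' =", R_prime, "\nT' =", T_prime)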
Further, the specific implementation method of step S22 includes:
the transformation from the camera coordinate system to the robot end effector coordinate system is solved through hand-eye calibration, where the "hand" denotes the end effector and the "eye" denotes the camera; the hand-eye calibration process involves 4 coordinate systems, namely the calibration plate coordinate system B, the camera coordinate system C, the end effector coordinate system T and the robot base coordinate system R;
a transformation matrix $T_B^R$ is used to describe the transformation from the calibration plate coordinate system B to the robot base coordinate system R, expressed as follows:

$$T_B^R = T_T^R \, T_C^T \, T_B^C \qquad (4)$$

in equation (4), $T_B^C$ denotes the transformation matrix from the calibration plate coordinate system B to the camera coordinate system C, i.e. the camera extrinsic parameters, obtained through camera calibration; $T_T^R$ denotes the transformation matrix from the end effector coordinate system T to the robot base coordinate system R, obtained from the parameters on the robot teach pendant; and $T_C^T$, the transformation matrix from the camera coordinate system C to the end effector coordinate system T, is the hand-eye matrix to be solved;
in the calibration process, the position of the calibration plate is kept unchanged and the robot is controlled to capture images of the calibration plate from different positions; selecting two positions for analysis gives equation (5):
$$T_B^R(i) = T_T^R(i)\, T_C^T(i)\, T_B^C(i), \qquad T_B^R(i{+}1) = T_T^R(i{+}1)\, T_C^T(i{+}1)\, T_B^C(i{+}1) \qquad (5)$$

in equation (5), $T_B^R(i)$ and $T_B^R(i{+}1)$ denote the transformation matrices from the calibration plate coordinate system B to the robot base coordinate system R at position i and position i+1 respectively, $T_T^R(i)$ and $T_T^R(i{+}1)$ denote the transformation matrices from the end effector coordinate system T to the robot base coordinate system R at position i and position i+1, $T_C^T(i)$ and $T_C^T(i{+}1)$ denote the hand-eye matrices to be solved at position i and position i+1, and $T_B^C(i)$ and $T_B^C(i{+}1)$ denote the transformation matrices from the calibration plate coordinate system B to the camera coordinate system C at position i and position i+1; because the relative position between the calibration plate and the robot base does not change, and the relative position between the robot end effector and the camera does not change, $T_B^R(i) = T_B^R(i{+}1)$ and $T_C^T(i) = T_C^T(i{+}1) = T_C^T$.
Combining these relations simultaneously gives equation (6):

$$\big[T_T^R(i{+}1)\big]^{-1} T_T^R(i)\; T_C^T = T_C^T\; T_B^C(i{+}1)\, \big[T_B^C(i)\big]^{-1} \qquad (6)$$

in equation (6), $T_T^R(i)$, $T_T^R(i{+}1)$, $T_B^C(i)$ and $T_B^C(i{+}1)$ are all known quantities, so $T_C^T$ can finally be solved, i.e. the transformation matrix from the camera coordinate system to the robot end effector coordinate system.
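As a hedged illustration, equation (6) has the classic AX = XB form of eye-in-hand calibration, which OpenCV can solve directly; the sketch below assumes the per-position poses (end effector to base from the teach pendant, calibration plate to camera from the extrinsic calibration) are already collected as lists, and the function name is a hypothetical placeholder.

import cv2
import numpy as np

def solve_hand_eye(R_g2b, t_g2b, R_t2c, t_t2c):
    """R_g2b, t_g2b: rotations/translations from the end effector frame T to the robot
    base frame R at each position (read from the teach pendant);
    R_t2c, t_t2c: rotations/translations from the calibration plate frame B to the
    camera frame C at each position (from camera calibration).
    Returns the 4x4 hand-eye matrix from the camera frame C to the end effector frame T."""
    R_c2g, t_c2g = cv2.calibrateHandEye(
        R_gripper2base=R_g2b, t_gripper2base=t_g2b,
        R_target2cam=R_t2c, t_target2cam=t_t2c,
        method=cv2.CALIB_HAND_EYE_TSAI)
    X = np.eye(4)
    X[:3, :3] = R_c2g
    X[:3, 3] = t_c2g.ravel()
    return X  # corresponds to T_C^T in equations (4)-(6)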
Further, the specific implementation method of step S3 includes:
step S31: using a depth camera to scan and capture a color image and a depth image of the transparent object;
step S32: filtering the acquired images;
step S33: completing transparent object detection and segmentation by using the ClearGrasp deep learning algorithm;
step S34: searching and scoring candidate grabbing positions of the object by a contact line search method, and accurately grabbing the object after the optimal grabbing position is obtained.
Further, in step S32, a Gaussian filtering algorithm, which balances speed and effect, is selected to filter the acquired image; the Gaussian filtering formula is shown in equation (7):

$$f(x, y) = \frac{1}{2\pi\sigma^2} \exp\!\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \qquad (7)$$

in equation (7), f(x, y) denotes the Gaussian function value, x and y denote the horizontal and vertical distances from a pixel in the neighborhood to the center pixel of the neighborhood, and σ denotes the standard deviation.
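For illustration, a minimal sketch of this filtering step with OpenCV follows; the kernel size and σ are assumed values, not parameters specified by the invention.

import cv2

def denoise(depth_image, color_image, ksize=5, sigma=1.0):
    """Apply Gaussian smoothing to the depth and color images before they are
    passed to the ClearGrasp networks (step S32)."""
    depth_f = cv2.GaussianBlur(depth_image, (ksize, ksize), sigma)
    color_f = cv2.GaussianBlur(color_image, (ksize, ksize), sigma)
    return depth_f, color_f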
Further, the specific implementation method of step S33 includes:
the ClearGrasp deep learning method is adopted to predict surface normals, identify boundaries and segment transparent objects from the filtered image, and the segmentation mask is used to modify the input depth image; a global optimization algorithm is then used to reconstruct the depth of all highly transparent object surfaces in the scene, and the 3D reconstruction is refined using the predicted surface normals together with the edge, occlusion and segmentation predictions.
Further, in step S33, ClearGrasp includes 3 neural networks, and the outputs of the 3 networks are integrated for global optimization;
the 3 neural networks include: a transparent object segmentation network, an edge identification network and a surface normal vector estimation network;
transparent object segmentation network: takes a single RGB picture as input and outputs a pixel mask of the transparent objects in the scene, i.e. it judges whether each pixel belongs to a transparent or non-transparent object; pixels judged to be transparent are removed in the subsequent optimization to obtain a modified depth map;
edge identification network: for a single RGB picture, outputs information on occlusion edges and connecting edges, which helps the network better distinguish different edges in the picture and make more accurate predictions at edges where the depth is discontinuous;
surface normal vector estimation network: takes the RGB picture as input, with L2 regularization applied to the output;
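Purely as a schematic sketch, the three sub-networks could be organized as below; the class and attribute names are hypothetical placeholders rather than the actual ClearGrasp code or API, and showing the "L2" step as unit-normalization of the predicted normals is an interpretation made for this sketch.

import torch

class TransparentPerception(torch.nn.Module):
    def __init__(self, seg_net, edge_net, normal_net):
        super().__init__()
        self.seg_net = seg_net          # transparent object segmentation network
        self.edge_net = edge_net        # occlusion / connecting edge identification network
        self.normal_net = normal_net    # surface normal vector estimation network

    def forward(self, rgb):
        mask = torch.sigmoid(self.seg_net(rgb))           # per-pixel transparency mask
        edges = torch.softmax(self.edge_net(rgb), dim=1)  # edge class probabilities
        normals = self.normal_net(rgb)
        normals = torch.nn.functional.normalize(normals, p=2, dim=1)  # unit-length normals
        return mask, edges, normals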
the global optimization algorithm is used to reconstruct the three-dimensional surface of the transparent object in the regions where depth is missing, filling the removed depth regions with the predicted surface normals of the transparent object while observing the depth discontinuities indicated by the occlusion edges; this is expressed by the following formula:

$$E = \lambda_D E_D + \lambda_S E_S + \lambda_N E_N B \qquad (8)$$

in equation (8), E denotes the overall error of the predicted depth, $E_D$ denotes the distance between the predicted depth and the observed original depth, $E_S$ denotes the depth difference between adjacent points, $E_N$ denotes the consistency between the predicted depth and the predicted surface normals, B weights the normal term according to whether the pixel lies on an occlusion boundary, and $\lambda_D$, $\lambda_S$, $\lambda_N$ denote the corresponding weighting coefficients.
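The following sketch only illustrates, under stated assumptions, how the weighted terms of equation (8) could be evaluated for a candidate depth map; the weight values, the finite-difference approximations and the normal-consistency residual are illustrative choices, not the exact global optimization solved by ClearGrasp.

import numpy as np

def energy(D_pred, D_obs, normals, valid_obs, occlusion_boundary,
           lam_D=1000.0, lam_S=0.001, lam_N=1.0):
    """D_pred, D_obs: HxW predicted / observed depth; normals: HxWx3 unit normals;
    valid_obs: HxW mask of pixels whose observed depth was kept;
    occlusion_boundary: HxW values in [0, 1], close to 1 on depth-discontinuous edges."""
    # E_D: stay close to the observed depth where it is valid
    E_D = np.sum(valid_obs * (D_pred - D_obs) ** 2)

    # E_S: smoothness, penalizing depth differences between adjacent pixels
    E_S = np.sum(np.diff(D_pred, axis=0) ** 2) + np.sum(np.diff(D_pred, axis=1) ** 2)

    # E_N: consistency between the depth gradients and the predicted surface normals,
    # down-weighted by B on occlusion boundaries where the depth is discontinuous
    gy, gx = np.gradient(D_pred)
    nx, ny, nz = normals[..., 0], normals[..., 1], normals[..., 2]
    tangency = (nx + nz * gx) ** 2 + (ny + nz * gy) ** 2
    B = 1.0 - occlusion_boundary
    E_N = np.sum(B * tangency)

    return lam_D * E_D + lam_S * E_S + lam_N * E_N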
Further, in step S34, the direction of the best grabbing position is the main direction of the object image gradient; main-direction extraction is performed on the depth image of the object to speed up the selection of the grabbing position, that is, gradient values are calculated along the x-axis and the y-axis, the gradient direction of each pixel is calculated, and the gradient directions are sorted and counted through a histogram, where the object gradient and gradient direction are calculated as follows:
the object gradient is calculated by performing two-dimensional convolution on the image with the two convolution kernels $[-1, 0, 1]$ and $[-1, 0, 1]^T$;
the gradient magnitude and direction are calculated as follows:

$$g = \sqrt{g_x^2 + g_y^2}, \qquad \theta = \arctan\!\left(\frac{g_y}{g_x}\right)$$

in the above formulas, $g_x$ and $g_y$ denote the gradient values in the x and y directions respectively, g denotes the gradient magnitude and θ denotes the gradient direction;
after the gradient is obtained, a threshold $g_{Thresh} = 250$ is set; only when the gradient is greater than this threshold does the robot have enough depth to place the clamping jaws for an effective grasp, i.e. $g > g_{Thresh}$.
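A minimal sketch of this gradient and main-direction computation follows; the histogram bin count is an assumed value for illustration.

import cv2
import numpy as np

def gradient_main_direction(depth_image, g_thresh=250.0, bins=36):
    """Convolve with [-1, 0, 1] and its transpose, threshold the gradient magnitude,
    and return the dominant gradient direction from a histogram of directions."""
    kx = np.array([[-1, 0, 1]], dtype=np.float32)        # horizontal kernel
    ky = kx.T                                            # vertical kernel
    img = depth_image.astype(np.float32)
    gx = cv2.filter2D(img, -1, kx)
    gy = cv2.filter2D(img, -1, ky)

    g = np.sqrt(gx ** 2 + gy ** 2)                       # gradient magnitude
    theta = np.arctan2(gy, gx)                           # gradient direction

    mask = g > g_thresh                                  # keep only strong edges
    hist, edges = np.histogram(theta[mask], bins=bins, range=(-np.pi, np.pi))
    main_dir = 0.5 * (edges[np.argmax(hist)] + edges[np.argmax(hist) + 1])
    return g, theta, mask, main_dir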
In the process of grabbing a transparent object by a robot, two contact lines exist when a clamping jaw is in contact with the object, and the conditions for selecting the proper contact lines are as follows:
the gradient directions of two contact lines are basically opposite;
the distance between the two contact lines does not exceed the maximum opening distance of the gripper;
the depth of the two contact lines is not more than 1/2 of the maximum depth in the clamping jaw;
the depth difference between the shallowest point in the area contained between the two contact lines and the shallowest point of the contact line does not exceed the internal depth of the clamping jaw;
the following formula (12) is used to evaluate the grasping reliability of a pair of contact lines; in it, G denotes the grasping reliability, $l_1$ and $l_2$ denote the lengths of the two contact lines between the clamping jaws and the transparent object to be grasped, and L denotes the width of the clamping jaw; the term in $l_1$, $l_2$ and L evaluates the contact line length; the ratio of the two contact line lengths is evaluated through $l_{max}$, the longer of the two contact lines, and $l_{min}$, the shorter one; the term in $d_l$ and $d_s$ evaluates how well the contact lines fit the gripper, where $d_l$ denotes the shallowest point of the contact lines and $d_s$ denotes the shallowest point in the rectangular frame region; and sin θ evaluates the misalignment of the two contact lines, where θ is the acute angle between the line connecting the midpoints of the two contact lines and the contact lines;
all contact line combinations are traversed through equation (12), and the combination with the highest score is selected as the best grasping position.
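Because the exact form of equation (12) is not reproduced here, the following sketch only mirrors the factors the text names (contact-line lengths, length ratio, fit to the jaw, misalignment angle); the way the terms are combined, and the function names, are illustrative assumptions rather than the invention's scoring formula.

import numpy as np

def score_pair(l1, l2, L, d_l, d_s, theta):
    """l1, l2: contact-line lengths; L: jaw width; d_l: shallowest point on the contact
    lines; d_s: shallowest point in the rectangle between them; theta: acute angle
    between the midpoint connector and the contact lines."""
    length_term = min((l1 + l2) / (2.0 * L), 1.0)        # longer contact is better
    ratio_term = min(l1, l2) / max(l1, l2)               # similar lengths are better
    fit_term = max(0.0, 1.0 - abs(d_l - d_s))            # assumed fit measure
    alignment = 1.0 - np.sin(theta)                      # aligned contact lines are better
    return length_term * ratio_term * fit_term * alignment

def best_grasp(candidates):
    """candidates: iterable of (l1, l2, L, d_l, d_s, theta, pose) tuples that already
    satisfy the four selection conditions listed above; returns the best-scoring pose."""
    scored = [(score_pair(*c[:6]), c[6]) for c in candidates]
    return max(scored, key=lambda s: s[0])[1]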
The invention has the following advantages and positive effects: aiming at the problem that transparent objects are difficult to grasp, the invention provides a ClearGrasp-based deep learning algorithm, by which the 3D data of highly transparent objects can be accurately predicted from an RGB-D camera.
Examples
Fig. 1 is a schematic flow chart of a method for grabbing a transparent object by a robot based on deep learning according to the present invention, and as shown in fig. 1, the present invention provides a method for grabbing a transparent object by a robot based on deep learning, which includes the following steps:
step S1: completing the establishment of a hardware environment of a system for grabbing the transparent object by the robot;
step S2: completing the calibration of a camera of a system for grabbing the transparent object by the robot;
step S3: completing the training of a grasp planning model based on a convolutional neural network and the grasping of the object by the robot in a real environment.
Specifically, the hardware environment of the system for grabbing the transparent object by the robot is shown in fig. 2, and comprises an Intel RealSense depth camera, at least one Ubuntu 18.04 computer running ROS, at least one UR5 robot with a gripper and at least one object to be grabbed;
the Intel RealSense depth camera is used for collecting 3D visual data and is installed on the UR5 robot;
the Ubuntu 18.04 computer is used for completing the training of the grasping network model;
the UR5 robot is used to grab the objects to be grabbed.
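As a hedged illustration of how a depth camera in such a setup can deliver aligned depth and color frames, a minimal pyrealsense2 sketch follows; the stream resolutions and frame rate are assumed values, not parameters specified by the invention.

import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)

align = rs.align(rs.stream.color)          # map depth pixels onto color pixels
try:
    frames = align.process(pipeline.wait_for_frames())
    depth_image = np.asanyarray(frames.get_depth_frame().get_data())
    color_image = np.asanyarray(frames.get_color_frame().get_data())
finally:
    pipeline.stop()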
Specifically, when the depth camera shoots an object, it captures a depth image and a color image at the same time; when the camera is calibrated, the color image and the depth image need to be calibrated so that each pixel of the depth image corresponds to a pixel of the color image, where step S2 specifically includes the following steps:
step S21: determining internal parameters and external parameters of a binocular camera through camera calibration, and completing the transformation from a world coordinate system to a camera coordinate system;
step S22: determining the relative position between the camera and the end effector through hand-eye calibration, and completing the transformation from the camera coordinate system to the robot end effector coordinate system.
Specifically, the method for implementing step S21 includes:
the transformation from the world coordinate system to the camera coordinate system is described by a rotation matrix R and a translation vector T, as shown in equation (1):

$$\begin{bmatrix} X_1 \\ Y_1 \\ Z_1 \end{bmatrix} = R_1 \begin{bmatrix} X_W \\ Y_W \\ Z_W \end{bmatrix} + T_1, \qquad \begin{bmatrix} X_2 \\ Y_2 \\ Z_2 \end{bmatrix} = R_2 \begin{bmatrix} X_W \\ Y_W \\ Z_W \end{bmatrix} + T_2 \qquad (1)$$

in equation (1), $R_1$, $T_1$ are the extrinsic parameters of the left-eye camera and $R_2$, $T_2$ are the extrinsic parameters of the right-eye camera, which can be obtained through camera calibration; $(X_W, Y_W, Z_W)$ are the coordinates of a point in space in the world coordinate system, $(X_1, Y_1, Z_1)$ are its coordinates in the left-eye camera coordinate system, and $(X_2, Y_2, Z_2)$ are its coordinates in the right-eye camera coordinate system;
taking the left-eye camera coordinate system as the reference, and letting the rotation matrix from the right-eye camera coordinate system to the left-eye camera coordinate system be R' and the translation vector be T', then:

$$\begin{bmatrix} X_1 \\ Y_1 \\ Z_1 \end{bmatrix} = R' \begin{bmatrix} X_2 \\ Y_2 \\ Z_2 \end{bmatrix} + T' \qquad (2)$$

from equations (1) and (2):

$$R' = R_1 R_2^{-1}, \qquad T' = T_1 - R_1 R_2^{-1} T_2 \qquad (3)$$
when shooting with the binocular camera, the position of the calibration plate is kept unchanged and the left-eye and right-eye cameras capture images of the calibration plate simultaneously; after several groups of image pairs are collected, they are imported into the Matlab calibration toolbox, which automatically calculates the rotation matrix and translation vector between the two cameras, and these can be used to complete the transformation from the world coordinate system to the camera coordinate system.
Specifically, the method for implementing step S22 includes:
the transformation from the camera coordinate system to the robot end effector coordinate system is solved through hand-eye calibration, where the "hand" denotes the end effector and the "eye" denotes the camera; the hand-eye calibration process involves 4 coordinate systems, namely the calibration plate coordinate system B, the camera coordinate system C, the end effector coordinate system T and the robot base coordinate system R;
a transformation matrix $T_B^R$ is used to describe the transformation from the calibration plate coordinate system B to the robot base coordinate system R, expressed as follows:

$$T_B^R = T_T^R \, T_C^T \, T_B^C \qquad (4)$$

in equation (4), $T_B^C$ denotes the transformation matrix from the calibration plate coordinate system B to the camera coordinate system C, i.e. the camera extrinsic parameters, which can be obtained through camera calibration; $T_T^R$ denotes the transformation matrix from the end effector coordinate system T to the robot base coordinate system R, which can be obtained from the parameters on the robot teach pendant; and $T_C^T$, the transformation matrix from the camera coordinate system C to the end effector coordinate system T, is the hand-eye matrix to be solved;
in the calibration process, the position of the calibration plate is kept unchanged and the robot is controlled to capture images of the calibration plate from different positions; selecting two positions for analysis gives equation (5):
$$T_B^R(i) = T_T^R(i)\, T_C^T(i)\, T_B^C(i), \qquad T_B^R(i{+}1) = T_T^R(i{+}1)\, T_C^T(i{+}1)\, T_B^C(i{+}1) \qquad (5)$$

in equation (5), $T_B^R(i)$ and $T_B^R(i{+}1)$ denote the transformation matrices from the calibration plate coordinate system B to the robot base coordinate system R at position i and position i+1 respectively, $T_T^R(i)$ and $T_T^R(i{+}1)$ denote the transformation matrices from the end effector coordinate system T to the robot base coordinate system R at position i and position i+1, $T_C^T(i)$ and $T_C^T(i{+}1)$ denote the hand-eye matrices to be solved at position i and position i+1, and $T_B^C(i)$ and $T_B^C(i{+}1)$ denote the transformation matrices from the calibration plate coordinate system B to the camera coordinate system C at position i and position i+1; because the relative position between the calibration plate and the robot base does not change, and the relative position between the robot end effector and the camera does not change, $T_B^R(i) = T_B^R(i{+}1)$ and $T_C^T(i) = T_C^T(i{+}1) = T_C^T$.
Combining these relations simultaneously gives equation (6):

$$\big[T_T^R(i{+}1)\big]^{-1} T_T^R(i)\; T_C^T = T_C^T\; T_B^C(i{+}1)\, \big[T_B^C(i)\big]^{-1} \qquad (6)$$

in equation (6), $T_T^R(i)$, $T_T^R(i{+}1)$, $T_B^C(i)$ and $T_B^C(i{+}1)$ are all known quantities, so $T_C^T$ can finally be solved, i.e. the transformation matrix from the camera coordinate system to the robot end effector coordinate system.
Specifically, the method for implementing step S3 includes:
step S31: using the RealSense RGB-D camera to scan and capture a color image and a depth image of the transparent object;
step S32: filtering the acquired images;
step S33: completing transparent object detection and segmentation by using the ClearGrasp deep learning algorithm;
step S34: searching and scoring candidate grabbing positions of the object by a contact line search method, and accurately grabbing the object after the optimal grabbing position is obtained.
Specifically, in step S32, a Gaussian filtering algorithm, which balances speed and effect, is selected to filter the acquired image; the Gaussian filtering formula is shown in equation (7):

$$f(x, y) = \frac{1}{2\pi\sigma^2} \exp\!\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \qquad (7)$$

in equation (7), f(x, y) denotes the Gaussian function value, x and y denote the horizontal and vertical distances from a pixel in the neighborhood to the center pixel of the neighborhood, and σ denotes the standard deviation.
Specifically, the network structure of the ClearGrasp deep learning algorithm model is shown in fig. 3, and the specific implementation method of step S33 includes:
the ClearGrasp deep learning method is adopted to predict surface normals, identify boundaries and segment transparent objects from the filtered image, and the segmentation mask is used to modify the input depth image; a global optimization algorithm is then used to reconstruct the depth of all highly transparent object surfaces in the scene, and the 3D reconstruction is refined using the predicted surface normals together with the edge, occlusion and segmentation predictions.
Specifically, in step S33, ClearGrasp includes 3 neural networks, and the outputs of the 3 networks are integrated for global optimization;
the 3 neural networks include: a transparent object segmentation network, an edge identification network and a surface normal vector estimation network;
transparent object segmentation network: takes a single RGB picture as input and outputs a pixel mask of the transparent objects in the scene, i.e. it judges whether each pixel belongs to a transparent or non-transparent object; pixels judged to be transparent are removed in the subsequent optimization to obtain a modified depth map;
edge identification network: for a single RGB picture, outputs information on occlusion edges and connecting edges, which helps the network better distinguish different edges in the picture and make more accurate predictions at edges where the depth is discontinuous;
surface normal vector estimation network: takes the RGB picture as input, with L2 regularization applied to the output;
the global optimization algorithm is used to reconstruct the three-dimensional surface of the transparent object in the regions where depth is missing, filling the removed depth regions with the predicted surface normals of the transparent object while observing the depth discontinuities indicated by the occlusion edges; this can be expressed by the following formula:

$$E = \lambda_D E_D + \lambda_S E_S + \lambda_N E_N B \qquad (8)$$

in equation (8), E denotes the overall error of the predicted depth, $E_D$ denotes the distance between the predicted depth and the observed original depth, $E_S$ denotes the depth difference between adjacent points, $E_N$ denotes the consistency between the predicted depth and the predicted surface normals, B weights the normal term according to whether the pixel lies on an occlusion boundary, and $\lambda_D$, $\lambda_S$, $\lambda_N$ denote the corresponding weighting coefficients.
Specifically, in step S34, the direction of the best grabbing position is the main direction of the object image gradient; main-direction extraction is performed on the depth image of the object to speed up the selection of the grabbing position, that is, gradient values are calculated along the x-axis and the y-axis, the gradient direction of each pixel is calculated, and the gradient directions are sorted and counted through a histogram, where the object gradient and gradient direction are calculated as follows:
the object gradient is calculated by performing two-dimensional convolution on the image with the two convolution kernels $[-1, 0, 1]$ and $[-1, 0, 1]^T$;
the gradient magnitude and direction are calculated as follows:

$$g = \sqrt{g_x^2 + g_y^2}, \qquad \theta = \arctan\!\left(\frac{g_y}{g_x}\right)$$

in the above formulas, $g_x$ and $g_y$ denote the gradient values in the x and y directions respectively, g denotes the gradient magnitude and θ denotes the gradient direction;
after the gradient is obtained, a threshold $g_{Thresh} = 250$ is set; only when the gradient is greater than this threshold does the robot have enough depth to place the clamping jaws for an effective grasp, i.e. $g > g_{Thresh}$.
In the process of grabbing a transparent object by a robot, two contact lines exist when a clamping jaw is in contact with the object, and the conditions for selecting the proper contact lines are as follows:
the gradient directions of two contact lines are basically opposite;
the distance between the two contact lines does not exceed the maximum opening distance of the gripper;
the depth of the two contact lines is not more than 1/2 of the maximum depth in the clamping jaw;
the depth difference between the shallowest point in the area contained between the two contact lines and the shallowest point of the contact line does not exceed the internal depth of the clamping jaw;
the following formula (12) is used to evaluate the grasping reliability of a pair of contact lines; in it, G denotes the grasping reliability, $l_1$ and $l_2$ denote the lengths of the two contact lines between the clamping jaws and the transparent object to be grasped, and L denotes the width of the clamping jaw; the term in $l_1$, $l_2$ and L evaluates the contact line length; the ratio of the two contact line lengths is evaluated through $l_{max}$, the longer of the two contact lines, and $l_{min}$, the shorter one; the term in $d_l$ and $d_s$ evaluates how well the contact lines fit the gripper, where $d_l$ denotes the shallowest point of the contact lines and $d_s$ denotes the shallowest point in the rectangular frame region; and sin θ evaluates the misalignment of the two contact lines, where θ is the acute angle between the line connecting the midpoints of the two contact lines and the contact lines;
all contact line combinations are traversed through equation (12), and the combination with the highest score is selected as the best grasping position.
The invention has the following advantages and positive effects: aiming at the problem that transparent objects are difficult to grasp, the invention provides a ClearGrasp-based deep learning algorithm, by which the 3D data of highly transparent objects can be accurately predicted from an RGB-D camera.
The above-mentioned embodiments only express several embodiments of the present invention, and their description is relatively specific and detailed, but this should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the appended claims.