CN114851201A - Mechanical arm six-degree-of-freedom vision closed-loop grabbing method based on TSDF three-dimensional reconstruction - Google Patents

Mechanical arm six-degree-of-freedom vision closed-loop grabbing method based on TSDF three-dimensional reconstruction Download PDF

Info

Publication number
CN114851201A
CN114851201A (Application CN202210551360.5A)
Authority
CN
China
Prior art keywords
grabbing
mechanical arm
camera
voxel
coordinate system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210551360.5A
Other languages
Chinese (zh)
Other versions
CN114851201B (en)
Inventor
欧林林
徐靖
禹鑫燚
周利波
魏岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202210551360.5A priority Critical patent/CN114851201B/en
Publication of CN114851201A publication Critical patent/CN114851201A/en
Application granted granted Critical
Publication of CN114851201B publication Critical patent/CN114851201B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1679Programme controls characterised by the tasks executed
    • B25J9/1692Calibration of manipulator
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1694Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697Vision controlled systems
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Abstract

The invention relates to a mechanical arm six-degree-of-freedom vision closed-loop grabbing method based on TSDF three-dimensional reconstruction. The method comprises the following steps. Step one: calibrate the mechanical arm base coordinate system and the camera coordinate system by the Zhang Zhengyou camera calibration method and the ArUco Markers calibration method. Step two: perform three-dimensional scene reconstruction on the acquired image information with the TSDF function to reduce environmental noise between objects. Step three: establish a reinforcement learning network model. Step four: back-project the predicted grabbing posture of the end effector into the three-dimensional reconstructed scene and judge the quality of the predicted grab. Step five: complete the grabbing motion of the mechanical arm through robot forward and inverse kinematics. Step six: train the reinforcement learning model so that the mechanical arm completes the grabbing action. The invention overcomes the defects of the prior art and provides a mechanical arm six-degree-of-freedom vision closed-loop grabbing system based on TSDF three-dimensional reconstruction that is easy to implement and highly applicable; it reduces the environmental noise caused by occlusion and stacking between objects and reduces the depth errors caused by interference with a single vision sensor. In addition, the system realizes fast, real-time target detection and completes the grabbing action while ensuring high precision.

Description

Mechanical arm six-degree-of-freedom vision closed-loop grabbing method based on TSDF three-dimensional reconstruction
Technical Field
The invention belongs to the technology of grabbing objects with a six-degree-of-freedom mechanical arm in an unstructured environment, and in particular to the field of robot grabbing control.
Background
Object grabbing has always been an important problem in robotics, yet it still lacks a fully satisfactory solution. With a multifunctional control system, a mechanical arm can flexibly control a high-degree-of-freedom end effector in three-dimensional space, enabling dexterous grabbing of objects and dynamic response to environmental changes. Recently, with the rapid development of deep learning and reinforcement learning and the construction of corresponding systems, many feasible ideas have been proposed for intelligent mechanical arm grabbing.
Although six-degree-of-freedom grabbing control of a mechanical arm has high practical value and suits more complex operating environments, most current data-driven grabbing algorithms only perform top-down grabbing (4-DoF: x, y, z, yaw) in a simple tabletop setting, or rely on physical analysis to derive a suitable grabbing posture. Because motion is restricted to top-down grabbing, such algorithms are strongly limited in their application scenarios: the mechanical arm end effector can only approach the object from the vertical downward direction and in some cases cannot grab the object along that direction at all. For example, it is difficult for the gripper to grab a horizontally placed plate. In addition, methods based on physical analysis not only require a large amount of experimental environment data but also need to compute and estimate an accurate target model, which inevitably incurs a large amount of time and computational cost; moreover, the physical model of an unstructured target object is often not universal, so most of these algorithms are difficult to apply directly to grabbing novel objects, and the robustness and generalization of such systems are poor.
Thus, six-degree-of-freedom (6-DOF) grabbing for mechanical arms has been proposed. PointNetGPD, proposed by Hongzhuo Liang, adopts a two-step sampling-and-evaluation method in which a large number of samples are evaluated to determine a reliable grab pose; however, this method is quite time consuming. (Hongzhuo Liang et al., "PointNetGPD: Detecting grasp configurations from point sets". In: 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 3629-3635). Florence et al. perform pose transfer from existing grab poses; however, the success rate of the algorithm is relatively low for target geometries and objects not recorded in the data set, so it cannot be generalized to new application scenarios. (Peter Florence, Lucas Manuelli, and Russ Tedrake. "Dense Object Nets: Learning Dense Visual Object Descriptors By and For Robotic Manipulation". In: Conference on Robot Learning (CoRL) (2018)). More recently, a point cloud scene is reconstructed from partial views acquired by an RGB-D camera, object features are extracted with a designed PointNet++ network model, and 6-DoF grab pose regression is then completed directly and grab planning is performed on the reconstructed point cloud scene. (P. Ni, W. Zhang, X. Zhu, and Q. Cao, "Learning an end-to-end spatial grasp generation and refinement algorithm from simulation," Machine Vision and Applications, vol. 32, no. 1, pp. 1-12, 2021.)
Disclosure of Invention
The invention overcomes the defects of the prior art and provides a six-degree-of-freedom grabbing method for a mechanical arm that is easy to implement and highly applicable. The invention designs a feature extraction network and a Policy Gradient reinforcement learning framework, and can realize positioning and six-degree-of-freedom planned grabbing of objects not recorded in the training data set.
The method takes an image sequence as input. First, the mechanical arm coordinate system and the camera coordinate system are established through the Zhang Zhengyou camera calibration method and the ArUco Markers calibration method. Color and depth information of the grabbing stage is then acquired with an Intel RealSense D435 depth camera, and three-dimensional scene reconstruction is performed on the acquired image information through a Truncated Signed Distance Function (TSDF) so as to reduce environmental noise between objects. A two-dimensional top-down image of the reconstructed scene is captured and used as the input of the grabbing neural network model designed in this work; the model encodes the input image, extracts features, and outputs the predicted grab region and mechanical arm grabbing posture for the current environment. The structure of the network model is shown in figure 1. The model adopts a ResNet fully convolutional feature extraction architecture to learn to extract unstructured features from RGB-D partial observations. The model channel-concatenates color features and depth features, using 2 additional Batch Normalization (BN) layers, nonlinear activation functions (ReLU) and interleaved 1×1 convolutional layers. Average Pooling is then used to aggregate the feature embeddings of the same pixel unit, and linear upsampling is performed so that convolutional features are shared across different positions and orientations and the output picture matches the input resolution. The coordinates of the hotspot block with the maximum grabbing probability are selected and mapped to the hemispherical three-dimensional grabbing pose points defined before the experiment (see figure 2), and the corresponding 6-dimensional pose is taken as the predicted 6-degree-of-freedom grabbing posture. Next, the predicted grabbing posture output by the model is back-projected into the reconstructed three-dimensional scene, a depth image under the current posture is captured from the three-dimensional scene with the TSDF function, and the grab quality is judged from the depth information at the two ends of the gripper. Finally, when the depth of the target coordinate point in the rendered image is less than or equal to the depth (15 cm) of the fingertips at the two ends of the gripper, the predicted posture is fed into the robot forward kinematics to obtain the corresponding mechanical arm motion trajectory, and the current grabbing action of the mechanical arm is completed. If the grab succeeds, the reward of the agent is assigned 1; otherwise it is assigned 0. The system grabbing flow chart is shown in figure 3.
The invention relates to a mechanical arm six-degree-of-freedom vision closed-loop grabbing method based on TSDF three-dimensional reconstruction, which comprises the following specific steps of:
Step 1: calibrating the mechanical arm base coordinate system and the camera coordinate system by the Zhang Zhengyou camera calibration method and the ArUco Markers calibration method:
Firstly, the Intel D415 depth camera is fixed vertically at the end of the mechanical arm, so that the camera and the clamping jaw of the end effector are related by a fixed pose transformation and the binocular camera can acquire image information of objects on the grabbing operation table. Then, the internal and external parameters of the camera are calibrated by the Zhang Zhengyou camera calibration method and the ArUco Markers calibration method, and a three-dimensional coordinate system with the mechanical arm base as the origin is constructed. The calibration formulas of the camera internal and external parameters are as follows:
[x_c, y_c, z_c, 1]^T = [R, T; 0, 1] · [x_w, y_w, z_w, 1]^T        (1)

x = f·x_c/z_c,   y = f·y_c/z_c        (2)

z_m·[u, v, 1]^T = [f/d_x, 0, u_0; 0, f/d_y, v_0; 0, 0, 1] · [x_c, y_c, z_c]^T        (3)
where x_w, y_w, z_w are the three-dimensional coordinates in the world coordinate system; x_c, y_c, z_c are the coordinates in the camera coordinate system; u and v are the pixel coordinates, with (u_0, v_0) the principal point; z_m is the depth of the point in the camera coordinate system; p is the imaging point in the image coordinate system; f is the imaging distance; d_x, d_y describe the pixel dimensions in the image coordinate system; R and T are the rotation matrix and the translation matrix. Equation (1) is the transformation from the world coordinate system to the camera coordinate system. Equation (2) is the transformation from the camera coordinate system to the ideal image coordinate system. Equation (3) is the matrix from the camera coordinate system to the pixel coordinate system, i.e. the internal reference (intrinsic) matrix.
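For illustration, a minimal Python sketch of step 1 is given below: intrinsic calibration with Zhang Zhengyou's chessboard method and an ArUco-based pose estimate for the extrinsics, using OpenCV. The board size, square size, marker length and the cv2.aruco calls are assumptions (the ArUco API differs between OpenCV versions); this is not the exact calibration code of the invention.

```python
import cv2
import numpy as np

def calibrate_intrinsics(gray_images, board_size=(9, 6), square=0.025):
    """Zhang-style calibration: returns the camera matrix K and distortion coefficients."""
    objp = np.zeros((board_size[0] * board_size[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2) * square
    obj_pts, img_pts = [], []
    for gray in gray_images:
        found, corners = cv2.findChessboardCorners(gray, board_size)
        if found:
            obj_pts.append(objp)
            img_pts.append(corners)
    _, K, dist, _, _ = cv2.calibrateCamera(obj_pts, img_pts,
                                           gray_images[0].shape[::-1], None, None)
    return K, dist

def marker_pose(gray, K, dist, marker_len=0.05):
    """Camera pose relative to one ArUco marker (legacy cv2.aruco API, pre-4.7)."""
    aruco = cv2.aruco
    corners, ids, _ = aruco.detectMarkers(gray, aruco.getPredefinedDictionary(aruco.DICT_4X4_50))
    rvecs, tvecs, _ = aruco.estimatePoseSingleMarkers(corners, marker_len, K, dist)
    R, _ = cv2.Rodrigues(rvecs[0])      # rotation vector -> rotation matrix
    return R, tvecs[0].reshape(3)       # extrinsic rotation and translation
```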
Step 2: and (3) carrying out three-dimensional scene reconstruction on the acquired image information by using a TSDF function so as to reduce the environmental noise points between the objects:
step 2.1: in an actual experiment, a single sensor is adopted to obtain visual information, and information (color information and depth information) captured by a binocular camera on a grabbing operating platform is obtained through OpenCV.
Step 2.2: due to light, mutual shielding of objects and other environmental factors, a large noise point exists in a depth image acquired by a sensor, and therefore the capturing and prediction of the network model on the image are influenced. Firstly, spherical point sampling is carried out on a grabbing operation table before grabbing each time, and depth errors in a single image are reduced by sampling local images for multiple times. Secondly, the sampling point posture and the sampling image information are fused into a three-dimensional reconstruction voxel grid by utilizing a TSDF function. For a single voxel, not only the intuitive x, y, z coordinates are contained, but also two other values (SDF and CAM) are used to represent the voxel-to-surface distance. Where SDF is a signed distance function for depth map fusion, which uses 0 for the surface, positive values for points outside the model, negative values for points inside the model, and a value proportional to the distance of the point to the nearest surface. The surface is easily extracted as the zero crossing point of the function.
Step 2.3: the coordinate system where the voxel grid is located is defined as a world coordinate system, and the voxel coordinate can be expressed as
V_j^w = (x, y, z)^T. The pose of the ith camera is T_i^w, and the camera intrinsic (internal reference) matrix is K. The voxel grid in the world coordinate system is therefore projected into the camera coordinate system through the transformation matrix (T_i^w)^(-1) obtained by ICP registration, and converted into the image coordinate system I_ij according to the camera intrinsic matrix K, where i ∈ (0, 1, ..., n), j ∈ (0, 1, ..., m):

I_ij = K·(T_i^w)^(-1)·V_j^w
After projecting the voxel grid to the image coordinate system, it is first necessary to calculate the initial SDF value for each voxel, as shown in equation (18)
sdf_j = dep(I_ij) − ||V_j^w − t_i^w||_2        (18)

where V_j^w denotes the position of the jth voxel in the world coordinate system, t_i^w denotes the position of the ith camera in the world coordinate system, and dep(I_ij) denotes the depth value of the jth voxel in the ith camera depth image coordinate system.
The sdf value for each voxel is then truncated according to equation (19).
tsdf_i = max(−1, min(1, sdf_i/trunc))        (19)

where tsdf_i is the truncated voxel distance value and trunc denotes the manually set truncation distance, which can be understood as the error bound of the camera depth information: the larger the error, the larger this value should be set, otherwise much of the information acquired by the depth camera may be lost. Here trunc is set to 1.
After truncating the sdf value for each voxel, the tsdf value for each voxel is recalculated according to equation (20):
TSDF_j(x, y, z) = (ω_(j−1)(x, y, z)·tsdf_(j−1) + ω_j(x, y, z)·tsdf_j) / (ω_(j−1)(x, y, z) + ω_j(x, y, z))        (20)

where ω_j(x, y, z) is the weight of the voxel in the current-frame global voxel grid, ω_(j−1)(x, y, z) is the weight of the voxel in the previous-frame voxel grid, tsdf_j is the distance from the voxel in the current-frame global data cube to the object surface, tsdf_(j−1) is the distance from the voxel in the previous-frame global data cube to the object surface, V_z denotes the z-axis coordinate of the voxel in the camera coordinate system, and D(u, v) denotes the depth value at pixel (u, v) of the current-frame depth image.
Equation (9) calculates a weight value for each voxel.
w_j = w_(j−1) + w_j        (9)
The environment information of the grabbing operation table is then rendered from a top-down viewpoint and used as the input of the network model.
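To make the per-frame TSDF update of step 2 concrete, the following NumPy sketch projects the voxel grid with K and the camera transform, computes the signed distance, truncates it and fuses it with a weighted running average. The flat voxel layout, the frame weight of 1 and the variable names are assumptions rather than the exact implementation of the invention.

```python
import numpy as np

def integrate_frame(tsdf, weight, voxel_xyz, K, T_cam_world, depth, trunc=1.0):
    """voxel_xyz: (N, 3) voxel centers in the world frame; depth: HxW depth image in metres."""
    ones = np.ones((voxel_xyz.shape[0], 1))
    cam = (T_cam_world @ np.hstack([voxel_xyz, ones]).T).T[:, :3]   # world -> camera
    z = cam[:, 2]
    u = np.round(K[0, 0] * cam[:, 0] / z + K[0, 2]).astype(int)     # camera -> pixel via K
    v = np.round(K[1, 1] * cam[:, 1] / z + K[1, 2]).astype(int)
    h, w = depth.shape
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    d = np.zeros_like(z)
    d[valid] = depth[v[valid], u[valid]]
    valid &= d > 0
    sdf = d - z                                  # measured depth minus voxel depth
    tsdf_new = np.clip(sdf / trunc, -1.0, 1.0)   # truncation, equation (19)
    w_new = np.where(valid, weight + 1.0, weight)
    fused = np.where(valid,
                     (weight * tsdf + 1.0 * tsdf_new) / np.maximum(w_new, 1e-6),
                     tsdf)                       # weighted fusion, equation (20)
    return fused, w_new
```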
Step 3: establishing a reinforcement learning network model:
step 3.1: designing a mechanical arm grabbing strategy, and defining an action space for reinforcement learning: the invention designs a two-dimensional convolution network to directly regress the 6-DoF grabbing posture by increasing the network dimension. The gripper defines the gripping action as shown in figure 2.
Step 3.2: converting each element in the array into an angle of rotation of the tail end of the mechanical arm around three coordinate axes of x, y and z respectively, wherein a specific conversion formula is as follows:
a_x = ((best_pix_ind[0] − 14)·φ) − π        (10)
b_y = (best_pix_ind[1] − 14)·φ              (11)
r_z = best_pix_ind[2]·θ                     (12)

where a_x denotes the rotation angle of the end of the mechanical arm around the x axis, i.e. the roll angle of the end effector; b_y denotes the rotation angle around the y axis, i.e. the pitch angle of the end effector; r_z denotes the rotation angle around the z axis, i.e. the yaw angle of the end effector; and φ and θ are deviation values set for the experiment.
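A small Python helper illustrating equations (10)-(12): the argmax index of the heatmap (best_pix_ind) is turned into roll, pitch and yaw of the end effector. The deviation values φ = 30/28 and θ = 180/16 follow the worked example given later in the text and the formulas are kept exactly as written (no unit conversion is added), so treat those values as assumptions.

```python
import math

def index_to_euler(best_pix_ind, phi=30.0 / 28.0, theta=180.0 / 16.0):
    # best_pix_ind = (x_bin, y_bin, z_bin) selected from the grasp heatmap
    a_x = (best_pix_ind[0] - 14) * phi - math.pi   # roll about the x axis, equation (10)
    b_y = (best_pix_ind[1] - 14) * phi             # pitch about the y axis, equation (11)
    r_z = best_pix_ind[2] * theta                  # yaw about the z axis, equation (12)
    return a_x, b_y, r_z
```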
Step 3.3: designing a feature extraction network model: the model adopts a Resnet full convolution feature extraction network architecture to learn to extract unstructured features from RGB-D partial observation. The multilayer feature extraction Convolutional layer extracts the features of the color image and the depth image, and then channel-connects the color feature and the depth feature by using a Concat function, and keeps the resolution of the output thermodynamic diagram consistent with the input image by using 2 additional Batch Normalization (BN), Nonlinear activation functions (ReLU), 1 × 1 volumetric Layers, and upsampling. And selecting the corresponding gesture in the action space corresponding to the maximum vector value coordinate in the tensor map as the predicted grabbing gesture to be output.
Step 3.4: the network is forward-inferred by the following formula:
[Equations (13)-(16) are reproduced as images in the original publication.]

where equation (13) represents the expected return under state s and action a, g_t denotes the grabbing action taken at time t, s_t denotes the state at time t, and r_t denotes the reward at time t; equation (14) represents the total reward function of the network; equation (15) is the state distribution function; and equation (16) represents the state-action function.
Step 3.5, designing a reinforcement learning network loss function, and further improving the detection precision of the algorithm by adopting a calculation cross entropy function:
[Equation (17) is reproduced as an image in the original publication.]

where τ = s_0 a_0 s_1 a_1 ... s_n a_n ... is the Markov process (the state-action trajectory).
Due to the fact that
[an identity reproduced as an image in the original publication] and P_r{a|s} = π(s, a), equation (18) can be obtained:

[Equation (18) is reproduced as an image in the original publication.]
The weight update function is as follows:
[Equations (19)-(21) are reproduced as images in the original publication.]

where f_ω: S × A → R is an approximation function; when f_ω attains its minimum and Δω = 0, equation (21) can be derived.
Step 3.6: setting reinforcement learning reward: if the crawling is successful, the reward of the agent is assigned to be 1, otherwise, the reward of the agent is assigned to be 0.
Step 4: back-projecting the predicted grabbing posture of the end effector into the three-dimensional reconstructed scene and judging the quality of the predicted grab: the 6-degree-of-freedom grabbing posture output by the network is back-projected into the three-dimensional voxel scene established before grabbing, and a depth image under the predicted posture is captured through view-frustum projection, as shown in figure 4. The view frustum is a three-dimensional volume whose position is related to the camera's extrinsic matrix, and its shape determines how the model is projected from pixel space onto the plane. Perspective projection uses a pyramid as the view frustum, with the camera located at the apex of the pyramid. The pyramid is truncated by two planes, one in front of the other, forming a frustum called the View Frustum; its model is shown in figure 5, and only geometry located inside the frustum is visible. In this clipping space the two planes are special and are called the near clip plane and the far clip plane, respectively.
The distance between the parallel clamping jaws and the object is inferred by comparing the depth information at the two ends of the gripper in the rendered image, so that the grab quality of the predicted grabbing posture is judged; this forms the closed-loop feedback of the 6-degree-of-freedom mechanical arm grabbing system.
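A minimal sketch of this closed-loop check: the depth rendered from the TSDF volume is sampled at the two fingertip pixels of the parallel gripper and compared with the 15 cm fingertip depth mentioned above. The fingertip pixel locations are assumed to come from the back-projected grasp pose.

```python
import numpy as np

def grasp_is_feasible(rendered_depth, tip_px_left, tip_px_right, fingertip_depth=0.15):
    """rendered_depth: HxW depth image rendered under the predicted grasp pose (metres)."""
    d_left = rendered_depth[tip_px_left[1], tip_px_left[0]]
    d_right = rendered_depth[tip_px_right[1], tip_px_right[0]]
    # the object must lie within reach of both fingertips
    return bool(d_left <= fingertip_depth and d_right <= fingertip_depth)
```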
Step 5: the grabbing motion of the mechanical arm is completed through robot forward and inverse kinematics:
Firstly, when the depth at the two ends of the gripper in the depth image rendered in step 4 is less than or equal to the fingertip depth of the gripper, i.e. the predicted gripper posture straddles the two sides of the object, the predicted posture is judged to be graspable. The mechanical arm is then controlled to execute the grab plan according to the predicted grabbing posture: the 6 joint angles of the mechanical arm in the current state are solved through robot inverse kinematics. The predicted 6-dimensional grabbing posture of the mechanical arm end output by the reinforcement learning network in step 3 is then fed into the robot forward kinematics to solve the motion trajectory that moves the mechanical arm end effector from the current posture to the predicted posture point. When the end effector of the mechanical arm reaches the predicted grabbing posture, the robot sends a gripper closing signal to attempt the grab. After the gripper closes, the end effector moves vertically upward by 15 cm, a depth image after the upward motion is obtained from the Intel RealSense D435 camera, and whether the actual grab succeeded is judged by computing the depth at the two ends of the gripper. When the grab succeeds, the reinforcement learning model reward is assigned 1; when the grab fails, the reinforcement learning network reward is assigned 0.
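The step 5 routine can be summarised by the sketch below. The robot and camera objects and their methods (inverse_kinematics, plan_to, follow, close_gripper, move_up, get_depth) are hypothetical placeholders standing in for the UR5 and RealSense interfaces, not a real driver API.

```python
def execute_grasp(robot, camera, predicted_pose, check_fn, lift=0.15):
    # hypothetical robot/camera interface; only the control flow mirrors the text
    joints_now = robot.inverse_kinematics(robot.current_pose())   # 6 joint angles in the current state
    trajectory = robot.plan_to(predicted_pose, start=joints_now)  # forward-kinematics-based motion plan
    robot.follow(trajectory)
    robot.close_gripper()
    robot.move_up(lift)                                           # lift 15 cm vertically
    depth_after = camera.get_depth()
    success = check_fn(depth_after)                               # depth at the gripper tips decides success
    return 1 if success else 0                                    # reinforcement learning reward
```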
Step 6: performing reinforcement learning model training so that the mechanical arm completes the grabbing action:
and continuously repeating the steps 2 to 5 in a CoppelliaSim Edu simulation environment, reducing a loss function in the reinforcement learning model through reinforcement learning reward, and updating the model weight parameter. And finally, importing the trained weight parameters into a real mechanical arm UR5, carrying out experimental debugging, repeating the steps 1 to 5, and completing a six-degree-of-freedom closed-loop grabbing task of the mechanical arm.
In conclusion, the method has the advantage that the 6-dimensional grabbing posture of the mechanical arm is directly predicted and output by increasing the dimension of a two-dimensional neural network. Meanwhile, the invention performs three-dimensional reconstruction of the complex grabbing environment through the TSDF function, which reduces the environmental noise caused by occlusion and stacking between objects and reduces the depth error caused by interference with a single vision sensor. A reinforcement learning network is designed for the method, which overcomes the drawbacks of complex computation and high time cost when the mechanical arm grabbing posture is derived by traditional physical analysis, and solves the problem that 6-DOF grabbing postures cannot be applied to target objects not recorded in the data set. The method not only guarantees a high grabbing success rate for the mechanical arm model but also benefits the generalization of reinforcement learning, i.e. it can be applied to new grabbed objects, and it addresses the time-consuming computation of traditional methods and the instability caused by partial point cloud models as input. The invention realizes real-time detection of the grabbed object and 6-DOF grabbing.
Drawings
FIG. 1 is a block diagram of a network model in the present invention;
FIG. 2 is a diagram of the attitude angles of the hemispherical three-dimensional gripper defined by the present invention;
FIG. 3 is a flow chart of the system capture of the present invention;
FIG. 4 is a view of a viewing pyramid projection;
FIG. 5 is a View Frustum model diagram;
FIG. 6 is an example of an experimental grab;
Detailed Description
The invention is further illustrated in the following with reference to the accompanying drawings.
The invention discloses a mechanical arm six-degree-of-freedom vision closed-loop grabbing method based on TSDF three-dimensional reconstruction; figure 6 shows an example of object grabbing. The specific process is as follows:
Step 1: calibrating the mechanical arm base coordinate system and the camera coordinate system by the Zhang Zhengyou camera calibration method and the ArUco Markers calibration method:
Firstly, an Intel D415 depth camera is fixed vertically at the end of the mechanical arm, so that the camera and the clamping jaw of the end effector are related by a fixed pose transformation and the binocular camera can acquire image information of objects on the grabbing operation table. Then, the internal and external parameters of the camera are calibrated by the Zhang Zhengyou camera calibration method and the ArUco Markers calibration method, and a three-dimensional coordinate system with the mechanical arm base as the origin is constructed. The calibration formulas of the camera internal and external parameters are as follows:
[x_c, y_c, z_c, 1]^T = [R, T; 0, 1] · [x_w, y_w, z_w, 1]^T        (1)

x = f·x_c/z_c,   y = f·y_c/z_c        (2)

z_m·[u, v, 1]^T = [f/d_x, 0, u_0; 0, f/d_y, v_0; 0, 0, 1] · [x_c, y_c, z_c]^T        (3)
where x_w, y_w, z_w are the three-dimensional coordinates in the world coordinate system; x_c, y_c, z_c are the coordinates in the camera coordinate system; u and v are the pixel coordinates, with (u_0, v_0) the principal point; z_m is the depth of the point in the camera coordinate system; p is the imaging point in the image coordinate system; f is the imaging distance; d_x, d_y describe the pixel dimensions in the image coordinate system; R and T are the rotation matrix and the translation matrix. Equation (1) is the transformation from the world coordinate system to the camera coordinate system. Equation (2) is the transformation from the camera coordinate system to the ideal image coordinate system. Equation (3) is the matrix from the camera coordinate system to the pixel coordinate system, i.e. the internal reference (intrinsic) matrix.
Step 2: and (3) carrying out three-dimensional scene reconstruction on the acquired image information by using a TSDF function so as to reduce the environmental noise points between the objects:
step 2.1: in an actual experiment, a single sensor is adopted to obtain visual information, and information (color information and depth information) captured by a binocular camera on a grabbing operating platform is obtained through OpenCV.
Step 2.2: due to light, mutual shielding of objects and other environmental factors, a large noise point exists in a depth image acquired by a sensor, and therefore the capturing and prediction of the network model on the image are influenced. Firstly, spherical point sampling is carried out on a grabbing operation table before grabbing each time, and depth errors in a single image are reduced by sampling local images for multiple times. Secondly, the sampling point posture and the sampling image information are fused into a three-dimensional reconstruction voxel grid by utilizing a TSDF function. For a single voxel, not only the intuitive x, y, z coordinates are contained, but also two other values (SDF and CAM) are used to represent the voxel-to-surface distance. Where SDF is a signed distance function for depth map fusion, which uses 0 for the surface, positive values for points outside the model, negative values for points inside the model, and a value proportional to the distance of the point to the nearest surface. The surface is easily extracted as the zero crossing point of the function. If the depth value D (u, V) at the ith frame depth image D (u, V) is not 0, comparing D (u, V) with z in the voxel camera coordinates V (x, y, z), and if D (u, V) is larger than z, indicating that the voxel is closer to the camera for reconstructing the outer surface of the scene; if D (u, v) is less than z, this voxel is further from the camera and is reconstructed as the interior surface of the scene. I.e. the voxel point is outside the surface (closer to the camera side), the SDF value is positive and the voxel is inside the surface, the SDF value is negative.
Step 2.3: the coordinate system where the voxel grid is located is defined as a world coordinate system, and the voxel coordinate can be expressed as
V_j^w = (x, y, z)^T. The pose of the ith camera is T_i^w, and the camera intrinsic (internal reference) matrix is K. The voxel grid in the world coordinate system is therefore projected into the camera coordinate system through the transformation matrix (T_i^w)^(-1) obtained by ICP registration, and converted into the image coordinate system I_ij according to the camera intrinsic matrix K, where i ∈ (0, 1, ..., n), j ∈ (0, 1, ..., m):

I_ij = K·(T_i^w)^(-1)·V_j^w
After projecting the voxel grid to the image coordinate system, it is first necessary to calculate the initial SDF value for each voxel, as shown in equation (18)
sdf_j = dep(I_ij) − ||V_j^w − t_i^w||_2        (18)

where V_j^w denotes the position of the jth voxel in the world coordinate system, t_i^w denotes the position of the ith camera in the world coordinate system, and dep(I_ij) denotes the depth value of the jth voxel in the ith camera depth image coordinate system.
The sdf value for each voxel is then truncated according to equation (19).
tsdf_i = max(−1, min(1, sdf_i/trunc))        (19)

where trunc denotes the manually set truncation distance, which can be understood as the error bound of the camera depth information: the larger the error, the larger this value should be set, otherwise much of the information acquired by the depth camera may be lost. Here trunc is set to 1.
After truncating the sdf value for each voxel, the tsdf value for each voxel is recalculated according to equation (20):
TSDF_j(x, y, z) = (ω_(j−1)(x, y, z)·tsdf_(j−1) + ω_j(x, y, z)·tsdf_j) / (ω_(j−1)(x, y, z) + ω_j(x, y, z))        (20)

where ω_j(x, y, z) is the weight of the voxel in the current-frame global voxel grid, ω_(j−1)(x, y, z) is the weight of the voxel in the previous-frame voxel grid, tsdf_j is the distance from the voxel in the current-frame global data cube to the object surface, tsdf_(j−1) is the distance from the voxel in the previous-frame global data cube to the object surface, V_z denotes the z-axis coordinate of the voxel in the camera coordinate system, and D(u, v) denotes the depth value at pixel (u, v) of the current-frame depth image.
Equation (9) calculates a weight value for each voxel.
w_j = w_(j−1) + w_j        (9)
The environment information of the grabbing operation table is then rendered from a top-down viewpoint and used as the input of the network model.
Step 3: establishing a reinforcement learning network model:
Step 3.1: designing the mechanical arm grabbing strategy and defining the action space for reinforcement learning: target grabbing is commonly divided into a grab position region and a three-dimensional rotation amount, predicted respectively by two network models. This method instead uses a single two-dimensional convolutional network that directly regresses the 6-DoF grabbing posture by increasing the network output dimension. The invention defines the grabbing action space for a single object as 16 × 28 × 28 actions, where 16 is the number of rotation steps of the end effector around the z coordinate axis (22.5° each) and 28 × 28 is the number of rotation steps of the end effector around the x and y coordinate axes. The grabbing actions of the gripper are defined as shown in figure 2.
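A small sketch of how the 16 × 28 × 28 discrete action space could be enumerated, and how a flat argmax index is unravelled into the bins used by equations (10)-(12). The indexing order is an assumption about how the heatmap is flattened.

```python
import numpy as np

N_YAW, N_X, N_Y = 16, 28, 28          # 16 yaw bins of 22.5 degrees, 28 x 28 roll/pitch bins

def unravel_action(flat_index):
    yaw_bin, x_bin, y_bin = np.unravel_index(flat_index, (N_YAW, N_X, N_Y))
    return int(x_bin), int(y_bin), int(yaw_bin)   # ordered as best_pix_ind for equations (10)-(12)
```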
Step 3.2: converting each element in the array into an angle of rotation of the tail end of the mechanical arm around three coordinate axes of x, y and z respectively, wherein a specific conversion formula is as follows:
a_x = ((best_pix_ind[0] − 14)·φ) − π        (10)
b_y = (best_pix_ind[1] − 14)·φ              (11)
r_z = best_pix_ind[2]·θ                     (12)

where a_x denotes the rotation angle of the end of the mechanical arm around the x axis, i.e. the roll angle of the end effector; b_y denotes the rotation angle around the y axis, i.e. the pitch angle of the end effector; r_z denotes the rotation angle around the z axis, i.e. the yaw angle of the end effector. In the actual grabbing example, the deviation values φ = 30/28 and θ = 180/16 are substituted to calculate the predicted grabbing posture.
Step 3.3: designing a feature extraction network model: the model adopts a Resnet full convolution feature extraction network architecture to learn to extract unstructured features from RGB-D partial observation. The multi-layer feature extraction Convolutional layer extracts the features of the color image and the depth image, then channel connects the color features and the depth features through the Concat function, and keeps the resolution of the output thermal diagram consistent with the resolution of the input image by adopting 2 additional Batch Normalization (BN), Nonlinear activation functions (ReLU), 1 × 1 conditional Layers, and upsampling. And selecting the corresponding gesture in the action space corresponding to the maximum vector value coordinate in the tensor map as the predicted grabbing gesture to be output.
Step 3.4: the network is forward-inferred by the following formula:
[Equations (13)-(16) are reproduced as images in the original publication.]

where equation (13) represents the expected return under state s and action a, g_t denotes the grabbing action taken at time t, s_t denotes the state at time t, and r_t denotes the reward at time t; equation (14) represents the total reward function of the network; equation (15) is the state distribution function; and equation (16) represents the state-action function.
Step 3.5, designing a reinforcement learning network loss function, and further improving the detection precision of the algorithm by adopting a calculation cross entropy function:
[Equation (17) is reproduced as an image in the original publication.]

where τ = s_0 a_0 s_1 a_1 ... s_n a_n ... is the Markov process (the state-action trajectory).
Due to the fact that
[an identity reproduced as an image in the original publication] and P_r{a|s} = π(s, a), equation (18) can be obtained:

[Equation (18) is reproduced as an image in the original publication.]
The weight update function is as follows:
[Equations (19)-(21) are reproduced as images in the original publication.]

where f_ω: S × A → R is an approximation function; when f_ω attains its minimum and Δω = 0, equation (21) can be derived.
Step 3.6: setting reinforcement learning reward: if the crawling is successful, the reward of the agent is assigned to be 1, otherwise, the reward of the agent is assigned to be 0.
Step 4: back-projecting the predicted grabbing posture of the end effector into the three-dimensional reconstructed scene and judging the quality of the predicted grab: the 6-degree-of-freedom grabbing posture output by the network is back-projected into the three-dimensional voxel scene established before grabbing, and a depth image under the predicted posture is captured through view-frustum projection, as shown in figure 4. The view frustum is a three-dimensional volume whose position is related to the camera's extrinsic matrix, and its shape determines how the model is projected from pixel space onto the plane. Perspective projection uses a pyramid as the view frustum, with the camera located at the apex of the pyramid. The pyramid is truncated by two planes, one in front of the other, forming a frustum called the View Frustum; its model is shown in figure 5, and only geometry located inside the frustum is visible. In this clipping space the two planes are special and are called the near clip plane and the far clip plane, respectively.
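For reference, the view-frustum projection described here is the conventional perspective projection used in computer graphics; the sketch below builds the standard matrix from the field of view, aspect ratio and the near/far clip planes. The patent does not give this matrix explicitly, so the concrete form is the usual OpenGL-style convention, not text from the filing.

```python
import numpy as np

def perspective(fov_y_rad, aspect, near, far):
    """Standard perspective projection matrix defined by the near and far clip planes."""
    f = 1.0 / np.tan(fov_y_rad / 2.0)
    m = np.zeros((4, 4))
    m[0, 0] = f / aspect
    m[1, 1] = f
    m[2, 2] = (far + near) / (near - far)
    m[2, 3] = 2.0 * far * near / (near - far)
    m[3, 2] = -1.0
    # a homogeneous point p is inside the frustum when |x|, |y|, |z| <= w after m @ p
    return m
```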
The distance between the parallel clamping jaws and the object is inferred by comparing the depth information at the two ends of the gripper in the rendered image, so that the grab quality of the predicted grabbing posture is judged; this forms the closed-loop feedback of the 6-degree-of-freedom mechanical arm grabbing system.
Step 5: the grabbing motion of the mechanical arm is completed through robot forward and inverse kinematics:
Firstly, when the depth at the two ends of the gripper in the depth image rendered in step 4 is less than or equal to the fingertip depth of the gripper, i.e. the predicted gripper posture straddles the two sides of the object, the predicted posture is judged to be graspable. The mechanical arm is then controlled to execute the grab plan according to the predicted grabbing posture: the 6 joint angles of the mechanical arm in the current state are solved through robot inverse kinematics. The predicted 6-dimensional grabbing posture of the mechanical arm end output by the reinforcement learning network in step 3 is then fed into the robot forward kinematics to solve the motion trajectory that moves the mechanical arm end effector from the current posture to the predicted posture point. When the end effector of the mechanical arm reaches the predicted grabbing posture, the robot sends a gripper closing signal to attempt the grab. After the gripper closes, the end effector moves vertically upward by 15 cm, a depth image after the upward motion is obtained from the Intel RealSense D435 camera, and whether the actual grab succeeded is judged by computing the depth at the two ends of the gripper. When the grab succeeds, the reinforcement learning model reward is assigned 1; when the grab fails, the reinforcement learning network reward is assigned 0.
Step 6: performing reinforcement learning model training so that the mechanical arm completes the grabbing action:
The trained model weight parameters are loaded onto a real UR5 mechanical arm, experimental debugging is carried out, and steps 1 to 5 are repeated to complete the six-degree-of-freedom closed-loop grabbing task of the mechanical arm. Over 200 rounds of grabbing tests, the grabbing success rate of the TSDF-based six-degree-of-freedom mechanical arm grabbing method for unstructured objects is 85.3% (with 10 target objects stacked randomly). When the number of target objects is increased to 15 randomly stacked objects, the grabbing success rate of the invention is still 81.5%. In conclusion, the designed reinforcement learning network overcomes the drawbacks of complex computation and high time cost when the mechanical arm grabbing posture is derived by traditional physical analysis, and solves the problem that 6-DOF grabbing postures cannot be applied to target objects not recorded in the data set. The proposed method not only guarantees a high grabbing success rate for the mechanical arm model but also benefits the generalization of reinforcement learning, can be applied to unrecorded grabbed objects, and addresses the time-consuming computation of traditional methods and the instability caused by partial point cloud models as input.

Claims (7)

1. A mechanical arm six-degree-of-freedom vision closed-loop grabbing method based on TSDF three-dimensional reconstruction is characterized in that: the method comprises the following steps:
step 1: calibrating a mechanical arm base coordinate system and a camera coordinate system by the Zhang Zhengyou camera calibration method and the ArUco Markers calibration method;
step 2: performing three-dimensional scene reconstruction on the acquired image information by using a TSDF function so as to reduce environmental noise points between objects;
step 3: establishing a reinforcement learning network model;
step 4: reversely projecting the predicted grabbing attitude of the end effector into the three-dimensional reconstructed scene, and judging the grabbing quality of this prediction;
step 5: completing the grabbing movement of the mechanical arm through the forward and inverse kinematics of the robot;
step 6: performing reinforcement learning model training to enable the mechanical arm to complete the grabbing action.
2. The mechanical arm six-degree-of-freedom visual closed-loop grabbing method based on TSDF three-dimensional reconstruction as claimed in claim 1, wherein: the step 1 comprises the following specific steps:
Firstly, the Intel D415 depth camera is fixed vertically at the end of the mechanical arm, so that the camera and the clamping jaw of the end effector are related by a fixed pose transformation and the binocular camera can acquire image information of objects on the grabbing operation table. Then, the internal and external parameters of the camera are calibrated by the Zhang Zhengyou camera calibration method and the ArUco Markers calibration method, and a three-dimensional coordinate system with the mechanical arm base as the origin is constructed. The calibration formulas of the camera internal and external parameters are as follows:
[x_c, y_c, z_c, 1]^T = [R, T; 0, 1] · [x_w, y_w, z_w, 1]^T        (1)

x = f·x_c/z_c,   y = f·y_c/z_c        (2)

z_m·[u, v, 1]^T = [f/d_x, 0, u_0; 0, f/d_y, v_0; 0, 0, 1] · [x_c, y_c, z_c]^T        (3)
where x_w, y_w, z_w are the three-dimensional coordinates in the world coordinate system; x_c, y_c, z_c are the coordinates in the camera coordinate system; u and v are the pixel coordinates, with (u_0, v_0) the principal point; z_m is the depth of the point in the camera coordinate system; p is the imaging point in the image coordinate system; f is the imaging distance; d_x, d_y describe the pixel dimensions in the image coordinate system; R and T are the rotation matrix and the translation matrix. Equation (1) is the transformation from the world coordinate system to the camera coordinate system. Equation (2) is the transformation from the camera coordinate system to the ideal image coordinate system. Equation (3) is the matrix from the camera coordinate system to the pixel coordinate system, i.e. the internal reference (intrinsic) matrix.
3. The mechanical arm six-degree-of-freedom visual closed-loop grabbing method based on TSDF three-dimensional reconstruction as claimed in claim 1, wherein: the step 2 comprises the following specific steps:
2.1): in an actual experiment, a single sensor is adopted to obtain visual information, and information (color information and depth information) captured by a binocular camera on a grabbing operating platform is obtained through OpenCV.
2.2): due to lighting, mutual occlusion between objects and other environmental factors, the depth image acquired by the sensor contains significant noise, which affects the capture and prediction of the network model on the image. Firstly, spherical point sampling is performed over the grabbing operation table before each grab, and the depth error of a single image is reduced by sampling local images multiple times. Secondly, the poses of the sampling points and the sampled image information are fused into a three-dimensional reconstruction voxel grid using the TSDF function. A single voxel contains not only the intuitive x, y, z coordinates but also two other values (SDF and CAM) used to represent the voxel-to-surface distance, where SDF is the signed distance function used for depth map fusion: it takes the value 0 on the surface, positive values for points outside the model, negative values for points inside the model, and a magnitude proportional to the distance from the point to the nearest surface. The surface can then easily be extracted as the zero crossing of the function. As shown in figures 2-4, if the depth value D(u, v) of the ith frame depth image is not 0, D(u, v) is compared with the z component of the voxel camera coordinate V(x, y, z): if D(u, v) is greater than z, the voxel is closer to the camera and is reconstructed as the outer surface of the scene; if D(u, v) is less than z, the voxel is farther from the camera and is reconstructed as the interior surface of the scene. That is, when the voxel point is outside the surface (on the camera side) the SDF value is positive, and when the voxel is inside the surface the SDF value is negative.
2.3): the coordinate system where the voxel grid is located is defined as a world coordinate system, and the voxel coordinate can be expressed as
V_j^w = (x, y, z)^T. The pose of the ith camera is T_i^w, and the camera intrinsic (internal reference) matrix is K. The voxel grid in the world coordinate system is therefore projected into the camera coordinate system through the transformation matrix (T_i^w)^(-1) obtained by ICP registration, and converted into the image coordinate system I_ij according to the camera intrinsic matrix K, where i ∈ (0, 1, ..., n), j ∈ (0, 1, ..., m):

I_ij = K·(T_i^w)^(-1)·V_j^w
After projecting the voxel grid into the image coordinate system, the initial SDF value of each voxel is first calculated, as shown in equation (6):

sdf_j = dep(I_ij) − ||V_j^w − t_i^w||_2        (6)

where V_j^w denotes the position of the jth voxel in the world coordinate system, t_i^w denotes the position of the ith camera in the world coordinate system, and dep(I_ij) denotes the depth value of the jth voxel in the ith camera depth image coordinate system.
The sdf value for each voxel is then truncated according to equation (19).
tsdf_i = max(−1, min(1, sdf_i/trunc))        (19)

where tsdf_i is the truncated voxel distance value and trunc denotes the manually set truncation distance, which can be understood as the error bound of the camera depth information: the larger the error, the larger this value should be set, otherwise much of the information acquired by the depth camera may be lost. Here trunc is set to 1.
After truncating the sdf value for each voxel, the tsdf value for each voxel is recalculated according to equation (20):
TSDF_j(x, y, z) = (ω_(j−1)(x, y, z)·tsdf_(j−1) + ω_j(x, y, z)·tsdf_j) / (ω_(j−1)(x, y, z) + ω_j(x, y, z))        (20)

where ω_j(x, y, z) is the weight of the voxel in the current-frame global voxel grid, ω_(j−1)(x, y, z) is the weight of the voxel in the previous-frame voxel grid, tsdf_j is the distance from the voxel in the current-frame global data cube to the object surface, tsdf_(j−1) is the distance from the voxel in the previous-frame global data cube to the object surface, V_z denotes the z-axis coordinate of the voxel in the camera coordinate system, and D(u, v) denotes the depth value at pixel (u, v) of the current-frame depth image.
Equation (9) calculates a weight value for each voxel.
w_j = w_(j−1) + w_j        (9)
And rendering the environment information of the grabbing operation table from top to bottom, and taking the environment information as the input of the network model.
4. The mechanical arm six-degree-of-freedom visual closed-loop grabbing method based on TSDF three-dimensional reconstruction as claimed in claim 1, wherein: the step 3 comprises the following specific steps:
3.1): designing the mechanical arm grabbing strategy and defining the action space for reinforcement learning: target grabbing is commonly divided into a grab position region and a three-dimensional rotation amount, predicted respectively by two network models. This method instead uses a single two-dimensional convolutional network that directly regresses the 6-DoF grabbing posture by increasing the network output dimension. The invention defines the grabbing action space for a single object as 16 × 28 × 28 actions, where 16 is the number of rotation steps of the end effector around the z coordinate axis (22.5° each) and 28 × 28 is the number of rotation steps of the end effector around the x and y coordinate axes. The grabbing actions of the gripper are defined as shown in figure 2.
3.2): converting each element in the array into an angle of rotation of the tail end of the mechanical arm around three coordinate axes of x, y and z respectively, wherein a specific conversion formula is as follows:
a_x = ((best_pix_ind[0] − 14) * 30/28) − pi        (10)
b_y = ((best_pix_ind[1] − 14) * 30/28)             (11)
r_z = (best_pix_ind[2] * 180/16)                   (12)

where a_x denotes the rotation angle of the end of the mechanical arm around the x axis, i.e. the roll angle of the end effector; b_y denotes the rotation angle around the y axis, i.e. the pitch angle of the end effector; r_z denotes the rotation angle around the z axis, i.e. the yaw angle of the end effector; and θ is the deviation value set for the experiment.
3.3): designing the feature extraction network model: the model adopts a ResNet fully convolutional feature extraction architecture to learn to extract unstructured features from RGB-D partial observations. Multi-layer convolutional feature extraction layers extract the features of the color image and the depth image, the color features and depth features are then channel-concatenated with a Concat operation, and 2 additional Batch Normalization (BN) layers, nonlinear activation functions (ReLU), 1×1 convolutional layers and upsampling are used to keep the resolution of the output heatmap consistent with the input image. The pose in the action space corresponding to the coordinate of the maximum value in the output tensor map is selected as the predicted grabbing posture to be output.
3.4): the network is forward-inferred by the following formula:
[Equations (13)-(16) are reproduced as images in the original publication.]

where equation (13) represents the expected return under state s and action a, g_t denotes the grabbing action taken at time t, s_t denotes the state at time t, and r_t denotes the reward at time t; equation (14) represents the total reward function of the network; equation (15) is the state distribution function; and equation (16) represents the state-action function.
And 3.5) designing a reinforcement learning network loss function, and adopting a calculation cross entropy function, thereby further improving the detection precision of the algorithm:
[Equation (17) is reproduced as an image in the original publication.]

where τ = s_0 a_0 s_1 a_1 ... s_n a_n ... is the Markov process (the state-action trajectory).
Due to the fact that
[an identity reproduced as an image in the original publication] and P_r{a|s} = π(s, a), equation (18) can be obtained:

[Equation (18) is reproduced as an image in the original publication.]
The weight update function is as follows:
[Equations (19)-(21) are reproduced as images in the original publication.]

where f_ω: S × A → R is an approximation function; when f_ω attains its minimum and Δω = 0, equation (21) can be derived.
3.6): setting the reinforcement learning reward: if the grab succeeds, the reward of the agent is assigned 1; otherwise it is assigned 0.
5. The mechanical arm six-degree-of-freedom visual closed-loop grabbing method based on TSDF three-dimensional reconstruction as claimed in claim 1, wherein: the step 4 comprises the following specific steps:
back-projecting the predicted grabbing posture of the end effector into the three-dimensional reconstructed scene and judging the quality of the predicted grab: the 6-degree-of-freedom grabbing posture output by the network is back-projected into the three-dimensional voxel scene established before grabbing, and a depth image under the predicted posture is captured through view-frustum projection, as shown in figure 4. The view frustum is a three-dimensional volume whose position is related to the camera's extrinsic matrix, and its shape determines how the model is projected from pixel space onto the plane. Perspective projection uses a pyramid as the view frustum, with the camera located at the apex of the pyramid. The pyramid is truncated by two planes, one in front of the other, forming a frustum called the View Frustum; its model is shown in figure 5, and only geometry located inside the frustum is visible. In this clipping space the two planes are special and are called the near clip plane and the far clip plane, respectively.
The distance between the parallel clamping jaws and the object is inferred by comparing the depth information at the two ends of the gripper in the rendered image, so that the grab quality of the predicted grabbing posture is judged; this forms the closed-loop feedback of the 6-degree-of-freedom mechanical arm grabbing system.
6. The mechanical arm six-degree-of-freedom visual closed-loop grabbing method based on TSDF three-dimensional reconstruction as claimed in claim 1, wherein: the step 5 comprises the following steps:
When the depth at the two ends of the gripper in the rendered depth image is less than or equal to the fingertip depth of the gripper, i.e. the predicted gripper posture straddles the two sides of the object, the predicted posture is judged to be graspable. The mechanical arm is then controlled to execute the grab plan according to the predicted grabbing posture: the 6 joint angles of the mechanical arm in the current state are solved through robot inverse kinematics. The predicted 6-dimensional grabbing posture of the mechanical arm end output by the reinforcement learning network in step 3 is then fed into the robot forward kinematics to solve the motion trajectory that moves the mechanical arm end effector from the current posture to the predicted posture point. When the end effector of the mechanical arm reaches the predicted grabbing posture, the robot sends a gripper closing signal to attempt the grab. After the gripper closes, the end effector moves vertically upward by 15 cm, a depth image after the upward motion is obtained from the Intel RealSense D435 camera, and whether the actual grab succeeded is judged by computing the depth at the two ends of the gripper. When the grab succeeds, the reinforcement learning model reward is assigned 1; when the grab fails, the reinforcement learning network reward is assigned 0.
7. The mechanical arm six-degree-of-freedom visual closed-loop grabbing method based on TSDF three-dimensional reconstruction as claimed in claim 1, wherein: the step 6 comprises the following specific steps:
Steps 2 to 5 are repeated continuously in the CoppeliaSim Edu simulation environment, the loss function of the reinforcement learning model is reduced through the reinforcement learning reward, and the model weight parameters are updated. Finally, the trained weight parameters are loaded onto the real UR5 mechanical arm for experimental debugging, and steps 1 to 5 are repeated to complete the six-degree-of-freedom closed-loop grabbing task of the mechanical arm.
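A minimal sketch of the outer training loop, assuming a PyTorch-style agent; `env`, `agent`, and their methods are illustrative placeholders standing in for the simulation environment and the reinforcement learning network, not the actual implementation.

```python
import torch

def train_in_simulation(env, agent, optimizer, episodes=1000,
                        ckpt_path="grasp_agent.pt"):
    """Repeat steps 2-5 in simulation and update the network weights from
    the binary grasp reward; finally export the weights for the real UR5."""
    for _ in range(episodes):
        tsdf = env.reconstruct_scene()           # step 2: TSDF reconstruction
        grasp = agent.predict_grasp(tsdf)        # step 3: 6-DoF grasp posture
        if env.verify_in_tsdf(grasp):            # step 4: closed-loop check
            reward = env.execute_grasp(grasp)    # step 5: 1 on success, else 0
        else:
            reward = 0.0
        loss = agent.compute_loss(tsdf, grasp, reward)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Trained weights are saved and later loaded onto the real UR5 arm.
    torch.save(agent.state_dict(), ckpt_path)
```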
CN202210551360.5A 2022-05-18 2022-05-18 Mechanical arm six-degree-of-freedom visual closed-loop grabbing method based on TSDF three-dimensional reconstruction Active CN114851201B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210551360.5A CN114851201B (en) 2022-05-18 2022-05-18 Mechanical arm six-degree-of-freedom visual closed-loop grabbing method based on TSDF three-dimensional reconstruction

Publications (2)

Publication Number Publication Date
CN114851201A true CN114851201A (en) 2022-08-05
CN114851201B CN114851201B (en) 2023-09-05

Family

ID=82638419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210551360.5A Active CN114851201B (en) 2022-05-18 2022-05-18 Mechanical arm six-degree-of-freedom visual closed-loop grabbing method based on TSDF three-dimensional reconstruction

Country Status (1)

Country Link
CN (1) CN114851201B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9205562B1 (en) * 2014-08-29 2015-12-08 Google Inc. Integration of depth points into a height map
CN106548466A (en) * 2015-09-16 2017-03-29 富士通株式会社 The method and apparatus of three-dimensional reconstruction object
CN110310362A (en) * 2019-06-24 2019-10-08 中国科学院自动化研究所 High dynamic scene three-dimensional reconstruction method, system based on depth map and IMU
CN112476434A (en) * 2020-11-24 2021-03-12 新拓三维技术(深圳)有限公司 Visual 3D pick-and-place method and system based on cooperative robot
CN112801988A (en) * 2021-02-02 2021-05-14 上海交通大学 Object grabbing pose detection method based on RGBD and deep neural network
CN113192128A (en) * 2021-05-21 2021-07-30 华中科技大学 Mechanical arm grabbing planning method and system combined with self-supervision learning
CN113752255A (en) * 2021-08-24 2021-12-07 浙江工业大学 Mechanical arm six-degree-of-freedom real-time grabbing method based on deep reinforcement learning

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115984388A (en) * 2023-02-28 2023-04-18 江西省智能产业技术创新研究院 Spatial positioning accuracy evaluation method, system, storage medium and computer
CN115984388B (en) * 2023-02-28 2023-06-06 江西省智能产业技术创新研究院 Spatial positioning precision evaluation method, system, storage medium and computer
CN116524217A (en) * 2023-07-03 2023-08-01 北京七维视觉传媒科技有限公司 Human body posture image matching method and device, electronic equipment and storage medium
CN116524217B (en) * 2023-07-03 2023-08-25 北京七维视觉传媒科技有限公司 Human body posture image matching method and device, electronic equipment and storage medium
CN117549307A (en) * 2023-12-15 2024-02-13 安徽大学 Robot vision grabbing method and system in unstructured environment
CN117549307B (en) * 2023-12-15 2024-04-16 安徽大学 Robot vision grabbing method and system in unstructured environment

Also Published As

Publication number Publication date
CN114851201B (en) 2023-09-05

Similar Documents

Publication Publication Date Title
JP6546618B2 (en) Learning apparatus, learning method, learning model, detection apparatus and gripping system
CN114851201B (en) Mechanical arm six-degree-of-freedom visual closed-loop grabbing method based on TSDF three-dimensional reconstruction
CN109934864B (en) Residual error network deep learning method for mechanical arm grabbing pose estimation
CN110298886B (en) Dexterous hand grabbing planning method based on four-stage convolutional neural network
CN113524194A (en) Target grabbing method of robot vision grabbing system based on multi-mode feature deep learning
CN114912287B (en) Robot autonomous grabbing simulation system and method based on target 6D pose estimation
CN111085997A (en) Capturing training method and system based on point cloud acquisition and processing
CN113276106B (en) Climbing robot space positioning method and space positioning system
Taryudi et al. Eye to hand calibration using ANFIS for stereo vision-based object manipulation system
CN111152227A (en) Mechanical arm control method based on guided DQN control
CN112512755A (en) Robotic manipulation using domain-invariant 3D representations predicted from 2.5D visual data
CN114882109A (en) Robot grabbing detection method and system for sheltering and disordered scenes
Gulde et al. RoPose: CNN-based 2D pose estimation of industrial robots
Van Tran et al. BiLuNetICP: A deep neural network for object semantic segmentation and 6D pose recognition
CN113752255B (en) Mechanical arm six-degree-of-freedom real-time grabbing method based on deep reinforcement learning
Schaub et al. 6-dof grasp detection for unknown objects
CN112975968A (en) Mechanical arm simulation learning method based on third visual angle variable main body demonstration video
JP7051751B2 (en) Learning device, learning method, learning model, detection device and gripping system
CN116852353A (en) Method for capturing multi-target object by dense scene mechanical arm based on deep reinforcement learning
CN115578460A (en) Robot grabbing method and system based on multi-modal feature extraction and dense prediction
JP7349423B2 (en) Learning device, learning method, learning model, detection device and grasping system
CN114998573A (en) Grabbing pose detection method based on RGB-D feature depth fusion
CN114882113A (en) Five-finger mechanical dexterous hand grabbing and transferring method based on shape correspondence of similar objects
CN113436293A (en) Intelligent captured image generation method based on condition generation type countermeasure network
Bai et al. Kinect-based hand tracking for first-person-perspective robotic arm teleoperation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant