CN114387513A - Robot grabbing method and device, electronic equipment and storage medium

Info

Publication number
CN114387513A
CN114387513A (application CN202111659492.1A)
Authority
CN
China
Prior art keywords
dimensional
parameters
target object
image
robot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111659492.1A
Other languages
Chinese (zh)
Inventor
胡哲源
刘雪峰
李青锋
牛建伟
任涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Innovation Research Institute of Beihang University
Original Assignee
Hangzhou Innovation Research Institute of Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Innovation Research Institute of Beihang University filed Critical Hangzhou Innovation Research Institute of Beihang University
Priority to CN202111659492.1A priority Critical patent/CN114387513A/en
Publication of CN114387513A publication Critical patent/CN114387513A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a robot grabbing method and device, electronic equipment and a storage medium, and relates to the field of robot intelligent control. Firstly, shooting a target object under a plurality of visual angles by using a camera arranged on a robot to obtain an image set and camera parameters corresponding to each visual angle; then inputting the image set and camera parameters into a pre-trained detection model, predicting the three-dimensional coordinates of key points of the target object to obtain a prediction result, and calculating the pose of the target object according to the spatial geometrical relationship between the prediction result and the key points; and finally, controlling the robot to finish the grabbing of the target object according to the pose of the target object and the pre-trained grabbing model.

Description

Robot grabbing method and device, electronic equipment and storage medium
Technical Field
The invention relates to the field of robot intelligent control, in particular to a robot grabbing method and device, electronic equipment and a storage medium.
Background
With the development of intelligent manufacturing technology and information technology, the intelligent level of the robot is continuously improved, and the robot is widely applied to various industries, particularly in the industrial field, and the use of the industrial robot greatly improves the production efficiency.
In industrial production, robot gripping is one of the most common and basic operations, and the research on improving the accuracy and stability of robot gripping is of great significance for further improving the industrial production efficiency.
Disclosure of Invention
The invention provides a robot grabbing method, a robot grabbing device, an electronic device and a storage medium, which can determine the pose of a target object according to images obtained by shooting the target object under multiple visual angles by a camera and a pre-trained detection model, and control a robot to grab the target object according to a pre-established grabbing model.
Embodiments of the invention may be implemented as follows:
in a first aspect, an embodiment of the present invention provides a robot grabbing method, which is applied to an electronic device, where the electronic device is in communication connection with a robot, and the robot is equipped with a camera; the method comprises the following steps:
acquiring an image set obtained by shooting a target object under a plurality of visual angles by the camera and camera parameters corresponding to each visual angle;
inputting the image set and each camera parameter into a detection model, predicting the image set by using the detection model to obtain a predicted value of a first coordinate of the target object, wherein the first coordinate is a three-dimensional coordinate of a key point of the target object;
calculating a pose parameter of the target object according to the predicted value of the first coordinate and the space geometric relationship between the key points, wherein the pose parameter is used for representing the space position and the posture of the target object;
and controlling the robot to complete the grabbing of the target object according to the pose parameters and the grabbing model, wherein the grabbing model is obtained by training in a reinforcement learning mode.
In one possible implementation, the detection model includes a target detection network, a two-dimensional detection network and a three-dimensional detection network, and the image set includes images corresponding to a plurality of view angles;
the step of inputting the image set and each camera parameter into a detection model, and predicting the image set by using the detection model to obtain a predicted value of the first coordinate of the target object includes:
inputting the image set into the target detection network, and utilizing the target detection network to cut the images in the image set to obtain cut images;
inputting the clipped images into the two-dimensional detection network, and performing feature extraction on each clipped image by using the two-dimensional detection network to obtain a two-dimensional feature map corresponding to each image and a confidence coefficient corresponding to each image, wherein the two-dimensional feature map is used for representing the two-dimensional coordinates of the key points of the target object in the image, and the confidence coefficient is used for representing how truly (i.e., how free of occlusion) the target object appears under the view angle corresponding to each image;
and inputting all the two-dimensional feature maps, the confidence degrees and the camera parameters into the three-dimensional detection network, and processing all the two-dimensional feature maps by using the three-dimensional detection network with the confidence degrees as weights to obtain a predicted value of the first coordinate of the target object.
In one possible embodiment, the three-dimensional detection network comprises a three-dimensional mapping network, a three-dimensional convolution network and a loss function regression model;
the step of inputting all the two-dimensional feature maps, the confidence degrees and the camera parameters into the three-dimensional detection network, and processing all the two-dimensional feature maps by using the three-dimensional detection network with the confidence degrees as weights to obtain the predicted value of the first coordinate of the target object includes:
inputting all the two-dimensional feature maps, the confidence degrees and the camera parameters into the three-dimensional mapping network, and obtaining a three-dimensional feature map of the target object by utilizing a back projection mode, wherein the three-dimensional feature map is used for representing three-dimensional space information of the target object, the three-dimensional feature map comprises a plurality of channels, and each channel corresponds to a key point of each target object one to one;
inputting the three-dimensional characteristic graph into the three-dimensional convolution network, and performing characteristic extraction on the three-dimensional characteristic graph by using the three-dimensional convolution network to obtain a three-dimensional characteristic graph corresponding to each key point;
and inputting the three-dimensional feature maps corresponding to all the key points into the loss function regression model, and performing normalization processing on the three-dimensional feature maps corresponding to all the key points by using the loss function regression model to obtain a predicted value of the first coordinate of the target object.
In a possible implementation, the robot includes a gripper, and the step of controlling the robot to complete the gripping of the target object according to the pose parameters and the gripping model includes:
acquiring state parameters at the moment t, wherein the state parameters comprise pose parameters of the tail end of the clamping jaw, the speed of the tail end of the clamping jaw and the opening and closing state of the clamping jaw;
inputting the state parameters, the pose parameters of the target object and the camera parameters corresponding to the maximum confidence coefficient into the grabbing model to obtain action parameters at the moment t, wherein the action parameters comprise the speed of the tail end of the clamping jaw, the rotating angle of the clamping jaw and the opening and closing state of the clamping jaw;
controlling the robot to move according to the action parameters, finishing the action at the time t, and acquiring the state parameters at the time t + 1;
and repeatedly executing the step of inputting the state parameters, the pose parameters of the target object and the camera parameters corresponding to the maximum confidence into the grabbing model to obtain the action parameters at the moment t until the robot finishes grabbing the target object.
In one possible embodiment, the detection model is trained by:
acquiring a training sample and a label corresponding to the training sample, wherein the training sample comprises a first image and camera parameters, and the label represents the three-dimensional coordinates of the key point of the reference object;
inputting the first image into the target detection network, and utilizing the target detection network to cut the first image to obtain a cut first image;
inputting the clipped first image into the two-dimensional detection network, and performing feature extraction on the clipped first image by using the two-dimensional detection network to obtain a two-dimensional feature map corresponding to the first image and a confidence coefficient corresponding to the first image;
inputting the two-dimensional feature map corresponding to the first image, the confidence coefficient corresponding to the first image and the camera parameters into the three-dimensional detection network, and processing the two-dimensional feature maps corresponding to all the first images by using the three-dimensional detection network with the confidence coefficient corresponding to the first image as a weight to obtain a prediction result of a second coordinate of the reference object, wherein the second coordinate is a three-dimensional coordinate of a key point of the reference object;
and carrying out back propagation training on the detection model based on the prediction result of the second coordinate of the reference object, the label and a preset loss function to obtain the trained detection model.
In one possible embodiment, the loss function combines a two-dimensional loss term and a three-dimensional loss term, where α is the weight of the loss term of the two-dimensional coordinates of the key points, L_2D denotes the loss function of the two-dimensional feature map, and L_3D denotes the loss function of the three-dimensional feature map.
The loss function of the two-dimensional feature map compares F_{n,k}, the predicted value of the two-dimensional feature map, with F*_{n,k}, the label value of the two-dimensional feature map.
The loss function of the three-dimensional feature map compares d_k, the predicted value of the three-dimensional coordinates of a key point, with d*_k, the true value of the three-dimensional coordinates of that key point; γ denotes the weight of an additional term that is used to enhance the accuracy of the prediction result output by the three-dimensional detection network.
In a possible embodiment, the robot further includes a base, and the step of obtaining the training sample and the label corresponding to the training sample includes:
calibrating the camera to obtain a first conversion matrix between a camera coordinate system and a clamping jaw tail end coordinate system;
acquiring a reference image obtained by shooting the reference object under a plurality of visual angles by the camera and camera parameters corresponding to each visual angle;
marking the coordinates of the key points of the reference object in the reference image to obtain the first image;
calculating to obtain a reference three-dimensional coordinate of a key point of the reference object in the camera coordinate system by using a direct linear transformation method according to the first image and the camera parameters;
converting the reference three-dimensional coordinate into a real coordinate under a base coordinate system according to the first conversion matrix and a preset conversion matrix, wherein the preset conversion matrix represents a conversion relation between the clamping jaw tail end coordinate system and the base coordinate system;
and taking the marked reference image and the real coordinates as the training sample and the label corresponding to the training sample, respectively.
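As a rough illustration of the coordinate conversion described above, the sketch below maps a reference key point from the camera coordinate system to the base coordinate system by chaining the first conversion matrix (camera to jaw tip) and the preset conversion matrix (jaw tip to base); the matrix names and shapes are assumptions, not the patent's notation.

```python
import numpy as np

def camera_point_to_base(p_cam, T_ee_cam, T_base_ee):
    """p_cam: (3,) keypoint in camera coords; T_*: 4x4 homogeneous transforms."""
    p_h = np.append(p_cam, 1.0)             # homogeneous coordinates
    p_base = T_base_ee @ T_ee_cam @ p_h     # camera -> jaw tip -> base
    return p_base[:3]
```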
In one possible embodiment, the grasping model is trained by:
acquiring state parameters at the moment t, wherein the state parameters comprise pose parameters of the tail end of the clamping jaw, the speed of the tail end of the clamping jaw and the opening and closing state of the clamping jaw;
inputting the state parameters, the pose parameters of the target object and the camera parameters corresponding to the maximum confidence into the pre-constructed grabbing model, wherein the grabbing model comprises an actor network and a critic network;
obtaining action parameters at the time t by utilizing the actor network of the grabbing model;
controlling the robot to move according to the action parameters, finishing the action at the time t, and acquiring reward parameters;
inputting the reward parameters into the critic network, and predicting a Q value by using the critic network to obtain a prediction result of the Q value, wherein the Q value is used for evaluating the value generated by the robot moving according to the action parameters;
and performing back propagation training on the grabbing model according to the prediction result of the Q value, a preset loss function and a preset gradient function to obtain a trained grabbing model.
In one possible embodiment, the gripping model is constructed by:
determining a state function according to the pose parameters of the tail end of the clamping jaw, the speed of the tail end of the clamping jaw, the opening and closing state of the clamping jaw, the pose parameters of the target object and the camera parameters corresponding to the maximum confidence coefficient; wherein the pose parameters of the jaw tip characterize the spatial position and pose of the jaw tip relative to the base;
determining an action function according to the speed of the tail end of the clamping jaw, the rotating angle of the clamping jaw and the opening and closing state of the clamping jaw;
determining a reward function according to the grabbing result of the robot, the distance between the clamping jaw and the target object and the moving direction of the robot;
and constructing the grabbing model by utilizing a multi-view-based reinforcement learning grabbing algorithm according to the state function, the action function and the reward function.
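Purely as an illustrative sketch of the three ingredients listed above, the snippet below assembles a state vector and an assumed reward shaping for the reinforcement-learning grabbing model; the patent does not specify the exact reward formula, so the constants and function names are hypothetical.

```python
import numpy as np

def make_state(jaw_pose, jaw_vel, jaw_open, obj_pose, cam_param_best):
    # concatenate jaw-tip pose, jaw-tip velocity, gripper state, object pose and
    # the camera parameters of the most confident view into one state vector
    return np.concatenate([jaw_pose, jaw_vel, [float(jaw_open)],
                           obj_pose, np.ravel(cam_param_best)])

def reward(grasp_success, dist_to_object, moving_towards_object):
    # assumed shaping: bonus for success, penalty proportional to the jaw-object
    # distance, small bonus when the jaw moves towards the target object
    r = -dist_to_object
    r += 0.1 if moving_towards_object else -0.1
    if grasp_success:
        r += 10.0
    return r
```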
In a second aspect, an embodiment of the present invention further provides a robot gripping device, which is applied to an electronic device, where the electronic device is in communication connection with a robot, and the robot is equipped with a camera; the device comprises:
the acquisition module is used for acquiring an image set obtained by shooting a target object under a plurality of visual angles by the camera and camera parameters corresponding to each visual angle;
the detection module is used for inputting the image set and each camera parameter into a detection model, predicting the image set by using the detection model to obtain a predicted value of a first coordinate of the target object, wherein the first coordinate is a three-dimensional coordinate of a key point of the target object;
the calculation module is used for calculating a pose parameter of the target object according to the predicted value of the first coordinate and the space geometric relationship between the key points, wherein the pose parameter is used for representing the space position and the posture of the target object;
and the control module is used for controlling the robot to complete the grabbing of the target object according to the pose parameters and the grabbing model, wherein the grabbing model is obtained by training in a reinforcement learning mode.
In a third aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes:
one or more processors;
a memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the above-described robotic grasping method.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the above robot grabbing method.
Compared with the prior art, the robot grabbing method, the robot grabbing device, the electronic equipment and the storage medium provided by the embodiment of the invention have the advantages that firstly, a camera installed on a robot is used for shooting a target object under multiple visual angles, and camera parameters corresponding to an image set and each visual angle are obtained; then inputting the image set and camera parameters into a pre-trained detection model, predicting the three-dimensional coordinates of key points of the target object to obtain a prediction result, and calculating the pose of the target object according to the spatial geometrical relationship between the prediction result and the key points; and finally, controlling the robot to finish the grabbing of the target object according to the pose of the target object and the pre-trained grabbing model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is an application scenario diagram of a robot grasping method according to an embodiment of the present invention.
Fig. 2 is a block diagram of an electronic device according to an embodiment of the present invention.
Fig. 3 is a schematic flow chart of a robot gripping method according to an embodiment of the present invention.
Fig. 4 is an exemplary diagram of a target object coordinate system according to an embodiment of the invention.
Fig. 5 is a schematic flowchart of step S120 in the robot grasping method shown in fig. 3.
Fig. 6 is a schematic structural diagram of a detection model according to an embodiment of the present invention.
Fig. 7 is a schematic structural diagram of a two-dimensional detection network according to an embodiment of the present invention.
Fig. 8 is a schematic structural diagram of a three-dimensional convolution network according to an embodiment of the present invention.
Fig. 9 is a schematic flowchart of step S140 in the robot grasping method shown in fig. 3.
Fig. 10 is a schematic flowchart of a training method of a detection model according to an embodiment of the present invention.
Fig. 11 is a flowchart illustrating step S210 in the training method of the detection model illustrated in fig. 10.
Fig. 12 is a schematic flowchart of a training method for a grab model according to an embodiment of the present invention.
Fig. 13 is a schematic block diagram of a robot gripping device according to an embodiment of the present invention.
Reference numerals: 10 - electronic device; 20 - network; 11 - memory; 12 - processor; 13 - bus; 200 - robot grabbing device; 201 - acquisition module; 202 - detection module; 203 - calculation module; 204 - control module.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
Furthermore, the appearances of the terms "first," "second," and the like, if any, are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
It should be noted that the features of the embodiments of the present invention may be combined with each other without conflict.
With the development of intelligent manufacturing technology and information technology, the intelligent level of the robot is continuously improved, and the robot is widely applied to various industries, particularly in the industrial field, and the use of the industrial robot greatly improves the production efficiency.
In industrial production, robot gripping is one of the most common and basic operations, and the research on improving the accuracy and stability of robot gripping is of great significance for further improving the industrial production efficiency.
In the prior art, a robot grabbing process based on vision mainly comprises two main steps of target object posture identification and robot motion planning and grabbing.
And (3) identifying the posture of the target object:
the gesture recognition of the target object mainly uses a visual recognition mode to estimate the position and the gesture (6DoF, 6 Degree of Freedom,6DoF pose) of the object relative to a certain coordinate system. The most widely researched technology at present is an object 6DoF pose estimation technology, and the 6DoF object pose estimation technology is a technology for determining the position and the posture of an object in a captured scene, wherein position information is mainly expressed by three-dimensional coordinates in a specific space coordinate system; common representations of pose information are 4: a pose transformation matrix, an Euler angle, a rotation quaternion and an axis angle.
The method based on 6DoF object recognition allows the robot to grab objects from different angles in a three-dimensional space, but the method needs an algorithm to give information of six dimensions including the position (three-dimensional space coordinates) and the posture (Euler angle) of the object, and the task cannot be completed only by two-dimensional image information.
In order to obtain the pose of an object, a depth camera is usually used to shoot the object, a color image and a corresponding depth information image of the object are obtained, point cloud information of the shot object is restored, and the 6DoF pose information of the object is solved by matching the point cloud with the existing three-dimensional object model through feature information.
Robot motion planning and grabbing:
the current common robot motion planning and grabbing technology is mainly based on a Dynamic Motion Primitives (DMP) method: the dynamic motion primitive is one of the motion expression forms of the robot which is common at present, and can be used as a motion feedback controller to flexibly adjust motion actions without manually adjusting motion parameters or worrying about the stability of a motion system.
For the gesture recognition of the target object, the object 6DoF pose estimation technology is only suitable for grabbing a known object, the grabbing mode needs to be set in advance, and when an unknown object is encountered, even if the shape of the object is similar to the existing model shape in the model library, the algorithm may fail, so that the object pose detection accuracy is poor. In addition, the object three-dimensional key point detection technology mainly uses a color image and a depth image as input of an algorithm, and detection accuracy of key points may be affected in a complex capture scene (for example, problems of overlapping, shielding and the like exist among objects).
For robot motion and planning, a robot motion planning and grabbing technology based on dynamic motion primitives needs to establish a strict and accurate mathematical model for a robot and a surrounding environment, and although the motion process of the robot is stable and controllable, the generalization capability of a system is general and is generally difficult to be applied to a new environment.
In view of the above problems, the present embodiment provides a robot grabbing method, in which a pose of a target object is determined according to a color image obtained by shooting the target object with a camera under multiple viewing angles, and a robot is controlled to complete grabbing of the target object according to a grabbing model obtained by using reinforcement learning training, so that precision of robot grabbing is improved, and generalization capability is strong.
As described in detail below.
Referring to fig. 1, fig. 1 is a diagram illustrating an application scenario of a robot grabbing method according to an embodiment of the present invention, including an electronic device 10, a robot, a camera, and a target object. The robot and camera are both connected to the electronic device 10 through the network 20; the robot comprises a base, a mechanical arm and a clamping jaw, and the camera is installed at the tail end of the mechanical arm.
A block diagram of the electronic device 10 is shown in Fig. 2. The electronic device 10 may be, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a server, or another electronic device with processing capability. The electronic device 10 includes a memory 11, a processor 12 and a bus 13; the memory 11 and the processor 12 are connected by the bus 13.
The memory 11 is used for storing a program, such as the robot gripping apparatus 200, the robot gripping apparatus 200 includes at least one software functional module which can be stored in the memory 11 in a form of software or firmware (firmware), and the processor 12 executes the program after receiving an execution instruction to implement the robot gripping method in the embodiment.
The memory 11 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The processor 12 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the robot grabbing method in the present embodiment may be implemented by integrated logic circuits of hardware in the processor 12 or by instructions in the form of software.
The processor 12 may be a general-purpose processor, and includes a Central Processing Unit (CPU), a Micro Control Unit (MCU), a Complex Programmable Logic Device (CPLD), a Field Programmable Gate Array (FPGA), and an embedded ARM.
The robot may be a robot with gripping capabilities comprising a base, a robot arm and a gripper, the robot arm being movable in various directions in space.
The camera can be a color camera with a shooting function, is fixed at the tail end of the mechanical arm through a support and can shoot the target object under multiple visual angles through the movement of the mechanical arm to obtain color images under multiple visual angles.
The network 20 may be a wide area network or a local area network, or a combination of both, using wireless links for data transmission.
When the robot grabs the target object, the camera first shoots the target object under a plurality of view angles to obtain the color images and camera poses corresponding to the plurality of view angles, where the camera pose represents the three-dimensional coordinates and rotation angle of the camera relative to the base, and the color images and camera poses are sent to the electronic device 10; after receiving the color images and camera poses, the electronic device 10 controls the robot to complete the grabbing of the target object according to the pre-stored detection model and grabbing model.
On the basis of the electronic device 10 shown in fig. 2, the robot grasping method provided in the present embodiment is described. Referring to fig. 3, fig. 3 is a schematic flowchart illustrating a robot grabbing method according to this embodiment, where the method is applied to an electronic device 10, the electronic device 10 is in communication with a robot, and the robot is equipped with a camera. The method comprises the following steps:
and S110, acquiring an image set obtained by shooting the target object at a plurality of visual angles by the camera and camera parameters corresponding to each visual angle.
In this embodiment, an Intel RealSense D435 camera is used as the visual sensor to capture images. It should be noted that this embodiment only uses this camera to capture color images and does not use the depth information acquired by the camera.
Generally, 4 viewing angles are selected to shoot a target object, and 4 color images obtained by shooting are used as an image set. The selected viewing angles may be, for example, a top viewing angle, a front viewing angle, a left viewing angle, and a right viewing angle, each corresponding to a color image.
The camera parameters characterize the position information and pose information of the camera relative to the base, where the position information is typically represented by three-dimensional coordinates and the pose information by a rotation matrix.
For example, the camera parameters corresponding to the nth view angle may be expressed as:
P_n = [R_n, T_n]
where P_n is a 3 × 4 matrix representing the camera parameters, R_n is a 3 × 3 rotation matrix representing the pose information of the camera, and T_n is a three-dimensional column vector representing the three-dimensional coordinates of the camera relative to the base.
It should be noted that the camera parameters are actually measured, and since the angle of view for shooting is generally 4 fixed angles of view, the camera parameters can be regarded as 4 known quantities stored in the electronic device.
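For concreteness, a minimal sketch of this 3 × 4 representation is given below; the helper name and the example pose are illustrative assumptions, not values from the patent.

```python
import numpy as np

def make_camera_param(R, T):
    """R: (3, 3) rotation of the camera w.r.t. the base; T: (3,) camera position."""
    return np.hstack([R, np.asarray(T, dtype=float).reshape(3, 1)])  # P = [R | T], shape (3, 4)

# e.g. a camera looking straight down from 0.5 m above the base origin (assumed)
P_top = make_camera_param(np.eye(3), [0.0, 0.0, 0.5])
```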
And S120, inputting the image set and each camera parameter into the detection model, and predicting the image set by using the detection model to obtain a predicted value of a first coordinate of the target object, wherein the first coordinate is a three-dimensional coordinate of a key point of the target object.
In this embodiment, the detection model is used to predict the three-dimensional coordinates of the key points of the target object: its inputs are the image at each view angle and the camera parameters corresponding to each view angle, and its output is the predicted values of the three-dimensional coordinates of the key points of the target object.
The key points are used to represent the pose of the target object, for example, the target object is a cube, and the key points may be eight vertices of the cube.
The predicted value of the first coordinate may be represented by a three-dimensional vector; for example, the predicted value of the three-dimensional coordinates of the kth key point is:
[x_k, y_k, z_k]^T
where x_k, y_k and z_k are respectively the x-axis, y-axis and z-axis coordinates of the kth key point in the base coordinate system.
And S130, calculating a pose parameter of the target object according to the predicted value of the first coordinate and the space geometric relationship between the key points, wherein the pose parameter is used for representing the space position and the posture of the target object.
In this embodiment, the spatial geometric relationship between the key points may be obtained by establishing an object coordinate system. Taking a mug as the target object, the coordinate system is established as shown in Fig. 4: the cup mouth is taken as the origin of the coordinate system, the vector from the cup mouth to the cup bottom is computed and taken as the z-axis, the normal vector of the plane formed by the z-axis and the key points of the cup handle is taken as the y-axis, and the direction of the x-axis is determined from the orthogonality of the three axes.
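The short sketch below illustrates how such an object pose could be recovered from predicted key points for the mug example; the key-point names and the convention of returning a rotation matrix are assumptions made for illustration.

```python
import numpy as np

def mug_pose_from_keypoints(rim, bottom, handle):
    """Each argument is a (3,) point in the base frame; returns (origin, 3x3 rotation)."""
    rim, bottom, handle = (np.asarray(p, dtype=float) for p in (rim, bottom, handle))
    z = bottom - rim
    z /= np.linalg.norm(z)              # z-axis: cup mouth -> cup bottom
    v = handle - rim
    y = np.cross(z, v)                  # y-axis: normal of the plane through z and the handle
    y /= np.linalg.norm(y)
    x = np.cross(y, z)                  # x-axis from orthogonality of the three axes
    R = np.stack([x, y, z], axis=1)     # columns are the object axes
    return rim, R                       # origin at the cup mouth
```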
And S140, controlling the robot to complete the grabbing of the target object according to the pose parameters and the grabbing model, wherein the grabbing model is obtained by training in a reinforcement learning mode.
In this embodiment, since the grabbing model is obtained by training in a reinforcement learning manner, after the pose parameters are obtained, the next action of the robot is determined according to the current environment state, which differs from the traditional method based on motion primitives.
Referring to fig. 5, the step S120 is described in detail below, and on the basis of fig. 3, the step S120 may include the following detailed steps:
and S1201, inputting the image set into the target detection network, and cutting the images in the image set by using the target detection network to obtain cut images.
In this embodiment, in order to better understand the processing procedure of the detection model on the image set and the camera parameters, the following description will be made with reference to fig. 6. Referring to fig. 6, fig. 6 shows the structure of the detection model and the processing procedure of the detection model on the image set and the camera parameters.
As shown in fig. 6, the detection model includes an object detection network, a two-dimensional detection network, and a three-dimensional detection network, and the image set includes images corresponding to a plurality of view angles.
The target detection network is realized based on RFB-Net and is used for identifying the position of a target grabbing object in each image, cutting the original image and improving the accuracy of key point detection.
Before the image set is input into the two-dimensional detection network, the images in the image set are preprocessed by the target detection network: the image set is input into the target detection network, and each image is cut by the target detection network to obtain the cut images. Cutting removes the blank parts of the images as much as possible, so that the key points of the target object are displayed more clearly, which facilitates subsequent feature extraction.
RFB-Net provides the concept of a Receptive Field module (RFB), which can enhance the ability of a lightweight convolutional neural network model to learn deep features and construct a rapid and accurate target detector on the basis.
Specifically, the receptive field module performs feature extraction on the image by using a dilation convolution mode by utilizing a multi-branch pooling layer and receptive field convolution kernels with different sizes. And embedding the receptive field module group into the shallow layer of the SSD target detection network to construct an advanced single-stage detector RFB-Net. RFB-Net obtains higher detection precision while guaranteeing the original lightweight detector detection speed.
And S1202, inputting the image set into a two-dimensional detection network, and performing feature extraction on each image by using the two-dimensional detection network to obtain a two-dimensional feature map corresponding to each image and a confidence coefficient corresponding to each image, wherein the two-dimensional feature map is used for representing two-dimensional coordinates of key points of the target object in the image, and the confidence coefficient is used for representing the true degree of the target object under a visual angle corresponding to each image.
In the present embodiment, the two-dimensional coordinates of the keypoints in the image refer to the pixel coordinates in the image.
Referring to fig. 6, after the target detection network, the obtained clipped image is input to the two-dimensional detection network, feature extraction is performed on the clipped image by using the two-dimensional detection network, and a two-dimensional feature map corresponding to each image and a confidence corresponding to each image are output.
The two-dimensional detection network is realized based on a convolutional neural network, and comprises a residual error network, a first neural network and a second neural network, and the structure of the two-dimensional detection network is shown in fig. 7.
The residual network is the backbone network: the cut images are scaled to a uniform size and input into the residual network, which outputs an initial two-dimensional feature map in which the two-dimensional coordinates of the key points are labeled.
Compared with the traditional convolution network and the full-connection network, the residual error network can effectively solve the problem of information error or information loss in the neural network characteristic extraction process, and the completeness of information is ensured by directly connecting the data input into the residual error module to the output of the module in a cross-layer mode. The network only needs to learn the residual error part of input and output, thereby greatly reducing the difficulty of network parameter learning. The residual error network with more layers does not have the problem of gradient disappearance, and the precision of the detection model is greatly improved.
After the residual network, the initial feature maps are processed by two different neural networks. The first neural network is used to predict the confidence score ω_{n,k} of each key point in each image, and the average of the confidence scores of all key points in an image is taken as the confidence of that image under the corresponding view angle. The confidence corresponding to the nth image, i.e., the confidence corresponding to the nth view angle, is:
conf_n = (1/K) · Σ_{k=1}^{K} ω_{n,k}
where K is the number of key points contained in the nth image.
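A tiny illustration of this averaging, and of the "maximum confidence" view used later for grabbing, is given below; the array shapes are assumptions.

```python
import numpy as np

omega = np.random.rand(4, 8)      # (N_views, K) per-keypoint confidence scores (placeholder data)
conf = omega.mean(axis=1)         # (N_views,) per-view confidence conf_n
best_view = int(np.argmax(conf))  # view later treated as the "maximum confidence" view
```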
Taking the different confidences of different view angles as grabbing prior knowledge allows the robot grabbing operation to be designed more effectively: the higher the confidence of a view angle, the less the object is occluded under that view angle, the higher the key-point detection accuracy, and the more favorable the object's posture at that view angle is for robot grabbing.
The second neural network comprises 4 deconvolution layers, the initial two-dimensional feature map is processed by the second neural network to obtain a two-dimensional feature map, each feature map comprises a plurality of channels, and the channels correspond to a plurality of key points in each image.
Compared with the existing networks using upsampling and cross-layer connection, the second neural network in the embodiment of the invention improves the resolution of the feature map in a simpler mode such as deconvolution, accelerates the prediction speed of the two-dimensional detection network, and improves the detection precision.
And S1203, inputting all the two-dimensional feature maps, the confidence degrees and the camera parameters into a three-dimensional detection network, and processing all the two-dimensional feature maps by using the three-dimensional detection network with the confidence degrees as weights to obtain a predicted value of the first coordinate of the target object.
Step S1203 is described in detail below, and step S1203 may include the following detailed steps:
the method comprises the steps of firstly, inputting all two-dimensional feature maps, confidence degrees and camera parameters into a three-dimensional mapping network, and obtaining a three-dimensional feature map of a target object by utilizing a back projection mode, wherein the three-dimensional feature map is used for representing three-dimensional space information of the target object, the three-dimensional feature map comprises a plurality of channels, and each channel corresponds to a key point of each target object one by one.
In this embodiment, the three-dimensional detection network includes a three-dimensional mapping network, a three-dimensional convolution network, and a loss function regression model.
With reference to fig. 6, the two-dimensional feature map and the confidence level output by the two-dimensional detection network and the camera parameters are input into the three-dimensional mapping network to output the three-dimensional feature map of the target object, and it should be noted that the camera parameters are input into the detection model together with the image set, except that the camera parameters are not used in the target detection network and the two-dimensional detection network.
The specific process of processing the two-dimensional feature maps by back projection is as follows:
Given positive integers N_x, N_y and N_z, which respectively represent the numbers of unit cells in the x, y and z directions, a three-dimensional voxel F_3D containing N_x × N_y × N_z cells is constructed centered on the detected object, where the side length of each cell is L_size, so that a point inside the voxel can be represented by coordinates (i_x, i_y, i_z) with i_x ≤ N_x, i_y ≤ N_y and i_z ≤ N_z (in units of L_size).
After back projection, the output three-dimensional feature map has size N_x × N_y × N_z × M, where M is the number of channels of the three-dimensional feature map and equals the number of key points. The kth channel of the three-dimensional feature map is obtained as follows: each spatial point (i_x, i_y, i_z) is projected into the image of camera C_n according to that camera's pose, the corresponding two-dimensional feature value is sampled by bilinear interpolation, denoted F_n{·}_k, giving feature maps of k channels, and the per-view results are fused into a three-dimensional feature map of k channels using the confidences conf_n obtained by the first neural network as weights.
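A hedged sketch of this confidence-weighted back projection is given below. The projection functions, array shapes and the use of nearest-neighbour sampling (instead of the bilinear interpolation described above) are simplifying assumptions.

```python
import numpy as np

def backproject(feat_2d, conf, proj_fns, grid):
    """
    feat_2d: list of N (K, H, W) two-dimensional feature maps, one per view.
    conf:    (N,) per-view confidences used as fusion weights.
    proj_fns: list of N functions mapping a 3D point to pixel coordinates (u, v).
    grid:    (Nx, Ny, Nz, 3) world coordinates of the voxel centers.
    Returns a (K, Nx, Ny, Nz) three-dimensional feature volume.
    """
    K = feat_2d[0].shape[0]
    Nx, Ny, Nz, _ = grid.shape
    vol = np.zeros((K, Nx, Ny, Nz))
    for n, (F, project) in enumerate(zip(feat_2d, proj_fns)):
        H, W = F.shape[1:]
        for ix in range(Nx):
            for iy in range(Ny):
                for iz in range(Nz):
                    u, v = project(grid[ix, iy, iz])       # project voxel center into view n
                    ui, vi = int(round(u)), int(round(v))  # nearest-neighbour sampling for brevity
                    if 0 <= vi < H and 0 <= ui < W:
                        vol[:, ix, iy, iz] += conf[n] * F[:, vi, ui]
    return vol
```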
And secondly, inputting the three-dimensional characteristic graph into a three-dimensional convolution network, and performing characteristic extraction on the three-dimensional characteristic graph by using the three-dimensional convolution network to obtain the three-dimensional characteristic graph corresponding to each key point.
In this embodiment, after back projection, feature extraction needs to be performed on the voxel volume F_3D that blends the features of the multiple view angles. Since F_3D is a feature representation of three-dimensional spatial information, the embodiment of the present invention performs the convolution operation on it using a three-dimensional convolution network.
Because F_3D is a multi-channel three-dimensional feature map, a multi-channel three-dimensional convolution network is constructed as the feature extractor, using a three-dimensional convolution structure similar to V2V-PoseNet. V2V-PoseNet is a neural network architecture for three-dimensional pose estimation that takes a voxelized three-dimensional feature map as input and outputs the spatial position of each key point together with a corresponding probability estimate.
In view of the characteristics of F_3D, the network structure of the original V2V-PoseNet is improved to make it better suited to the key-point detection task of the embodiment of the invention. The network comprises four types of modules. The first type is a basic three-dimensional convolution module, consisting of a three-dimensional convolution layer, a batch normalization layer and a ReLU activation function; these modules are located at the head and tail of the network. The second type is a three-dimensional residual module, obtained by expanding a two-dimensional residual network in the depth dimension. The third type is a down-sampling module, mainly comprising a three-dimensional convolution layer and a max-pooling layer. The last type is an up-sampling module, consisting of a three-dimensional deconvolution layer, a batch normalization layer and a ReLU activation function.
Introducing a batch normalization layer and an activation function after the three-dimensional deconvolution layer helps to simplify the learning process of the network model. The convolution kernel size of the basic three-dimensional convolution module is 7 × 7, and it outputs feature maps of 64 channels; the convolution kernel size of the three-dimensional residual module is 3 × 3; the convolution kernel size of the down-sampling and up-sampling modules is 2 × 2 × 2, with a convolution stride of 2.
The structure of the three-dimensional convolution network in this embodiment is similar to an hourglass model and is shown in Fig. 8: the input three-dimensional feature map first passes through a basic three-dimensional convolution module and a down-sampling module, then three-dimensional residual modules are used in sequence to extract effective local features of the object, and the result then passes through an encoder and a decoder composed of up-sampling and down-sampling modules.
In the encoder, the down-sampling module reduces the size of the three-dimensional feature map through convolution operation, but increases the number of feature channels, and the increase of the number of the feature map channels corresponds to the increase of extracted features, which is beneficial to improving the detection performance of the network; in the decoder, the up-sampling module expands the size of the three-dimensional feature map and simultaneously reduces the number of channels to compress the extracted features, and the expansion of the size of the three-dimensional feature map in the decoder is helpful for a network to determine the spatial information of key points.
The network layers of the encoder and the decoder are connected by voxel-wise addition of feature maps of the same scale, so that the decoder can up-sample the feature maps more stably. After the input features pass through the encoder and the decoder, the three-dimensional feature map of the kth key point is output and denoted V_k.
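A hedged PyTorch sketch of the four module types described above is given below; channel counts and kernel sizes follow the text where stated, and everything else (padding, and folding the down-sampling into a strided convolution) is an assumption.

```python
import torch
import torch.nn as nn

def basic3d(c_in, c_out, k=7):
    # basic 3D conv module: conv + batch norm + ReLU (head and tail of the network)
    return nn.Sequential(nn.Conv3d(c_in, c_out, k, padding=k // 2),
                         nn.BatchNorm3d(c_out), nn.ReLU(inplace=True))

class Residual3D(nn.Module):
    # 3D residual module: a 2D residual block expanded to the depth dimension
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(nn.Conv3d(c, c, 3, padding=1), nn.BatchNorm3d(c),
                                  nn.ReLU(inplace=True), nn.Conv3d(c, c, 3, padding=1),
                                  nn.BatchNorm3d(c))
    def forward(self, x):
        return torch.relu(self.body(x) + x)   # cross-layer (residual) connection

def down3d(c):
    # down-sampling: 2x2x2 conv with stride 2 that doubles channels; the max-pooling
    # layer mentioned in the text is folded into the strided conv here for brevity
    return nn.Sequential(nn.Conv3d(c, 2 * c, 2, stride=2), nn.BatchNorm3d(2 * c),
                         nn.ReLU(inplace=True))

def up3d(c):
    # up-sampling: 2x2x2 transposed conv with stride 2, batch norm and ReLU
    return nn.Sequential(nn.ConvTranspose3d(c, c // 2, 2, stride=2),
                         nn.BatchNorm3d(c // 2), nn.ReLU(inplace=True))
```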
And thirdly, inputting the three-dimensional characteristic graphs corresponding to all the key points into a loss function regression model, and performing normalization processing on the three-dimensional characteristic graphs corresponding to all the key points by using the loss function regression model to obtain a predicted value of the first coordinate of the target object.
In this embodiment, the loss-function regression model is a softmax. The detection probabilities of all object key points are normalized by the softmax, and the normalized result is denoted V_k':
V_k'(i_k) = exp(V_k(i_k)) / Σ_{i_k ∈ W × H × D} exp(V_k(i_k))
where W, H and D respectively represent the width, height and channel depth of the three-dimensional feature map, i_k is a voxel coordinate in the kth key point's feature map, and exp(·) is the exponential operation. Finally, the three-dimensional coordinates of the kth key point of the target object are obtained from V_k' as the probability-weighted voxel coordinate:
d_k = Σ_{i_k ∈ W × H × D} i_k · V_k'(i_k)
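A sketch of this normalization-and-regression step (a soft-argmax over the voxel volume) is shown below; the tensor layout is an assumption.

```python
import torch

def soft_argmax_3d(V):
    """V: (K, W, H, D) per-keypoint 3D feature volumes -> (K, 3) voxel coordinates."""
    K, W, H, D = V.shape
    prob = torch.softmax(V.reshape(K, -1), dim=1).reshape(K, W, H, D)   # V_k'
    xs = torch.arange(W, dtype=V.dtype)
    ys = torch.arange(H, dtype=V.dtype)
    zs = torch.arange(D, dtype=V.dtype)
    x = (prob.sum(dim=(2, 3)) * xs).sum(dim=1)   # expected x index
    y = (prob.sum(dim=(1, 3)) * ys).sum(dim=1)   # expected y index
    z = (prob.sum(dim=(1, 2)) * zs).sum(dim=1)   # expected z index
    return torch.stack([x, y, z], dim=1)
```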
after the three-dimensional coordinates of each key point are obtained, the pose parameters of the target object are obtained through step S130, and then the robot is controlled to complete the grabbing of the target object according to the pose parameters and the grabbing model. As described in detail below.
On the basis of fig. 3, please refer to fig. 9, step S140 may include the following detailed steps:
and S1401, acquiring state parameters at the moment t, wherein the state parameters comprise pose parameters of the tail end of the clamping jaw, the speed of the tail end of the clamping jaw and the opening and closing state of the clamping jaw.
In this embodiment, the state parameters at time t are used to characterize the state of the robot at time t and can be represented as a set comprising: the pose parameter of the jaw tip at time t, given by the Cartesian coordinates of the jaw tip in the base coordinate system together with a rotation quaternion describing the attitude of the jaw tip; the velocity of the jaw tip; and the open/closed state of the jaws g ∈ {0, 1}, where 0 means the jaws are open and 1 means the jaws are closed (in the default state the jaws are open).
And S1402, inputting the state parameters, the pose parameters of the target object and the camera parameters corresponding to the maximum confidence coefficient into the grabbing model to obtain action parameters at the time t, wherein the action parameters comprise the speed of the tail end of the clamping jaw, the rotation angle of the clamping jaw and the opening and closing state of the clamping jaw.
In this embodiment, the camera parameter corresponding to the maximum confidence is the camera pose corresponding to conf_max, the maximum value of the confidences conf_n (n ∈ {1, 2, …, N}) over all view angles, and is denoted P_cmax.
The action parameters are used to characterize the motion that the robot needs to complete and can be represented as a set comprising: the velocity of the robot's jaw tip, i.e., the position of the jaw tip at the next moment relative to its position at the current moment; the rotation angle θ of the jaw tip; and the open/closed state g of the jaws.
And S1403, controlling the robot to move according to the motion parameters, finishing the motion at the time t, and acquiring the state parameters at the time t + 1.
Step S1402 is repeatedly executed until the robot finishes the grasping of the target object.
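As an illustration of the loop formed by steps S1401 to S1403, the sketch below feeds the current state, the object pose and the best-view camera parameters to the trained grabbing model at each time step and executes the returned action until the grasp is completed; all function and method names here are illustrative assumptions.

```python
def run_grasp(robot, policy, obj_pose, cam_param_best, max_steps=200):
    for _ in range(max_steps):
        state = robot.get_state()                         # jaw pose, jaw velocity, open/closed
        action = policy(state, obj_pose, cam_param_best)  # jaw velocity, rotation angle, open/closed
        robot.execute(action)                             # complete the action for this time step
        if robot.grasp_succeeded():
            return True
    return False
```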
The whole process of the robot grabbing method has been introduced above. The method uses two models, a detection model and a grabbing model, both obtained by training in advance, and only a trained model can actually be used for prediction. In plain terms, model training is the process of continuously updating the parameters in a model; labeled training data are needed during training so that the model fits the data well, and the trained model can then make predictions on unlabeled data or on data it has never seen before.
On the basis of the above, the following describes the training method of the detection model and the capture model in detail.
Referring to fig. 10, the training method of the detection model may include the following steps:
s210, obtaining a training sample and a label corresponding to the training sample, wherein the training sample comprises a first image and camera parameters, and the label represents a three-dimensional coordinate of a key point of a reference object.
In this embodiment, the training samples are used for training the detection model, and after the training samples are input into the detection model, the detection model outputs a prediction result. The label is used for evaluating the accuracy of the prediction result, and the closer the prediction result and the label are, the more accurate the prediction result is, namely, the better the model training effect is.
The first image is obtained by shooting a reference object under a plurality of visual angles by a camera, the number of camera parameters is multiple, and the plurality of camera parameters correspond to the plurality of visual angles in a one-to-one mode.
It should be noted that there may be a plurality of reference objects, the obtained first images are in multiple groups, and each group of first images includes first images corresponding to multiple viewing angles; correspondingly, the number of the labels is also multiple, and each label corresponds to each reference object one by one.
S220, inputting the first image into a target detection network, and utilizing the target detection network to cut the first image to obtain the cut first image.
In this embodiment, corresponding to step S1201, the cropping processing of the first image by the target detection network is similar to the cropping processing of the image set, and details thereof are not repeated here.
In order to shorten the training time of the detection model, this embodiment fine-tunes the target detection network on the basis of existing pre-trained network weights using the training samples and their corresponding labels. The training samples and their labels are divided into a training set and a test set at a ratio of 8:2. The network model is trained on an Ubuntu 16.04 system, and the first images used for training and testing are uniformly scaled to the same resolution after an image enhancement preprocessing operation based on histogram equalization.
Meanwhile, this embodiment uses data augmentation such as random cropping and random expansion to enhance the robustness of the detection model. During model training, the per-batch data sample size (batch-size) is set to 32, training runs for 300 epochs in total, the initial learning rate is 0.004, and the learning rate is reduced to one tenth of its previous value at the 120th, 180th and 280th epochs. The weight-decay value is set to 0.0001 and the momentum factor (momentum) to 0.9. To prevent gradient explosion, this embodiment uses a model "warm-up" strategy, i.e. the learning rate of the first 10 training epochs is gradually increased from 0.0001 to 0.004.
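The schedule above (warm-up followed by step decay) can be sketched as below; this assumes a PyTorch SGD optimizer and is only an illustration of the stated hyperparameters, not the patent's actual training code.

```python
import torch

# Sketch of the fine-tuning schedule described above: base LR 0.004, decay by 10x at
# epochs 120/180/280, momentum 0.9, weight decay 1e-4, and a linear warm-up from
# 0.0001 over the first 10 epochs. `model` is assumed to be the target detection
# network being fine-tuned from existing weights.
def build_optimizer(model):
    return torch.optim.SGD(model.parameters(), lr=0.004, momentum=0.9, weight_decay=0.0001)

def learning_rate(epoch, base_lr=0.004, warmup_start=0.0001, warmup_epochs=10):
    if epoch < warmup_epochs:                      # "warm-up" to avoid gradient explosion
        return warmup_start + (base_lr - warmup_start) * epoch / warmup_epochs
    lr = base_lr
    for milestone in (120, 180, 280):              # step decay to one tenth
        if epoch >= milestone:
            lr *= 0.1
    return lr

def set_lr(optimizer, epoch):
    for group in optimizer.param_groups:
        group["lr"] = learning_rate(epoch)
```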
And S230, inputting the first image into a two-dimensional detection network, and performing feature extraction on the first image by using the two-dimensional detection network to obtain a two-dimensional feature map corresponding to the first image and a confidence coefficient corresponding to the first image.
In this embodiment, the process of extracting the feature of the first image by using the two-dimensional detection network is similar to S1202, and is not described herein again.
And S240, inputting the two-dimensional feature map corresponding to the first image, the confidence coefficient corresponding to the first image and the camera parameters into a three-dimensional detection network, and processing the two-dimensional feature maps corresponding to all the first images by using the three-dimensional detection network with the confidence coefficient corresponding to the first image as a weight to obtain a prediction result of a second coordinate of the reference object, wherein the second coordinate is the three-dimensional coordinate of the key point of the reference object.
In this embodiment, the process of processing the two-dimensional feature maps corresponding to all the first images by using the three-dimensional detection network is similar to that in step S1203, and details are not repeated here.
In the training process of the detection model, in the back-projection stage, the two-dimensional feature maps of the different view angles are filled, according to the camera parameters of each view angle, into a detection space containing 64 × 64 × 64 unit cells (the side length of each unit cell is L_size = 0.375 cm, i.e. the space measures 24 cm × 24 cm × 24 cm). The whole detection model is trained with an Adam optimizer, the initial learning rate is set to 0.001, and a decay strategy with decay coefficient 0.3 is applied every 20 training epochs. The model is trained for 200 epochs, with 8 samples per training batch.
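As an illustration of the confidence-weighted back-projection just described, the following is a rough NumPy sketch; the grid origin, the rounding-based sampling and the normalization of the confidences are assumptions made for clarity, not details given in the patent.

```python
import numpy as np

# Illustrative sketch: accumulate 2D feature maps from N views into a 64x64x64 voxel
# volume using each view's 3x4 camera matrix, weighted by the per-view confidence.
# feat_2d: (N, K, H, W); cams: (N, 3, 4); confs: (N,). Units are metres (0.375 cm cells).
def backproject(feat_2d, cams, confs, grid=64, cell=0.00375, origin=(-0.12, -0.12, 0.0)):
    N, K, H, W = feat_2d.shape
    idx = (np.arange(grid) + 0.5) * cell                      # voxel-centre offsets
    xs, ys, zs = np.meshgrid(idx + origin[0], idx + origin[1], idx + origin[2], indexing="ij")
    pts = np.stack([xs, ys, zs, np.ones_like(xs)], axis=-1).reshape(-1, 4)   # (G^3, 4)

    volume = np.zeros((K, grid, grid, grid), dtype=np.float32)
    weights = confs / (confs.sum() + 1e-8)                    # confidence as weight
    for n in range(N):
        proj = pts @ cams[n].T                                # homogeneous image coords
        u = np.round(proj[:, 0] / proj[:, 2]).astype(int)
        v = np.round(proj[:, 1] / proj[:, 2]).astype(int)
        valid = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (proj[:, 2] > 0)
        for k in range(K):
            vox = np.zeros(pts.shape[0], dtype=np.float32)
            vox[valid] = feat_2d[n, k, v[valid], u[valid]]    # sample the 2D feature map
            volume[k] += weights[n] * vox.reshape(grid, grid, grid)
    return volume
```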
For the three-dimensional convolution network in the three-dimensional detection network, three-dimensional data augmentation is used to improve its robustness, implemented in the following two ways:
First, random volume embedding: similar to the random cropping of a two-dimensional image, the three-dimensional feature map is randomly embedded into the detection space S (S ∈ N_x × N_y × N_z).
Second, random rotation: the algorithm randomly rotates the three-dimensional feature map by 0°, 90°, 180° or 270° about the vertical axis; when the random rotation operation is used, only the parameters of the three-dimensional convolution module are updated during back-propagation of the detection model.
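A minimal sketch of these two 3D augmentations is given below, assuming the feature volume is a PyTorch tensor of shape (K, D, H, W); the choice of rotation axes and the padding layout are illustrative assumptions.

```python
import torch

# Sketch of the two 3D augmentations described above; shapes and axis choices are assumptions.
def random_rotation(volume):
    # Rotate by 0/90/180/270 degrees about the vertical axis (here: in the last two dims).
    k = int(torch.randint(0, 4, (1,)))
    return torch.rot90(volume, k, dims=(-2, -1))

def random_volume_embedding(volume, space=(64, 64, 64)):
    # Randomly place the feature volume inside a larger detection space S.
    K, d, h, w = volume.shape
    out = torch.zeros((K, *space), dtype=volume.dtype)
    x = int(torch.randint(0, space[0] - d + 1, (1,)))
    y = int(torch.randint(0, space[1] - h + 1, (1,)))
    z = int(torch.randint(0, space[2] - w + 1, (1,)))
    out[:, x:x + d, y:y + h, z:z + w] = volume
    return out
```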
And S250, performing back propagation training on the detection model based on the prediction result of the second coordinate of the reference object, the label and a preset loss function to obtain the trained detection model.
In this embodiment, the back propagation training may be understood as a method for updating parameters of the detection model, and after multiple times of back propagation training, when an error between a prediction result of the detection model and a label is smaller than a preset value, the training of the model is completed.
It should be noted that, although the detection model includes a plurality of sub-networks, the entire detection model uses an end-to-end training mode, that is, a training sample and a label corresponding to the training sample are input, and a prediction result of the second coordinate is output.
Thus, the loss function of the detection model can be expressed as:

L = α · L_2D + L_3D

wherein α is the weight of the loss term of the two-dimensional coordinates of the keypoints, L_2D is the loss function of the two-dimensional feature map, and L_3D is the loss function of the three-dimensional feature map.
The loss function of the two-dimensional feature map is:

L_2D = Σ_{n=1}^{N} Σ_{k=1}^{K} || F_{n,k} − F̂_{n,k} ||²

wherein F_{n,k} represents the predicted value of the two-dimensional feature map for keypoint k at view n, and F̂_{n,k} represents the corresponding label value of the two-dimensional feature map.
The loss function of the three-dimensional feature map is:

L_3D = Σ_{k=1}^{K} || d_k − d̂_k || + γ · L_reg

wherein d_k represents the predicted value of the three-dimensional coordinates of keypoint k, d̂_k represents the true value of the three-dimensional coordinates of the keypoint, γ represents a weight, and the γ-weighted term L_reg is used to enhance the accuracy of the prediction result output by the three-dimensional detection network.
In the present embodiment, the loss function L_2D enables the two-dimensional detection network to extract useful keypoint pixel features from the first image of each viewing angle and to accurately predict the two-dimensional coordinates of the keypoints. In the loss function L_3D, the term || d_k − d̂_k || represents the loss of each keypoint and is used for predicting the three-dimensional spatial information of the object keypoints; the weight γ may be set to 0.01.
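The combined loss can be sketched as follows; this assumes squared-error penalties for both terms (the patent names the terms but does not spell out the exact norms), so the function bodies are illustrative rather than the patent's exact loss.

```python
import torch

# Minimal sketch of the combined detection loss, under the assumption that both the
# 2D feature-map term and the 3D keypoint term are squared-error penalties.
def detection_loss(f2d_pred, f2d_gt, kp3d_pred, kp3d_gt, feat3d_term, alpha=1.0, gamma=0.01):
    loss_2d = ((f2d_pred - f2d_gt) ** 2).mean()                 # 2D feature-map loss
    loss_kp = ((kp3d_pred - kp3d_gt) ** 2).sum(dim=-1).mean()   # per-keypoint 3D coordinate loss
    loss_3d = loss_kp + gamma * feat3d_term                     # gamma-weighted auxiliary 3D term
    return alpha * loss_2d + loss_3d
```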
Referring to fig. 11, step S210 is described in detail below, and on the basis of fig. 10, step S210 may include the following detailed steps:
s2101, the camera is calibrated, and a first conversion matrix between a camera coordinate system and a clamping jaw end coordinate system is obtained.
In this embodiment, since the camera is fixed to the end of the robot arm, an eye-in-hand calibration method is adopted. In this configuration the positional relationship between the camera coordinate system and the jaw-end coordinate system is fixed, while the positional relationship between the camera coordinate system and the base coordinate system changes constantly, so the transformation matrix from the camera coordinate system to the jaw-end coordinate system needs to be solved. According to the pose relationship of the coordinate systems, the following transformation equation can be written:

p_base = T_end^base · p_end = T_end^base · T_cam^end · p_camera

wherein p_base denotes the pose matrix from a point on the calibration plate (fixed at a certain position) to the base coordinate system; T_end^base denotes the transformation matrix between the jaw-end coordinate system and the base coordinate system, which is a known quantity; p_end denotes the pose matrix from the calibration-plate point to the jaw-end coordinate system; T_cam^end denotes the transformation matrix between the camera coordinate system and the jaw-end coordinate system, i.e. the quantity to be solved; and p_camera denotes the pose matrix from the calibration-plate point to the camera coordinate system.
Keeping the relative position of the base and the calibration plate unchanged, the transformation matrix between the camera and the jaw-end coordinate system is solved through multiple movements of the mechanical arm to different viewing angles. Test verification shows that when the re-projection error of the calibrated camera is about 0.2-0.3 mm, the requirement for robot object grasping is met.
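One possible way to solve this eye-in-hand problem is OpenCV's hand-eye calibration routine, shown in the sketch below; using `cv2.calibrateHandEye` with the Tsai method is an assumption for illustration, not necessarily the solver used in the patent.

```python
import cv2
import numpy as np

# Sketch of eye-in-hand calibration with OpenCV. R_end2base/t_end2base come from the
# robot's forward kinematics at each of the arm poses; R_board2cam/t_board2cam come
# from detecting the calibration board in the corresponding camera images.
def calibrate_eye_in_hand(R_end2base, t_end2base, R_board2cam, t_board2cam):
    R_cam2end, t_cam2end = cv2.calibrateHandEye(
        R_end2base, t_end2base, R_board2cam, t_board2cam,
        method=cv2.CALIB_HAND_EYE_TSAI)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R_cam2end, t_cam2end.ravel()
    return T    # first conversion matrix: camera frame -> jaw-end frame
```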
S2102, a reference image obtained by shooting a reference object at a plurality of view angles by a camera and a camera parameter corresponding to each view angle are acquired.
In this embodiment, as mentioned above, generally, 4 view angles are selected to shoot the reference object, so as to obtain the reference images corresponding to the 4 view angles and the camera parameters corresponding to the 4 view angles.
S2103, the key point coordinates of the reference object in the reference image are marked to obtain a first image.
In this embodiment, after acquiring reference images of multiple viewing angles and corresponding camera parameters, it is first necessary to manually label two-dimensional pixel coordinates of key points of a reference object in each reference image.
In this embodiment, the Labelme tool is used for keypoint annotation. Labelme is a powerful two-dimensional image annotation program: besides keypoints, it can also annotate object class labels, object detection bounding boxes and instance segmentation masks. The image to be annotated can be labeled conveniently with mouse clicks and drags, and the software automatically generates annotation files in the corresponding format, which facilitates reading the ground-truth (GT) labels during neural network model training.
And S2104, calculating to obtain a reference three-dimensional coordinate of the key point of the reference object in a camera coordinate system by using a direct linear transformation method according to the first image and the camera parameters.
In this embodiment, for the n ∈ {1, 2, …, N} viewing angles, the two-dimensional pixel coordinates of the keypoints of the reference object and the corresponding camera parameters are known:

P_n = [R_n, T_n]

As previously described, P_n is a 3 × 4 matrix.

In order to obtain the true three-dimensional coordinates of the reference-object keypoints, this embodiment uses the Direct Linear Transformation (DLT) method to recover, from the two-dimensional keypoint coordinates of the n views, t_n = [u_n, v_n, 1]^T, the three-dimensional coordinates w = [x, y, z, 1]^T.

The n views can be expressed as:

λ_n · u_n = [R_{n,1} | T_{n,1}] · w,  λ_n · v_n = [R_{n,2} | T_{n,2}] · w,  λ_n = [R_{n,3} | T_{n,3}] · w

wherein λ_n, the depth value of the three-dimensional coordinate point, is an unknown quantity, and [R_{n,1}|T_{n,1}], [R_{n,2}|T_{n,2}], [R_{n,3}|T_{n,3}] represent the 1st, 2nd and 3rd rows of the camera parameter matrix, respectively. After eliminating λ_n and rearranging into a matrix equation, the reference three-dimensional coordinates of the keypoints of the reference object in the camera coordinate system are finally solved through matrix Singular Value Decomposition (SVD).
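The elimination and SVD step can be sketched as standard DLT triangulation; the NumPy implementation below is an illustration of that textbook procedure, not code from the patent.

```python
import numpy as np

# Sketch of DLT triangulation: each view contributes two linear equations obtained by
# eliminating the depth lambda_n; the homogeneous 3D point is the SVD null-space vector.
def triangulate_dlt(uv, cams):
    """uv: (N, 2) pixel coordinates of one keypoint; cams: (N, 3, 4) camera matrices."""
    rows = []
    for (u, v), P in zip(uv, cams):
        rows.append(u * P[2] - P[0])   # from lambda*u = P_row1 . w and lambda = P_row3 . w
        rows.append(v * P[2] - P[1])   # from lambda*v = P_row2 . w
    A = np.stack(rows)                 # (2N, 4)
    _, _, vt = np.linalg.svd(A)
    w = vt[-1]
    return w[:3] / w[3]                # reference 3D coordinate in the camera frame
```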
S2105, converting the reference three-dimensional coordinate into a real coordinate under a base coordinate system according to the first conversion matrix and a preset conversion matrix, wherein the preset conversion matrix represents a conversion relation between a clamping jaw tail end coordinate system and the base coordinate system.
S2106, the labeled reference image and the real coordinate are respectively used as a training sample and a label corresponding to the training sample.
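Step S2105 amounts to chaining the two transforms; a minimal sketch is given below, where the matrix names are assumptions chosen to match the description above.

```python
import numpy as np

# Sketch of step S2105: chain the calibrated camera->jaw-end matrix with the
# jaw-end->base matrix (from robot kinematics) to express a keypoint in the base frame.
def camera_point_to_base(p_cam, T_cam2end, T_end2base):
    p_h = np.append(p_cam, 1.0)                    # homogeneous coordinate
    return (T_end2base @ T_cam2end @ p_h)[:3]      # real coordinate in the base frame
```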
Referring to fig. 12, the training method of the grab model may include the following steps:
and S310, acquiring state parameters at the moment t, wherein the state parameters comprise pose parameters of the tail end of the clamping jaw, the speed of the tail end of the clamping jaw and the opening and closing state of the clamping jaw.
And S320, inputting the state parameters, the pose parameters of the target object and the camera parameters corresponding to the maximum confidence coefficient into a pre-constructed grabbing model, wherein the grabbing model comprises an actor network and a critic network.
In this embodiment, the grasping model includes one actor network and two critic networks, together with their corresponding target networks used to stabilize training, i.e. 6 neural networks in total.
And S330, obtaining the action parameters at the time t by using the actor network of the captured model.
And S340, controlling the robot to move according to the motion parameters, finishing the motion at the time t, and acquiring reward parameters.
In this embodiment, the robot inputs the state S_t observed at the current moment into the actor network, which outputs the corresponding decision action A_t; the robot thereby completes one movement and obtains the reward parameter R_t at the current moment and the state S_{t+1} at the next moment. The algorithm stores the tuple (S_t, A_t, R_t, S_{t+1}) in an experience pool as one experience, and randomly selects a batch of experiences to update the parameters of the grasping model during neural network training.
The reward parameter R_t can be written as R(S_t, A_t), which denotes the reward value generated after the robot selects and executes action A_t in the current state S_t.
And S350, inputting the reward parameters into the critic network, and predicting the Q value by using the critic network to obtain a prediction result of the Q value, wherein the Q value is used for evaluating the value generated by the robot moving according to the action parameters.
In this embodiment, the Q value is denoted Q(S_t, A_t) and means the expected value of the sum of reward parameters obtained from the current state S_t, after selecting and executing action A_t, until the end of the task (the final state); it is used to assess the value of the action.
And S360, performing back propagation training on the grasping model according to the prediction result of the Q value, a preset loss function and a preset gradient function to obtain the trained grasping model.
In this embodiment, the two critic networks and their corresponding target networks are denoted δ_i and δ'_i, i ∈ {1, 2}. Because the network parameters are initialized differently, the two critic networks predict Q values of different sizes. As shown in the following formula, the algorithm selects the smaller Q value as the target of the critic network update, which prevents the deviation caused by overestimation of the Q value:

y_t = R_t + γ · min_{i∈{1,2}} Q_{δ'_i}(S_{t+1}, A_{t+1})

wherein γ is the discount factor, only the parameters of one critic target network δ'_i are updated each time, and y_t is the common update target for both critic networks.
In addition, the grasping model generates a small error at each update, and after many updates these errors gradually accumulate and degrade the performance of the final algorithm. Therefore, besides using a delayed policy update technique similar to that in the DDPG algorithm, the algorithm also smooths the values over a small region around the action, i.e. a certain amount of noise is added to the action output by the actor network, so that the formula above becomes:

y_t = R_t + γ · min_{i∈{1,2}} Q_{δ'_i}(S_{t+1}, A_{t+1} + ζ)

where the noise ζ can be regarded as a regularization term that makes the update of the value function smoother. The use of noise here differs from the noise used in the DDPG algorithm: DDPG only adds noise to the final output action that interacts with the environment, with the purpose of improving the algorithm's exploration of the action space.
In contrast, the existing TD3 algorithm adds noise to the action A_t output by the actor network before computing the Q value of the target critic network, with the aim of making the prediction of the Q value more accurate and robust. To update their parameters, the critic networks δ_i approximate the target value y_t according to the following loss function:

L(δ_i) = (1 / N_m) · Σ ( y_t − Q_{δ_i}(S_t, A_t) )²
The actor network and its corresponding target network are denoted η and η', respectively. In order to update the parameters of the actor network η, the policy gradient is computed by the following formula so as to obtain the action that maximizes the Q value:

∇_η J(η) = (1 / N_m) · Σ ∇_A Q_{δ_1}(S_t, A) |_{A = η(S_t)} · ∇_η η(S_t)

wherein N_m represents the number of experiences randomly drawn from the experience pool.
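The twin-critic target, the clipped noise and the delayed actor update described above can be sketched as one TD3-style update step; the PyTorch code below is an illustration under assumed network and optimizer names, not the patent's implementation.

```python
import torch
import torch.nn.functional as F

# Sketch of one TD3-style update: clipped noise on the target action, the smaller of
# the two target-critic Q values as the shared target y_t, and a deterministic policy
# gradient for the actor. `batch` holds tensors (s, a, r, s_next, done) from the pool.
def td3_update(actor, actor_t, critics, critics_t, opts, batch, gamma=0.99,
               noise_std=0.2, noise_clip=0.5, update_actor=True):
    s, a, r, s_next, done = batch
    with torch.no_grad():
        zeta = (torch.randn_like(a) * noise_std).clamp(-noise_clip, noise_clip)
        a_next = actor_t(s_next) + zeta                        # smoothed target action
        q_next = torch.min(critics_t[0](s_next, a_next),
                           critics_t[1](s_next, a_next))       # take the smaller Q value
        y = r + gamma * (1.0 - done) * q_next                  # shared target y_t
    for critic, opt in zip(critics, opts["critic"]):
        loss = F.mse_loss(critic(s, a), y)                     # approximate y_t
        opt.zero_grad(); loss.backward(); opt.step()
    if update_actor:                                           # delayed policy update
        actor_loss = -critics[0](s, actor(s)).mean()           # maximize Q via its gradient
        opts["actor"].zero_grad(); actor_loss.backward(); opts["actor"].step()
```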
In addition, the robot has a high mechanical structure complexity and a large grabbing space range, so that the robot has a high searching space dimension during the interaction with the environment, and reward feedback of grabbing tasks is sparse. In order to obtain an optimal grabbing action strategy, the algorithm needs to spend a lot of training time.
Therefore, this embodiment introduces a hindsight experience replay (HER) mechanism, so that even in the early training stage, when the grasping success rate is low, the robot can quickly learn a useful grasp planning strategy, which accelerates the convergence of the algorithm.
The hindsight experience replay mechanism guides the agent to learn a policy using the idea of being "wise after the event". Although the agent cannot obtain a positive reward in the current round of interaction, it can still accumulate some experience from the current result; although the exploration failed, the "failed" experience can be recorded, the intended goal can be replaced by the achieved result, and a special experience pool can be built for such experiences. When a later interaction happens to have a goal consistent with that of a stored experience, that experience turns into a successful one and helps the agent learn effectively.
In order for the robot to achieve the ability to "learn from failure" in the grasping task, the concept of a goal G must additionally be introduced; G represents the goal state that can be reached from the state S_t through a series of decisions. Meanwhile, to guarantee that experiences whose grasping result is unsuccessful can still be used for experience replay training, the reached state S_t should also be regarded as a goal state G.
Accordingly, on the basis of the original experience tuple (S_t, A_t, R_t, S_{t+1}) and the goal state G, the original experience data are converted into the new experience representation through a simple concatenation operation, namely (S_t‖G, A_t, R_t, S_{t+1}‖G), where ‖ denotes the concatenation operation. For each exploration episode, k new goals G'_k are obtained by sampling future states, together with the corresponding k sets of experience values (S_t‖G'_k, A_t, R'_t, S_{t+1}‖G'_k).
After multiple goal states are sampled with the HER mechanism, the distribution of the experiences the neural network model learns from changes, because the values of the corresponding reward functions change without constraint; this deviation may increase the instability of model training to some extent. Therefore, the HER mechanism is only used in the early stage of training: once the number of training rounds reaches a certain threshold, the goal state G is replaced by the actual state at the end of the grasp, and the neural network parameters are updated using the actual reward.
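The goal-relabelling idea can be sketched as follows; the transition layout and the helper `compute_reward` are assumptions, and the sketch only shows the "sample k future states as substitute goals" step described above.

```python
import random

# Sketch of hindsight experience relabelling. Each stored transition is
# (s, a, r, s_next, goal); future states of the same episode provide k new goals G'.
def her_relabel(episode, k, compute_reward):
    relabelled = []
    for t, (s, a, r, s_next, goal) in enumerate(episode):
        relabelled.append((s, a, r, s_next, goal))               # keep the original goal
        future_states = [step[3] for step in episode[t:]]
        for new_goal in random.sample(future_states, min(k, len(future_states))):
            new_r = compute_reward(s_next, new_goal)             # reward w.r.t. achieved goal
            relabelled.append((s, a, new_r, s_next, new_goal))   # "failed" experience reused
    return relabelled
```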
The construction of the grasping model can comprise the following detailed steps:
firstly, determining a state function according to the pose parameters of the tail end of the clamping jaw, the speed of the tail end of the clamping jaw, the opening and closing state of the clamping jaw, the pose parameters of a target object and the camera parameters corresponding to the maximum confidence coefficient; and the pose parameters of the tail end of the clamping jaw represent the spatial position and the pose of the tail end of the clamping jaw relative to the base.
In this embodiment, the state function can be expressed as:

S_t = {P_t^e, v_t^e, g, P^o, P_c^max}

wherein P^o denotes the pose parameters of the target object; the other parameters have been explained above and are not described in detail here.
And secondly, determining an action function according to the speed of the tail end of the clamping jaw, the rotating angle of the clamping jaw and the opening and closing state of the clamping jaw.
In this embodiment, the action function can be expressed as:
A_t = {v_t^e, θ, g}
and thirdly, determining a reward function according to the grabbing result of the robot, the distance between the clamping jaw and the target object and the moving direction of the robot.
In this embodiment, the reward function may be expressed as:

R_t = R_g + α · R_d + β · R_v

wherein R_g is a binary reward function: R_g = 0 indicates that the robot fails to grab the object, and R_g = 1 indicates that the robot successfully grabs the object; R_d = −d is the opposite number of the distance d between the clamping jaw and the target object; α and β are the weights of the corresponding reward terms; and R_v is the reward for the robot moving along the direction of the line connecting the camera position with the highest viewing-angle confidence and the position of the target object. R_v is calculated from dist, the distance between the current jaw-end position and the line connecting the target object with the optimal-viewing-angle camera, and is granted when dist falls within a previously set distance threshold d_th, which is set to 80 mm in this embodiment (a code sketch of this reward is given after the construction steps below).
And fourthly, constructing a grabbing model by utilizing a multi-view-based reinforcement learning grabbing algorithm according to the state function, the action function and the reward function.
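To make the reward design of the third step concrete, the following is a minimal sketch; the values of α and β, the indicator form of the view-direction term, and the point-to-line computation are assumptions made for illustration, since the patent describes these terms only qualitatively.

```python
import numpy as np

# Sketch of the reward shaping described above (grasp success, jaw-to-target distance,
# and movement along the best-view camera/target line); weights are illustrative.
def reward(grasp_success, jaw_pos, target_pos, best_cam_pos, alpha=0.5, beta=0.2, d_th=0.08):
    r_grasp = 1.0 if grasp_success else 0.0              # binary grasp reward
    r_dist = -np.linalg.norm(jaw_pos - target_pos)       # negative jaw-to-target distance
    # Distance from the jaw end to the line joining the best-view camera and the target.
    line = target_pos - best_cam_pos
    proj = best_cam_pos + line * np.dot(jaw_pos - best_cam_pos, line) / np.dot(line, line)
    dist_to_line = np.linalg.norm(jaw_pos - proj)
    r_view = 1.0 if dist_to_line <= d_th else 0.0        # reward staying near that line
    return r_grasp + alpha * r_dist + beta * r_view
```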
The prior art grabbing strategies are generally:
after the robot obtains the pose information of the tail end of the clamping jaw and the pose information of the object, the motion angles of all joints are obtained through inverse kinematics calculation, and then each joint motor controls the robot to move to the position near the target object according to the angle change trend. And the robot selects a proper grabbing point according to the shape characteristics of the object to complete the grabbing operation of the object.
However, such methods do not consider that obstacles may exist in the robot's movement space; for example, when the target object is occluded, the robot may collide with the occluding object during grasping, creating a safety hazard. To avoid this, these approaches require an additional collision detection module, which increases the setup cost of the system.
Compared with the prior art, the embodiment of the invention determines the state and the action space of the robot for grabbing the problem by using a reinforcement learning method, and then introduces the detection confidence coefficient into the reward function according to the detection confidence coefficients of different visual angles obtained by the two-dimensional detection network, so that the robot can learn the shielding information around the target object and has better obstacle avoidance capability.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
firstly, the robot grasping method provided by the embodiment predicts the three-dimensional coordinates of the key points of the target object by using a multi-view feature fusion mode, and compared with a method for predicting the 6DoF pose of the object by using a color image and a depth image as input in the prior art, the method can effectively solve the problem of low detection precision caused by object overlapping and shielding in a complex scene.
Secondly, the appearance surface of the target object can be described in a simplified manner by using sparse key points, so that the detection model has better generalization performance.
Finally, a hindsight experience replay mechanism is introduced in the training process of the grasping model, so that the robot can quickly learn a useful grasp planning strategy even in the early training stage when the grasping success rate is low, which accelerates the convergence of the algorithm.
In order to perform the corresponding steps in the above described embodiments of the robot gripping method, an implementation applied to a robot gripping device is given below.
Referring to fig. 13, fig. 13 is a block diagram illustrating a robot gripping device 200 according to the present embodiment. The robot gripping device 200 is applied to the electronic device 10, the electronic device 10 is communicatively connected with a robot, and the robot is provided with a camera. The robot gripping device 200 includes: an acquisition module 201, a detection module 202, a calculation module 203 and a control module 204.
The acquiring module 201 is configured to acquire an image set obtained by shooting a target object at multiple viewing angles by a camera and a camera parameter corresponding to each viewing angle.
The detection module 202 is configured to input the image set and each camera parameter into the detection model, and predict the image set by using the detection model to obtain a predicted value of a first coordinate of the target object, where the first coordinate is a three-dimensional coordinate of a key point of the target object.
And the calculating module 203 is configured to calculate a pose parameter of the target object according to the predicted value of the first coordinate and the spatial geometric relationship between the key points, where the pose parameter is used to represent the spatial position and the posture of the target object.
And the control module 204 is configured to control the robot to complete the grabbing of the target object according to the pose parameters and the grabbing model, where the grabbing model is obtained by training in a reinforcement learning manner.
Optionally, the detection model includes a two-dimensional detection network and a three-dimensional detection network, and the image set includes images corresponding to a plurality of viewing angles;
a detection module 202 configured to:
inputting the image set into a target detection network, and cutting the images in the image set by using the target detection network to obtain cut images;
inputting the image set into the two-dimensional detection network, and performing feature extraction on each image by using the two-dimensional detection network to obtain a two-dimensional feature map corresponding to each image and a confidence coefficient corresponding to each image, wherein the two-dimensional feature map is used for representing two-dimensional coordinates of key points of the target object in the image, and the confidence coefficient is used for representing the truth of the target object under a view angle corresponding to each image;
and inputting all the two-dimensional feature maps, the confidence coefficient and the camera parameters into a three-dimensional detection network, and processing all the two-dimensional feature maps by using the three-dimensional detection network with the confidence coefficient as a weight to obtain a predicted value of the first coordinate of the target object.
Optionally, the three-dimensional detection network includes a three-dimensional mapping network, a three-dimensional convolution network, and a loss function regression model;
a detection module 202 configured to:
inputting all the two-dimensional feature maps, the confidence degrees and the camera parameters into a three-dimensional mapping network, and obtaining a three-dimensional feature map of a target object by utilizing a back projection mode, wherein the three-dimensional feature map is used for representing three-dimensional space information of the target object and comprises a plurality of channels, and each channel corresponds to a key point of each target object one by one;
inputting the three-dimensional characteristic graph into a three-dimensional convolution network, and performing characteristic extraction on the three-dimensional characteristic graph by using the three-dimensional convolution network to obtain a three-dimensional characteristic graph corresponding to each key point;
and inputting the three-dimensional feature maps corresponding to all the key points into a loss function regression model, and performing normalization processing on the three-dimensional feature maps corresponding to all the key points by using the loss function regression model to obtain a predicted value of the first coordinate of the target object.
Optionally, the robot comprises a jaw;
a control module 204 to:
acquiring state parameters at the moment t, wherein the state parameters comprise pose parameters of the tail end of the clamping jaw, the speed of the tail end of the clamping jaw and the opening and closing state of the clamping jaw;
inputting the state parameters, the pose parameters of the target object and the camera parameters corresponding to the maximum confidence coefficient into the grabbing model to obtain action parameters at the time t, wherein the action parameters comprise the speed of the tail end of the clamping jaw, the rotating angle of the clamping jaw and the opening and closing state of the clamping jaw;
controlling the robot to move according to the motion parameters, finishing the motion at the time t, and acquiring the state parameters at the time t + 1;
and repeating the step of inputting the state parameters, the pose parameters of the target object and the camera parameters corresponding to the maximum confidence into the grabbing model to obtain the action parameters at the time t until the robot finishes grabbing the target object.
The detection module 202 is further configured to:
acquiring a training sample and a label corresponding to the training sample, wherein the training sample comprises a first image and camera parameters, and the label represents the three-dimensional coordinates of the key point of the reference object;
inputting the first image into a target detection network, and cutting the first image by using the target detection network to obtain a cut first image;
inputting the first image into a two-dimensional detection network, and performing feature extraction on the first image by using the two-dimensional detection network to obtain a two-dimensional feature map corresponding to the first image and a confidence coefficient corresponding to the first image;
inputting the two-dimensional feature map corresponding to the first image, the confidence coefficient corresponding to the first image and the camera parameters into a three-dimensional detection network, and processing the two-dimensional feature maps corresponding to all the first images by using the three-dimensional detection network by taking the confidence coefficient corresponding to the first image as a weight to obtain a prediction result of a second coordinate of the reference object, wherein the second coordinate is a three-dimensional coordinate of a key point of the reference object;
and carrying out back propagation training on the detection model based on the prediction result of the second coordinate of the reference object, the label and a preset loss function to obtain the trained detection model.
A detection module 202 configured to:
calibrating a camera to obtain a first conversion matrix between a camera coordinate system and a clamping jaw tail end coordinate system;
acquiring a reference image obtained by shooting a reference object under a plurality of visual angles by a camera and camera parameters corresponding to each visual angle;
marking the key point coordinates of a reference object in a reference image to obtain a first image;
calculating to obtain a reference three-dimensional coordinate of a key point of a reference object in a camera coordinate system by using a direct linear transformation method according to the first image and the camera parameters;
converting the reference three-dimensional coordinate into a real coordinate under a base coordinate system according to the first conversion matrix and a preset conversion matrix, wherein the preset conversion matrix represents a conversion relation between a clamping jaw tail end coordinate system and the base coordinate system;
and respectively taking the marked reference image and the real coordinates as a training sample and a label corresponding to the training sample.
The control module 204 is further configured to:
acquiring state parameters at the moment t, wherein the state parameters comprise pose parameters of the tail end of the clamping jaw, the speed of the tail end of the clamping jaw and the opening and closing state of the clamping jaw;
inputting the state parameters, the pose parameters of the target object and the camera parameters corresponding to the maximum confidence coefficient into a pre-constructed grabbing model, wherein the grabbing model comprises an actor network and a critic network;
obtaining action parameters at the time t by using an actor network of the grabbing model;
controlling the robot to move according to the motion parameters, finishing the motion at the time t, and acquiring reward parameters;
inputting the reward parameters into a critic network, and predicting the Q value by using the critic network to obtain a prediction result of the Q value, wherein the Q value is used for evaluating the value generated by the robot moving according to the action parameters;
and performing back propagation training on the grasping model according to the prediction result of the Q value, the preset loss function and the preset gradient function to obtain the trained grasping model.
A control module 204 to:
determining a state function according to the pose parameters of the tail end of the clamping jaw, the speed of the tail end of the clamping jaw, the opening and closing state of the clamping jaw, the pose parameters of the target object and the camera parameters corresponding to the maximum confidence coefficient; the pose parameters of the tail end of the clamping jaw represent the spatial position and the pose of the tail end of the clamping jaw relative to the base;
determining an action function according to the speed of the tail end of the clamping jaw, the rotating angle of the clamping jaw and the opening and closing state of the clamping jaw;
determining a reward function according to the grabbing result of the robot, the distance between the clamping jaw and the target object and the moving direction of the robot;
and constructing the grabbing model by utilizing a multi-view-based reinforcement learning grabbing algorithm according to the state function, the action function and the reward function.
It will be apparent to those skilled in the art that, for convenience and brevity of description, the specific working process of the robot gripping device 200 described above is not repeated here; reference may be made to the corresponding process in the foregoing method embodiment.
The present embodiment also provides a computer-readable storage medium, on which a computer program is stored, and the computer program is executed by the processor 12 to implement the robot grasping method disclosed in the above embodiments.
In summary, according to the robot capture method, the robot capture device, the electronic device, and the storage medium provided in the embodiments of the present invention, first, a camera installed on a robot is used to capture a target object at a plurality of viewing angles, so as to obtain an image set and camera parameters corresponding to each viewing angle; then inputting the image set and camera parameters into a pre-trained detection model, predicting the three-dimensional coordinates of key points of the target object to obtain a prediction result, and calculating the pose of the target object according to the spatial geometrical relationship between the prediction result and the key points; and finally, controlling the robot to finish the grabbing of the target object according to the pose of the target object and the pre-trained grabbing model.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (12)

1. The robot grabbing method is characterized by being applied to electronic equipment, wherein the electronic equipment is in communication connection with a robot, and the robot is provided with a camera; the method comprises the following steps:
acquiring an image set obtained by shooting a target object under a plurality of visual angles by the camera and camera parameters corresponding to each visual angle;
inputting the image set and each camera parameter into a detection model, predicting the image set by using the detection model to obtain a predicted value of a first coordinate of the target object, wherein the first coordinate is a three-dimensional coordinate of a key point of the target object;
calculating a pose parameter of the target object according to the predicted value of the first coordinate and the space geometric relationship between the key points, wherein the pose parameter is used for representing the space position and the posture of the target object;
and controlling the robot to complete the grabbing of the target object according to the pose parameters and the grabbing model, wherein the grabbing model is obtained by training in a reinforcement learning mode.
2. The method of claim 1, wherein the detection model comprises an object detection network, a two-dimensional detection network, and a three-dimensional detection network, the set of images comprising images corresponding to a plurality of perspectives;
the step of inputting the image set and each camera parameter into a detection model, and predicting the image set by using the detection model to obtain a predicted value of the first coordinate of the target object includes:
inputting the image set into the target detection network, and utilizing the target detection network to cut the images in the image set to obtain cut images;
inputting the clipped images into the two-dimensional detection network, and performing feature extraction on each clipped image by using the two-dimensional detection network to obtain a two-dimensional feature map corresponding to each image and a confidence coefficient corresponding to each image, wherein the two-dimensional feature map is used for representing two-dimensional coordinates of key points of the target object in the image, and the confidence coefficient is used for representing the truth of the target object under a view angle corresponding to each image;
and inputting all the two-dimensional feature maps, the confidence degrees and the camera parameters into the three-dimensional detection network, and processing all the two-dimensional feature maps by using the three-dimensional detection network with the confidence degrees as weights to obtain a predicted value of the first coordinate of the target object.
3. The method of claim 2, wherein the three-dimensional detection network comprises a three-dimensional mapping network, a three-dimensional convolution network, and a loss function regression model;
the step of inputting all the two-dimensional feature maps, the confidence degrees and the camera parameters into the three-dimensional detection network, and processing all the two-dimensional feature maps by using the three-dimensional detection network with the confidence degrees as weights to obtain the predicted value of the first coordinate of the target object includes:
inputting all the two-dimensional feature maps, the confidence degrees and the camera parameters into the three-dimensional mapping network, and obtaining a three-dimensional feature map of the target object by utilizing a back projection mode, wherein the three-dimensional feature map is used for representing three-dimensional space information of the target object, the three-dimensional feature map comprises a plurality of channels, and each channel corresponds to a key point of each target object one to one;
inputting the three-dimensional characteristic graph into the three-dimensional convolution network, and performing characteristic extraction on the three-dimensional characteristic graph by using the three-dimensional convolution network to obtain a three-dimensional characteristic graph corresponding to each key point;
and inputting the three-dimensional feature maps corresponding to all the key points into the loss function regression model, and performing normalization processing on the three-dimensional feature maps corresponding to all the key points by using the loss function regression model to obtain a predicted value of the first coordinate of the target object.
4. The method of claim 1, wherein the robot includes a gripper, and wherein the step of controlling the robot to complete the grasp of the target object in accordance with the pose parameters and a grasp model comprises:
acquiring state parameters at the moment t, wherein the state parameters comprise pose parameters of the tail end of the clamping jaw, the speed of the tail end of the clamping jaw and the opening and closing state of the clamping jaw;
inputting the state parameters, the pose parameters of the target object and the camera parameters corresponding to the maximum confidence coefficient into the grabbing model to obtain action parameters at the moment t, wherein the action parameters comprise the speed of the tail end of the clamping jaw, the rotating angle of the clamping jaw and the opening and closing state of the clamping jaw;
controlling the robot to move according to the action parameters, finishing the action at the time t, and acquiring the state parameters at the time t + 1;
and repeatedly executing the step of inputting the state parameters, the pose parameters of the target object and the camera parameters corresponding to the maximum confidence degree into the grabbing model to obtain the action parameters at the moment t until the robot finishes grabbing the target object.
5. The method of claim 1, wherein the detection model is trained by:
acquiring a training sample and a label corresponding to the training sample, wherein the training sample comprises a first image and camera parameters, and the label represents a three-dimensional coordinate of a key point of a reference object;
inputting the first image into the target detection network, and utilizing the target detection network to cut the first image to obtain a cut first image;
inputting the clipped first image into the two-dimensional detection network, and performing feature extraction on the clipped first image by using the two-dimensional detection network to obtain a two-dimensional feature map corresponding to the first image and a confidence coefficient corresponding to the first image;
inputting the two-dimensional feature map corresponding to the first image, the confidence coefficient corresponding to the first image and the camera parameters into the three-dimensional detection network, and processing the two-dimensional feature maps corresponding to all the first images by using the three-dimensional detection network with the confidence coefficient corresponding to the first image as a weight to obtain a prediction result of a second coordinate of the reference object, wherein the second coordinate is a three-dimensional coordinate of a key point of the reference object;
and carrying out back propagation training on the detection model based on the prediction result of the second coordinate of the reference object, the label and a preset loss function to obtain the trained detection model.
6. The method of claim 5, wherein the loss function is:

L = α · L_2D + L_3D

wherein α is the weight of the loss term of the two-dimensional coordinates of the keypoint, L_2D is a loss function representing the two-dimensional feature map, and L_3D is a loss function representing the three-dimensional feature map;

the loss function of the two-dimensional feature map is:

L_2D = Σ_{n=1}^{N} Σ_{k=1}^{K} || F_{n,k} − F̂_{n,k} ||²

wherein F_{n,k} represents a predicted value of the two-dimensional feature map, and F̂_{n,k} represents a tag value of the two-dimensional feature map;

the loss function of the three-dimensional feature map is:

L_3D = Σ_{k=1}^{K} || d_k − d̂_k || + γ · L_reg

wherein d_k represents a predicted value of the three-dimensional coordinates of the keypoint, d̂_k represents a true value of the three-dimensional coordinates of the keypoint, γ represents a weight, and the γ-weighted term L_reg is used for enhancing the accuracy of the prediction result output by the three-dimensional detection network.
7. The method of claim 5, wherein the robot further comprises a base, and wherein the step of obtaining a training sample and a label corresponding to the training sample comprises:
calibrating the camera to obtain a first conversion matrix between a camera coordinate system and a clamping jaw tail end coordinate system;
acquiring a reference image obtained by shooting the reference object under a plurality of visual angles by the camera and camera parameters corresponding to each visual angle;
marking the coordinates of the key points of the reference object in the reference image to obtain the first image;
calculating to obtain a reference three-dimensional coordinate of a key point of the reference object in the camera coordinate system by using a direct linear transformation method according to the first image and the camera parameters;
converting the reference three-dimensional coordinate into a real coordinate under a base coordinate system according to the first conversion matrix and a preset conversion matrix, wherein the preset conversion matrix represents a conversion relation between the clamping jaw tail end coordinate system and the base coordinate system;
and respectively taking the marked reference image and the real coordinates as labels corresponding to the training samples and the training samples.
8. The method of claim 1, wherein the grip model is trained by:
acquiring state parameters at the moment t, wherein the state parameters comprise pose parameters of the tail end of the clamping jaw, the speed of the tail end of the clamping jaw and the opening and closing state of the clamping jaw;
inputting the state parameters, the pose parameters of the target object and the camera parameters corresponding to the maximum confidence into the pre-constructed grabbing model, wherein the grabbing model comprises an actor network and a critic network;
obtaining action parameters at the time t by utilizing the actor network of the grabbing model;
controlling the robot to move according to the action parameters, finishing the action at the time t, and acquiring reward parameters;
inputting the reward parameters into the critic network, and predicting a Q value by using the critic network to obtain a prediction result of the Q value, wherein the Q value is used for evaluating the value generated by the robot moving according to the action parameters;
and performing back propagation training on the grabbing model according to the prediction result of the Q value, a preset loss function and a preset gradient function to obtain a trained grabbing model.
9. The method of claim 8, wherein the grip model is constructed by:
determining a state function according to the pose parameters of the tail end of the clamping jaw, the speed of the tail end of the clamping jaw, the opening and closing state of the clamping jaw, the pose parameters of the target object and the camera parameters corresponding to the maximum confidence coefficient; wherein the pose parameters of the jaw tip characterize the spatial position and pose of the jaw tip relative to the base;
determining an action function according to the speed of the tail end of the clamping jaw, the rotating angle of the clamping jaw and the opening and closing state of the clamping jaw;
determining a reward function according to the grabbing result of the robot, the distance between the clamping jaw and the target object and the moving direction of the robot;
and constructing the grabbing model by utilizing a multi-view-based reinforcement learning grabbing algorithm according to the state function, the action function and the reward function.
10. The robot grabbing device is applied to electronic equipment, the electronic equipment is in communication connection with a robot, and the robot is provided with a camera; the device comprises:
the acquisition module is used for acquiring an image set obtained by shooting a target object under a plurality of visual angles by the camera and camera parameters corresponding to each visual angle;
the detection module is used for inputting the image set and each camera parameter into a detection model, predicting the image set by using the detection model to obtain a predicted value of a first coordinate of the target object, wherein the first coordinate is a three-dimensional coordinate of a key point of the target object;
the calculation module is used for calculating a pose parameter of the target object according to the predicted value of the first coordinate and the space geometric relationship between the key points, wherein the pose parameter is used for representing the space position and the posture of the target object;
and the control module is used for controlling the robot to complete the grabbing of the target object according to the pose parameters and the grabbing model, wherein the grabbing model is obtained by training in a reinforcement learning mode.
11. An electronic device, characterized in that the electronic device comprises:
one or more processors;
memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the robot grasping method according to any one of claims 1 to 9.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the robot grasping method according to any one of claims 1 to 9.
CN202111659492.1A 2021-12-31 2021-12-31 Robot grabbing method and device, electronic equipment and storage medium Pending CN114387513A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111659492.1A CN114387513A (en) 2021-12-31 2021-12-31 Robot grabbing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111659492.1A CN114387513A (en) 2021-12-31 2021-12-31 Robot grabbing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114387513A true CN114387513A (en) 2022-04-22

Family

ID=81199396

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111659492.1A Pending CN114387513A (en) 2021-12-31 2021-12-31 Robot grabbing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114387513A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114700949A (en) * 2022-04-25 2022-07-05 浙江工业大学 Voxel grabbing network-based mechanical arm flexible grabbing planning method
CN114700949B (en) * 2022-04-25 2024-04-09 浙江工业大学 Mechanical arm smart grabbing planning method based on voxel grabbing network
CN114998277A (en) * 2022-06-16 2022-09-02 吉林大学 Grab point identification method and device, electronic equipment and computer storage medium
CN114998277B (en) * 2022-06-16 2024-05-17 吉林大学 Grabbing point identification method and device, electronic equipment and computer storage medium
CN116214524A (en) * 2023-05-08 2023-06-06 国网浙江省电力有限公司宁波供电公司 Unmanned aerial vehicle grabbing method and device for oil sample recovery and storage medium
CN116214524B (en) * 2023-05-08 2023-10-03 国网浙江省电力有限公司宁波供电公司 Unmanned aerial vehicle grabbing method and device for oil sample recovery and storage medium
CN116330306A (en) * 2023-05-31 2023-06-27 之江实验室 Object grabbing method and device, storage medium and electronic equipment
CN116330306B (en) * 2023-05-31 2023-08-15 之江实验室 Object grabbing method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
US11325252B2 (en) Action prediction networks for robotic grasping
CN108491880B (en) Object classification and pose estimation method based on neural network
CN110059558B (en) Orchard obstacle real-time detection method based on improved SSD network
CN114387513A (en) Robot grabbing method and device, electronic equipment and storage medium
Schmidt et al. Grasping of unknown objects using deep convolutional neural networks based on depth images
CN108986801B (en) Man-machine interaction method and device and man-machine interaction terminal
WO2020228446A1 (en) Model training method and apparatus, and terminal and storage medium
CN108010078B (en) Object grabbing detection method based on three-level convolutional neural network
CN107045631B (en) Method, device and equipment for detecting human face characteristic points
CN113409384B (en) Pose estimation method and system of target object and robot
JP2018205929A (en) Learning device, learning method, learning model, detection device and gripping system
Wu et al. Pixel-attentive policy gradient for multi-fingered grasping in cluttered scenes
CN107813310A (en) One kind is based on the more gesture robot control methods of binocular vision
CN107146237B (en) Target tracking method based on online state learning and estimation
CN114556268B (en) Gesture recognition method and device and storage medium
CN111368759B (en) Monocular vision-based mobile robot semantic map construction system
Tang et al. Learning collaborative pushing and grasping policies in dense clutter
CN113752255B (en) Mechanical arm six-degree-of-freedom real-time grabbing method based on deep reinforcement learning
CN115816460B (en) Mechanical arm grabbing method based on deep learning target detection and image segmentation
CN112164115A (en) Object pose identification method and device and computer storage medium
CN111429481B (en) Target tracking method, device and terminal based on adaptive expression
US20220402125A1 (en) System and method for determining a grasping hand model
CN110348359B (en) Hand gesture tracking method, device and system
CN115546549A (en) Point cloud classification model construction method, point cloud classification method, device and equipment
Mayer et al. FFHNet: Generating multi-fingered robotic grasps for unknown objects in real-time

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination