CN113326666A - Robot intelligent grabbing method based on convolutional neural network differentiable structure searching - Google Patents

Robot intelligent grabbing method based on convolutional neural network differentiable structure searching

Info

Publication number
CN113326666A
Authority
CN
China
Prior art keywords
neural network
grabbing
optimization
calculation
robot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110802383.4A
Other languages
Chinese (zh)
Other versions
CN113326666B (en)
Inventor
胡伟飞
焦清
邵金毅
王楚璇
刘振宇
谭建荣
刘飞香
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
China Railway Construction Heavy Industry Group Co Ltd
Original Assignee
Zhejiang University ZJU
China Railway Construction Heavy Industry Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU, China Railway Construction Heavy Industry Group Co Ltd filed Critical Zhejiang University ZJU
Priority to CN202110802383.4A priority Critical patent/CN113326666B/en
Publication of CN113326666A publication Critical patent/CN113326666A/en
Application granted granted Critical
Publication of CN113326666B publication Critical patent/CN113326666B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2113/00Details relating to the application field
    • G06F2113/28Fuselage, exterior or interior

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a robot intelligent grabbing method based on differentiable structure searching of convolutional neural networks. The method first constructs a training set and a verification set, then builds a discrete chain search space and relaxes it to a continuous one. A gradient-based neural network double-layer optimization model, taking both the calculation speed and the accuracy of the neural network as optimization targets, is established to optimize the grabbing attitude neural network, finally yielding a grabbing attitude generation network with optimal parameters. Inputting a new RGB-D image into the trained network generates the optimal grabbing posture. Because grabbing quality judgment and grabbing posture generation are both completed by a fully convolutional neural network, the method greatly improves the calculation efficiency of the neural network and solves the problem of excessive calculation in the optimization process.

Description

Robot intelligent grabbing method based on convolutional neural network differentiable structure searching
Technical Field
The invention relates to the field of robot grabbing algorithms, in particular to a robot intelligent grabbing method based on convolutional neural network differentiable structure searching.
Background
In robot intelligent grabbing, a hand-eye robot system grabs objects of unknown shapes and colors and places them in a specified area. The core of realizing intelligent grabbing is to obtain an effective grabbing gesture from an image or digital model containing the color and shape information of the object.
Existing robot intelligent grabbing methods can be divided into physical analysis methods and empirical model methods. A physical analysis method obtains a suitable grabbing gesture directly from a three-dimensional model of the object through mechanical analysis; such methods are computationally expensive, simplify away many real-world constraints, and suffer from poor generalization and long calculation time. An empirical model method, mainly based on deep learning, learns the grabbing mode of specific objects from a data set, typically using a convolutional neural network to process a depth image of the object and predict the grabbing posture.
However, robot intelligent grabbing demands higher recognition accuracy and faster operation speed from the convolutional neural network, and most neural networks currently applied to robot grabbing are designed manually by experts in the field of deep learning according to professional knowledge and experience, which consumes a large amount of computing resources and time.
Disclosure of Invention
Aiming at the defects of the prior art, the invention discloses a robot intelligent grabbing method based on convolutional neural network differentiable structure searching. The specific technical scheme is as follows:
a robot intelligent grabbing method based on convolutional neural network differentiable structure searching comprises the following steps:
(1) constructing a training set and a verification set of a grabbing posture generation network, wherein the training set and the verification set each comprise network inputs and outputs, the network input being an RGB-D image and the output being the grabbing quality Q, grabbing angle Φ and grabbing opening W corresponding to each pixel point;
(2) constructing a chain search space consisting of a plurality of nodes, and determining candidate convolution calculation operation among the nodes;
(3) relaxing the discrete chained search space to continuous;
(4) establishing a gradient-based neural network double-layer optimization model that takes both the calculation speed and the accuracy of the neural network as optimization targets, wherein the double-layer model comprises an inner-layer optimization and an outer-layer optimization: the inner-layer optimization trains all weight coefficients w of the neural network on the training set, and the outer-layer optimization trains the neural network operation variables α on the verification set given the trained weight coefficients w*;
then selecting the operation variables α to compose a convolutional neural network, and retraining the weight coefficients on the training set to obtain a grabbing attitude generation network with optimal parameters;
(5) inputting an RGB-D image shot by a depth camera at the end of the robot into the grabbing gesture generation network with optimal parameters, and outputting three single-channel characteristic images, of the same length and width as the input image, giving the grabbing quality Q, grabbing angle Φ and grabbing opening W corresponding to each pixel point;
(6) selecting the pixel point with the largest grabbing quality Q from the images obtained in step (5), taking its position as the central position of the grabbing frame, and controlling the robot and the mechanical claw through the upper computer to grab the object.
Further, the training set and the verification set in step (1) are obtained from an existing robot intelligent grabbing data set that provides RGB-D images together with grabbing frames of successful grabs; the grabbing quality Q, grabbing angle Φ and grabbing opening W of each pixel point are generated by the following preprocessing of the three characteristics:
equally dividing each grabbing frame into three parts along the grabbing width direction; for the part located in the center, filling the grabbing quality q with 1, filling the rotation angle Φ of the grabbing frame relative to the picture, and filling the grabbing opening w, wherein Φ ∈ [−π/2, π/2] and w ∈ [0, 150]; the two parts located on both sides are filled with grabbing quality q = 0.
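For illustration, this preprocessing can be sketched in Python as follows; the rectangle tuple format and the use of OpenCV rasterization are assumptions, not part of the patent, and must be adapted to the data set actually used:

```python
import numpy as np
import cv2

def rects_to_label_maps(h, w, rects):
    """Rasterize ground-truth grabbing frames into the three per-pixel
    label maps Q (quality), Phi (angle) and W (opening width).
    `rects` holds (cx, cy, angle_rad, opening, jaw_span) tuples -- an
    assumed format; adapt the field order to the data set actually used."""
    Q = np.zeros((h, w), np.float32)
    Phi = np.zeros((h, w), np.float32)
    W = np.zeros((h, w), np.float32)
    for cx, cy, angle, opening, jaw_span in rects:
        # keep only the central third of the frame along the grabbing
        # width direction; the two outer thirds stay at q = 0
        box = cv2.boxPoints(((cx, cy), (opening, jaw_span / 3.0),
                             np.degrees(angle)))
        mask = np.zeros((h, w), np.uint8)
        cv2.fillPoly(mask, [box.astype(np.int32)], 1)
        m = mask.astype(bool)
        Q[m] = 1.0                        # graspable
        Phi[m] = angle                    # filled rotation angle
        W[m] = np.clip(opening, 0, 150)   # filled opening in [0, 150]
    return Q, Phi, W
```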
Furthermore, the chain search space is composed of a plurality of nodes, each node representing an intermediate result after a calculation operation; the nodes are connected by directed arrow lines, each of which represents all possible candidate neural network calculation operations. A neural network calculation operation refers to a calculation between two nodes using convolution kernels of different sizes and different numbers of convolution layers. The nodes are connected in a chain, which maximizes the use of computing resources and accelerates the convergence of the optimization algorithm.
Further, the step (3) of relaxing the discrete chain search space to be continuous is realized by the following steps:
assigning a normalized continuous variable α to the operations between the original nodes, so that the discrete operations are expressed by the continuous variable α; specifically, each directed arrow line between nodes is multiplied by its corresponding variable α, and the results are summed as the final calculation result:

$$x_j = \sum_{i=1}^{n_j} \frac{e^{\alpha_i^{(j)}}}{\sum_{k=1}^{n_j} e^{\alpha_k^{(j)}}}\, o_i^{(j)}(x_{j-1}) \tag{1}$$

where e is the base of the natural logarithm, $x_j$ is the output of the j-th node, $\alpha_i^{(j)}$ is the i-th operation variable in the j-th node, $o_i^{(j)}$ is the i-th operation equation in the j-th node, and $n_j$ is the number of operations contained in the j-th node.
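As an illustration of formula (1), the sketch below implements one chain node as a softmax-weighted mixture of candidate operations in PyTorch; the candidate set of 1×1, 3×3 and 5×5 convolutions is an assumption, since the patent does not fix it:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One node of the chain search space: the output is the
    softmax(alpha)-weighted sum of all candidate convolutions,
    implementing formula (1). The candidate set (1x1, 3x3, 5x5
    kernels) is illustrative."""
    def __init__(self, channels: int):
        super().__init__()
        self.ops = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2)
            for k in (1, 3, 5))
        # one architecture variable alpha_i per candidate operation
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(self.ops)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.alpha, dim=0)    # normalize the alphas
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# chain-structured search space: nodes connected head to tail
supernet = nn.Sequential(*(MixedOp(channels=32) for _ in range(4)))
```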
Further, the inner-layer optimization trains all weight coefficients w of the neural network by computing a loss function on the training set, with the operation variables α between all neural network nodes fixed; the outer-layer optimization trains the neural network operation variables α by computing a loss function on the verification set, given the trained weight coefficients w*.
The specific calculation is given by formulas (2) and (3), where formula (3) is the inner optimization function and formula (2) is the outer optimization function; in order to take both calculation accuracy and time as optimization targets, a delay factor is introduced into the outer optimization function, which adjusts the loss function by the quotient of the floating-point operation count of the current neural network and that of the target neural network:

$$\min_{\alpha}\ \mathcal{L}_{val}\bigl(w^*(\alpha),\,\alpha\bigr)\cdot\Bigl(\tfrac{F(m)}{T}\Bigr)^{\omega} \tag{2}$$

$$\text{s.t.}\quad w^*(\alpha)=\arg\min_{w}\ \mathcal{L}_{train}(w,\,\alpha) \tag{3}$$

where $\mathcal{L}_{val}$ is the loss function calculated on the verification set, $\mathcal{L}_{train}$ is the loss function calculated on the training set, F is the equation of the floating-point operation count of the neural network, m is the neural network structure obtained through discretization, T is the set target floating-point operation count, and ω is a constant that controls the magnitude of the delay factor effect.
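The effect of the delay factor can be seen in a short sketch; the value ω = 0.5 is purely illustrative, since the patent only states that ω is a constant:

```python
import torch

def outer_loss(val_loss: torch.Tensor, flops_current: float,
               flops_target: float, omega: float = 0.5) -> torch.Tensor:
    """Latency-aware outer objective of formula (2): scale the
    validation loss by the delay factor (F(m)/T)**omega so that
    networks exceeding the FLOPs budget are penalized."""
    return val_loss * (flops_current / flops_target) ** omega

# e.g. a candidate needing 1.5x the FLOPs budget has its loss inflated:
print(outer_loss(torch.tensor(0.20), flops_current=3.0e9,
                 flops_target=2.0e9))   # 0.20 * 1.5**0.5 ~= 0.245
```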
Further, in step (4), in order to enable the outer function to determine the correct convergence gradient more quickly, the inner function is iterated until it approaches convergence before the outer function is updated; meanwhile, convergence is judged by observing the loss sets obtained over multiple iterations of the inner function, which prevents the optimization from stopping at a local optimum. The convergence criterion of the inner function is defined as follows:

$$\frac{G_k^{\max}-G_k^{\min}}{\bar{G}_k}\le\varepsilon_1,\qquad \frac{\bigl|\bar{G}_k-\bar{G}_{k-1}\bigr|}{\bar{G}_{k-1}}\le\varepsilon_2$$

with

$$G_k=\bigl\{\mathcal{L}_{train}(w_i)\ \big|\ i=(k-1)N+1,\dots,kN\bigr\},\quad \bar{G}_k=\frac{1}{N}\sum_{\ell\in G_k}\ell,\quad G_k^{\max}=\max G_k,\quad G_k^{\min}=\min G_k$$

where N is the number of inner-function calculations in each group, $G_k$ is the k-th set of inner-function loss values, $\bar{G}_k$ is the mean of the k-th loss value set, $G_k^{\max}$ is its maximum, $G_k^{\min}$ is its minimum, $w_i$ is the weight coefficient of the network obtained at the i-th optimization iteration, $\varepsilon_1$ is the fluctuation convergence threshold of the loss function set, and $\varepsilon_2$ is the mean convergence threshold of the loss function set.
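A sketch of this convergence test over groups of N consecutive inner-loop losses follows; N, ε1 and ε2 are illustrative values, and the exact inequalities follow the reconstruction above:

```python
import statistics

def inner_converged(losses, N=20, eps1=0.05, eps2=0.01):
    """Convergence test on groups of N consecutive inner-loop losses:
    the last group must fluctuate little (eps1) and its mean must be
    close to the previous group's mean (eps2). Thresholds are
    illustrative; the patent leaves eps1 and eps2 to the practitioner."""
    if len(losses) < 2 * N:
        return False
    G_prev, G_k = losses[-2 * N:-N], losses[-N:]
    mean_prev, mean_k = statistics.mean(G_prev), statistics.mean(G_k)
    fluctuation = (max(G_k) - min(G_k)) / mean_k
    drift = abs(mean_k - mean_prev) / mean_prev
    return fluctuation < eps1 and drift < eps2
```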
Further, in step (4), after the optimization is completed, the operation variables α of the neural network are ranked; the single operation with the highest α value (or the top several operations with higher α values) is selected to compose the final convolutional neural network, and the weight coefficients w are retrained on the training set to obtain the trained neural network.
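Continuing the MixedOp sketch above, discretization reduces to an argmax per node:

```python
import torch
import torch.nn as nn

def discretize(supernet: nn.Sequential) -> nn.Sequential:
    """Keep, in every chain node, only the candidate operation with the
    largest architecture variable alpha (the patent also allows keeping
    the top several). Assumes the MixedOp sketch shown earlier."""
    chosen = [node.ops[int(torch.argmax(node.alpha))] for node in supernet]
    return nn.Sequential(*chosen)
```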
Further, the physical grabbing environment where the robot is located comprises a physical robot, two parallel adaptive clamping jaws, a depth camera and an object set to be grabbed; the two parallel self-adaptive clamping jaws and the depth camera are fixed at the tail end of the physical robot, and the relative positions of the two parallel self-adaptive clamping jaws and the depth camera are unchanged in the motion process; the two parallel self-adaptive clamping jaws are perpendicular to the grabbing plane.
The invention has the following beneficial effects:
(1) compared with other intelligent grabbing methods, the grabbing posture generation network provided by the invention avoids candidate posture sampling and candidate grabbing evaluation in the color-depth picture; grabbing quality judgment and grabbing posture generation are both completed by a fully convolutional neural network, which greatly improves the calculation efficiency of the neural network.
(2) the invention adopts a chain search framework, converts the optimization of a discrete network structure into continuous variable optimization by assigning operation variables between network nodes, and establishes a neural network double-layer optimization model that optimizes the network structure after training the network weights, solving the problem of excessive calculation in the optimization process.
(3) by introducing the delay factor into the outer optimization function, the calculation speed of the neural network is also included in the optimization target, so that accuracy and speed are optimized simultaneously and the resulting optimal grabbing is closer to the actual industrial scene.
Drawings
FIG. 1 is a schematic diagram of a robot smart grab of the present invention;
FIG. 2 is a schematic diagram of a physical grabbing environment;
FIG. 3 is a flow diagram of a grab decision network workflow;
FIG. 4 is a schematic diagram of a processing method for capturing and generating network training data;
FIG. 5 is a schematic diagram of grab generation network inputs and outputs.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and preferred embodiments, so that its objects and effects become more apparent. It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
As shown in fig. 1, the physical environment required for intelligent grabbing of the present invention comprises a hand-eye robot, a two-finger parallel adaptive gripper, a depth camera, and a set of objects to be grabbed. The hand-eye robot and the two-finger parallel adaptive clamping jaw are the main grabbing actuators and are responsible for transmitting position and posture information to the upper computer; the depth camera is responsible for transmitting point cloud information of the grabbed object to the upper computer. In this embodiment, the robot is a 6-axis collaborative robot, the depth camera can acquire color pictures and 2.5D depth point cloud pictures, and the set of objects to be grabbed is one or more objects randomly placed on a horizontal plane in the working space of the robot. The depth camera is mounted eye-in-hand, i.e. the camera is fixed relative to the end of the robot. The robot can acquire the pose of its tool coordinate system, and the position and posture of the depth camera can be obtained through hand-eye calibration from the camera coordinate system to the tool coordinate system, thereby determining the posture and working state of the main hardware in the current physical environment and obtaining the point cloud information of the objects to be grabbed and placed.
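The composition of the tool pose with the hand-eye transform can be sketched with 4×4 homogeneous matrices; the function and argument names below are illustrative:

```python
import numpy as np

def camera_pose_in_base(T_base_tool: np.ndarray,
                        T_tool_cam: np.ndarray) -> np.ndarray:
    """The camera pose in the robot base frame is the tool pose
    (queried from the robot controller) composed with the constant
    tool->camera transform found by hand-eye calibration."""
    return T_base_tool @ T_tool_cam
```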
As shown in fig. 2 and 3, the robot intelligent grabbing method based on convolutional neural network differentiable structure search runs in the upper computer: a grabbing attitude neural network is constructed and then optimized through a gradient-based neural network double-layer optimization model, with both calculation accuracy and time taken as optimization targets.
The task of the grabbing pose generation network is to take the RGB image P_c and the depth image P_d produced by the same depth camera as input, and to identify and grab the objects given in the picture. In this embodiment, all grabs considered by the network are perpendicular to the object placement plane, i.e. the object is placed on a horizontal plane and the gripper grabs perpendicular to that plane. The RGB-D picture composed of P_c (the color picture, with three RGB channels) and P_d (the depth picture, with a single depth channel) is denoted P_s.
A grab in picture space, perpendicular to the horizontal plane, is defined by g = (p, Φ, w, q), where p = (u, v) determines the pixel position of the grab, Φ defines the rotation angle of the gripper around the vertical direction during grabbing, w defines the opening of the gripper jaws during grabbing, and q defines the grabbing quality; the larger the value of q, the greater the likelihood of a successful grab at that position.
The grabbing gesture generation network makes a grabbing prediction for every pixel of the input picture, giving the grabbing angle and grabbing width required when grabbing at that pixel together with the probability that the grab succeeds. As shown in fig. 5, the network outputs three characteristic images G = {Φ, W, Q} ∈ R^{H×W×3}; the pixel value at each (u, v) location represents the corresponding physical quantity of the grab, and these pixels together form the output of the network.
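For illustration, such a fully convolutional output head can be sketched as follows; the 1×1 convolutions and the sigmoid on Q are assumptions, with the searched backbone sitting in front of this head:

```python
import torch
import torch.nn as nn

class GraspHead(nn.Module):
    """Illustrative fully convolutional head: from a feature map it
    produces the three single-channel output images Q, Phi, W at the
    input resolution (the backbone in between is found by the search)."""
    def __init__(self, in_ch: int = 32):
        super().__init__()
        self.q = nn.Conv2d(in_ch, 1, kernel_size=1)
        self.phi = nn.Conv2d(in_ch, 1, kernel_size=1)
        self.w = nn.Conv2d(in_ch, 1, kernel_size=1)

    def forward(self, feat):
        # sigmoid keeps Q in [0, 1]; Phi and W are regressed directly
        return torch.sigmoid(self.q(feat)), self.phi(feat), self.w(feat)
```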
To achieve this, the data set needs to be processed to train the grabbing generation network. The open-source Cornell Grasping Dataset provides color-depth images together with grabbing frames of successful grabs. As shown in fig. 4, the source data set is preprocessed as follows: for the grabbing quality Q, the central 1/3 portion of each grabbing frame along the grabbing width is regarded as a suitable grabbing position; for this portion, the grabbing quality q is filled with 1, the rotation angle Φ of the grabbing frame relative to the picture is filled within [−π/2, π/2], and the grabbing opening w is filled within [0, 150]. The other portions are regarded as places where grabbing is impossible, and the grabbing quality q is set to 0. Similarly, in generating Φ and W, the regions with grabbing quality 0 have their grabbing angle and opening set to 0 accordingly, and are no longer considered graspable.
The neural network structure optimization algorithm is gradient-based and takes calculation accuracy and time as optimization targets at the same time, so as to obtain a neural network structure that balances accuracy and speed.
The neural network structure optimization algorithm mainly involves two aspects: the search space and the search algorithm. The search space is formed by a plurality of nodes connected in a chain structure; each node represents an intermediate result after a calculation operation, and the nodes are connected by directed arrow lines representing all possible candidate neural network calculation operations. In order to make the most of the convolution characteristics, a neural network calculation operation refers to a calculation between two nodes using convolution kernels of different sizes and different numbers of convolution layers; the chain connection of the nodes maximizes the use of computing resources and accelerates the convergence of the optimization algorithm.
After the number of nodes and the candidate operations among them are determined, a normalized variable α is assigned to the operations between the original nodes, so that the discrete operations are expressed by the continuous variable α; this relaxes the discrete search space to a continuous one, so that a gradient-based double-layer optimization model of the neural network structure can be established. Specifically, each directed arrow line between nodes is multiplied by its corresponding variable α, and the results are summed as the final calculation result:

$$x_j = \sum_{i=1}^{n_j} \frac{e^{\alpha_i^{(j)}}}{\sum_{k=1}^{n_j} e^{\alpha_k^{(j)}}}\, o_i^{(j)}(x_{j-1}) \tag{1}$$

where e is the base of the natural logarithm, $x_j$ is the output of the j-th node, $\alpha_i^{(j)}$ is the i-th operation variable in the j-th node, $o_i^{(j)}$ is the i-th operation equation in the j-th node, and $n_j$ is the number of operations contained in the j-th node.
The double-layer optimization model comprises an inner-layer optimization and an outer-layer optimization: the inner-layer optimization trains all weight coefficients w of the neural network on the training set; the outer-layer optimization trains the neural network operation variables α on the verification set given the trained weight coefficients w*.
The inner-layer optimization trains all weight coefficients w of the neural network by computing a loss function on the training set, with the operation variables α between all neural network nodes fixed; the outer-layer optimization trains the neural network operation variables α by computing a loss function on the verification set, given the trained weight coefficients w*. The specific calculation is given by formulas (2) and (3), where formula (3) is the inner optimization function and formula (2) is the outer optimization function; in order to take both calculation accuracy and time as optimization targets, a delay factor is introduced into the outer optimization function, which adjusts the loss function by the quotient of the floating-point operation count of the current neural network and that of the target neural network:

$$\min_{\alpha}\ \mathcal{L}_{val}\bigl(w^*(\alpha),\,\alpha\bigr)\cdot\Bigl(\tfrac{F(m)}{T}\Bigr)^{\omega} \tag{2}$$

$$\text{s.t.}\quad w^*(\alpha)=\arg\min_{w}\ \mathcal{L}_{train}(w,\,\alpha) \tag{3}$$

where $\mathcal{L}_{val}$ is the loss function calculated on the verification set, $\mathcal{L}_{train}$ is the loss function calculated on the training set, F is the equation of the floating-point operation count of the neural network, m is the neural network structure obtained through discretization, T is the set target floating-point operation count, and ω is a constant that controls the magnitude of the delay factor effect.
In addition, in order to enable the outer function to determine the correct convergence gradient more quickly, the inner function is iterated until it approaches convergence before the outer function is updated; meanwhile, convergence is judged by observing the loss sets obtained over multiple iterations of the inner function, which prevents the optimization from stopping at a local optimum. The convergence criterion of the inner function is defined as follows:

$$\frac{G_k^{\max}-G_k^{\min}}{\bar{G}_k}\le\varepsilon_1,\qquad \frac{\bigl|\bar{G}_k-\bar{G}_{k-1}\bigr|}{\bar{G}_{k-1}}\le\varepsilon_2$$

with

$$G_k=\bigl\{\mathcal{L}_{train}(w_i)\ \big|\ i=(k-1)N+1,\dots,kN\bigr\},\quad \bar{G}_k=\frac{1}{N}\sum_{\ell\in G_k}\ell,\quad G_k^{\max}=\max G_k,\quad G_k^{\min}=\min G_k$$

where N is the number of inner-function calculations in each group, $G_k$ is the k-th set of inner-function loss values, $\bar{G}_k$ is the mean of the k-th loss value set, $G_k^{\max}$ is its maximum, $G_k^{\min}$ is its minimum, $w_i$ is the weight coefficient of the network obtained at the i-th optimization iteration, $\varepsilon_1$ is the fluctuation convergence threshold of the loss function set, and $\varepsilon_2$ is the mean convergence threshold of the loss function set.
After the optimization is completed, the operation variables α of the neural network are ranked; the single operation with the highest α value (or the top several operations with higher α values) is selected to compose the final convolutional neural network, and the weight coefficients w are retrained on the training set to obtain the trained neural network.
Then, an RGB-D image shot by the depth camera at the end of the robot is input into the grabbing posture generation network with optimal parameters, which outputs three single-channel characteristic images, of the same length and width as the input image, giving the grabbing quality Q, grabbing angle Φ and grabbing opening W corresponding to each pixel point. The pixel point with the maximum grabbing quality Q is then selected from these images, its position is taken as the central position of the grabbing frame, and the upper computer controls the robot and the mechanical claw to grab the object, as shown in fig. 5.
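Selecting the grab from the three output maps reduces to an argmax over Q; a minimal sketch follows (the mapping of the chosen pixel to a robot pose via hand-eye calibration is omitted):

```python
import numpy as np

def select_grasp(Q, Phi, W):
    """Pick the pixel with the highest grabbing quality and read off
    the grab parameters at that location; (u, v) is then mapped to a
    robot pose through the hand-eye calibration (not shown)."""
    v, u = np.unravel_index(np.argmax(Q), Q.shape)
    return {"pixel": (u, v), "angle": float(Phi[v, u]),
            "width": float(W[v, u]), "quality": float(Q[v, u])}
```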
In order to verify the superiority of the method, grabbing with the present method (DARTS-RG) is compared with an existing neural network structure optimization algorithm based on an evolutionary algorithm (GA-RG). The evolutionary algorithm is a typical discrete optimization algorithm: it requires a large amount of optimization time, but can explore neural network structures more freely. Table 1 compares DARTS-RG and GA-RG in robot intelligent grabbing; the optimization time of the present invention is much shorter than that of the GA-RG algorithm both with and without the delay factor, and without the delay factor the grabbing accuracy of the method is also higher than that of the GA-RG algorithm.
TABLE 1 comparison of GA-RG and DARTS-RG Performance
(Table 1 values appear only as an image in the original publication.)
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention and is not intended to limit the invention. Although the invention has been described in detail with reference to the foregoing examples, those skilled in the art may still modify the described embodiments or substitute equivalents for some of their elements. All modifications and equivalents that come within the spirit and principle of the invention are intended to be included within its scope.

Claims (8)

1. A robot intelligent grabbing method based on convolutional neural network differentiable structure searching is characterized by comprising the following steps:
(1) constructing a training set and a verification set of a grabbing posture generation network, wherein the training set and the verification set each comprise network inputs and outputs, the network input being an RGB-D image and the output being the grabbing quality Q, grabbing angle Φ and grabbing opening W corresponding to each pixel point;
(2) constructing a chain search space consisting of a plurality of nodes, and determining candidate convolution calculation operation among the nodes;
(3) relaxing the discrete chained search space to continuous;
(4) establishing a gradient-based neural network double-layer optimization model that takes both the calculation speed and the accuracy of the neural network as optimization targets, wherein the double-layer model comprises an inner-layer optimization and an outer-layer optimization: the inner-layer optimization trains all weight coefficients w of the neural network on the training set, and the outer-layer optimization trains the neural network operation variables α on the verification set given the trained weight coefficients w*;
then selecting the operation variables α to compose a convolutional neural network, and retraining the weight coefficients on the training set to obtain a grabbing attitude generation network with optimal parameters;
(5) inputting an RGB-D image shot by a depth camera at the end of the robot into the grabbing gesture generation network with optimal parameters, and outputting three single-channel characteristic images, of the same length and width as the input image, giving the grabbing quality Q, grabbing angle Φ and grabbing opening W corresponding to each pixel point;
(6) selecting the pixel point with the largest grabbing quality Q from the images obtained in step (5), taking its position as the central position of the grabbing frame, and controlling the robot and the mechanical claw through the upper computer to grab the object.
2. The robot intelligent grabbing method based on convolutional neural network differentiable structure search according to claim 1, wherein the training set and the verification set in step (1) are obtained from an existing robot intelligent grabbing data set that provides RGB-D images together with grabbing frames of successful grabs, and the grabbing quality Q, grabbing angle Φ and grabbing opening W of each pixel point are generated by the following preprocessing of the three characteristics:
equally dividing each grabbing frame into three parts along the grabbing width direction; for the part located in the center, filling the grabbing quality q with 1, filling the rotation angle Φ of the grabbing frame relative to the picture, and filling the grabbing opening w, wherein Φ ∈ [−π/2, π/2] and w ∈ [0, 150]; the two parts located on both sides are filled with grabbing quality q = 0.
3. The robot intelligent grabbing method based on convolutional neural network differentiable structure search according to claim 1, wherein the chain search space is composed of a plurality of nodes, each node representing an intermediate result after a calculation operation; the nodes are connected by directed arrow lines, each of which represents all possible candidate neural network calculation operations; a neural network calculation operation refers to a calculation between two nodes using convolution kernels of different sizes and different numbers of convolution layers; the nodes are connected in a chain, which maximizes the use of computing resources and accelerates the convergence of the optimization algorithm.
4. The convolutional neural network differentiable structure search based robot intelligent grasping method according to claim 1, wherein the step (3) of relaxing the discrete chain search space to continuous is realized by the following steps:
assigning a normalized continuous variable α to the operations between the original nodes, so that the discrete operations are expressed by the continuous variable α; specifically, each directed arrow line between nodes is multiplied by its corresponding variable α, and the results are summed as the final calculation result:

$$x_j = \sum_{i=1}^{n_j} \frac{e^{\alpha_i^{(j)}}}{\sum_{k=1}^{n_j} e^{\alpha_k^{(j)}}}\, o_i^{(j)}(x_{j-1}) \tag{1}$$

where e is the base of the natural logarithm, $x_j$ is the output of the j-th node, $\alpha_i^{(j)}$ is the i-th operation variable in the j-th node, $o_i^{(j)}$ is the i-th operation equation in the j-th node, and $n_j$ is the number of operations contained in the j-th node.
5. The method for robot intelligent grabbing based on convolutional neural network differentiable structure search of claim 1,
the inner-layer optimization trains all weight coefficients w of the neural network by computing a loss function on the training set, with the operation variables α between all neural network nodes fixed;
the outer-layer optimization trains the neural network operation variables α by computing a loss function on the verification set, given the trained weight coefficients w*;
the specific calculation is given by formulas (2) and (3), where formula (3) is the inner optimization function and formula (2) is the outer optimization function; in order to take both calculation accuracy and time as optimization targets, a delay factor is introduced into the outer optimization function, which adjusts the loss function by the quotient of the floating-point operation count of the current neural network and that of the target neural network:

$$\min_{\alpha}\ \mathcal{L}_{val}\bigl(w^*(\alpha),\,\alpha\bigr)\cdot\Bigl(\tfrac{F(m)}{T}\Bigr)^{\omega} \tag{2}$$

$$\text{s.t.}\quad w^*(\alpha)=\arg\min_{w}\ \mathcal{L}_{train}(w,\,\alpha) \tag{3}$$

where $\mathcal{L}_{val}$ is the loss function calculated on the verification set, $\mathcal{L}_{train}$ is the loss function calculated on the training set, F is the equation of the floating-point operation count of the neural network, m is the neural network structure obtained through discretization, T is the set target floating-point operation count, and ω is a constant that controls the magnitude of the delay factor effect.
6. The robot intelligent grabbing method based on convolutional neural network differentiable structure search according to claim 5, wherein in step (4), in order to enable the outer function to determine the correct convergence gradient more quickly, the inner function is iterated until it approaches convergence before the outer function is updated; meanwhile, convergence is judged by observing the loss sets obtained over multiple iterations of the inner function, which prevents the optimization from stopping at a local optimum; the convergence criterion of the inner function is defined as follows:

$$\frac{G_k^{\max}-G_k^{\min}}{\bar{G}_k}\le\varepsilon_1,\qquad \frac{\bigl|\bar{G}_k-\bar{G}_{k-1}\bigr|}{\bar{G}_{k-1}}\le\varepsilon_2$$

with

$$G_k=\bigl\{\mathcal{L}_{train}(w_i)\ \big|\ i=(k-1)N+1,\dots,kN\bigr\},\quad \bar{G}_k=\frac{1}{N}\sum_{\ell\in G_k}\ell,\quad G_k^{\max}=\max G_k,\quad G_k^{\min}=\min G_k$$

where N is the number of inner-function calculations in each group, $G_k$ is the k-th set of inner-function loss values, $\bar{G}_k$ is the mean of the k-th loss value set, $G_k^{\max}$ is its maximum, $G_k^{\min}$ is its minimum, $w_i$ is the weight coefficient of the network obtained at the i-th optimization iteration, $\varepsilon_1$ is the fluctuation convergence threshold of the loss function set, and $\varepsilon_2$ is the mean convergence threshold of the loss function set.
7. The robot intelligent grabbing method based on convolutional neural network differentiable structure search according to claim 6, wherein in step (4), after the optimization is completed, the operation variables α of the neural network are ranked; the single operation with the highest α value (or the top several operations with higher α values) is selected to compose the final convolutional neural network, and the weight coefficients w are retrained on the training set to obtain the trained neural network.
8. The robot intelligent grabbing method based on convolutional neural network differentiable structure search according to claim 1, wherein the physical grabbing environment of the robot comprises a physical robot, two parallel adaptive clamping jaws, a depth camera and a set of objects to be grabbed; the two parallel adaptive clamping jaws and the depth camera are fixed at the end of the physical robot, and their relative positions remain unchanged during motion; the two parallel adaptive clamping jaws are perpendicular to the grabbing plane.
CN202110802383.4A 2021-07-15 2021-07-15 Robot intelligent grabbing method based on convolutional neural network differentiable structure searching Active CN113326666B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110802383.4A CN113326666B (en) 2021-07-15 2021-07-15 Robot intelligent grabbing method based on convolutional neural network differentiable structure searching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110802383.4A CN113326666B (en) 2021-07-15 2021-07-15 Robot intelligent grabbing method based on convolutional neural network differentiable structure searching

Publications (2)

Publication Number Publication Date
CN113326666A 2021-08-31
CN113326666B CN113326666B (en) 2022-05-03

Family

ID=77426450

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110802383.4A Active CN113326666B (en) 2021-07-15 2021-07-15 Robot intelligent grabbing method based on convolutional neural network differentiable structure searching

Country Status (1)

Country Link
CN (1) CN113326666B (en)

Citations (7)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108510062A (en) * 2018-03-29 2018-09-07 东南大学 A kind of robot irregular object crawl pose rapid detection method based on concatenated convolutional neural network
WO2020075423A1 (en) * 2018-10-10 2020-04-16 ソニー株式会社 Robot control device, robot control method and robot control program
US20200164505A1 (en) * 2018-11-27 2020-05-28 Osaro Training for Robot Arm Grasping of Objects
US20210023720A1 (en) * 2018-12-12 2021-01-28 Cloudminds (Shenzhen) Robotics Systems Co., Ltd. Method for detecting grasping position of robot in grasping object
EP3812107A1 (en) * 2019-10-21 2021-04-28 Canon Kabushiki Kaisha Robot control device, and method and program for controlling the same
CN111360862A (en) * 2020-02-29 2020-07-03 华南理工大学 Method for generating optimal grabbing pose based on convolutional neural network
CN112297013A (en) * 2020-11-11 2021-02-02 浙江大学 Robot intelligent grabbing method based on digital twin and deep neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘飞香: "Research on intelligent grasping and assembly of segment erectors" (管片拼装机抓取和拼装智能化研究), 《铁道建筑》 *
王斌等: "Research on robot grasp detection algorithms based on depth images and deep learning" (基于深度图像和深度学习的机器人抓取检测算法研究), 《中国博士学位论文全文数据库 信息科技辑》 *

Also Published As

Publication number Publication date
CN113326666B (en) 2022-05-03

Similar Documents

Publication Publication Date Title
Karaoguz et al. Object detection approach for robot grasp detection
Sadeghi et al. Sim2real viewpoint invariant visual servoing by recurrent control
CN111079561B (en) Robot intelligent grabbing method based on virtual training
CN110298886B (en) Dexterous hand grabbing planning method based on four-stage convolutional neural network
CN110000785B (en) Agricultural scene calibration-free robot motion vision cooperative servo control method and equipment
JP6514171B2 (en) Machine learning apparatus and method for learning an optimal article gripping path
CN112297013B (en) Robot intelligent grabbing method based on digital twin and deep neural network
CN110125930B (en) Mechanical arm grabbing control method based on machine vision and deep learning
CN110785268B (en) Machine learning method and device for semantic robot grabbing
JP6964857B2 (en) Image recognition device, image recognition method, computer program, and product monitoring system
CN112605983B (en) Mechanical arm pushing and grabbing system suitable for intensive environment
CN110238840B (en) Mechanical arm autonomous grabbing method based on vision
JP6671694B1 (en) Machine learning device, machine learning system, data processing system, and machine learning method
CN113172629B (en) Object grabbing method based on time sequence tactile data processing
CN113341706B (en) Man-machine cooperation assembly line system based on deep reinforcement learning
CN112686282A (en) Target detection method based on self-learning data
CN111476771A (en) Domain self-adaptive method and system for generating network based on distance countermeasure
CN114789454B (en) Robot digital twin track completion method based on LSTM and inverse kinematics
Huang et al. Grasping novel objects with a dexterous robotic hand through neuroevolution
Yan et al. Learning probabilistic multi-modal actor models for vision-based robotic grasping
CN111300431B (en) Cross-scene-oriented robot vision simulation learning method and system
CN114387513A (en) Robot grabbing method and device, electronic equipment and storage medium
CN113326666B (en) Robot intelligent grabbing method based on convolutional neural network differentiable structure searching
CN111496794B (en) Kinematics self-grabbing learning method and system based on simulation industrial robot
Lu et al. Active pushing for better grasping in dense clutter with deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant