CN109934864B - Residual error network deep learning method for mechanical arm grabbing pose estimation - Google Patents


Info

Publication number
CN109934864B
Authority
CN
China
Prior art keywords
grabbing
residual
convolution
mechanical arm
filters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910192296.4A
Other languages
Chinese (zh)
Other versions
CN109934864A (en)
Inventor
白帆
姚仁杰
陈懋宁
崔哲新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN201910192296.4A priority Critical patent/CN109934864B/en
Publication of CN109934864A publication Critical patent/CN109934864A/en
Application granted granted Critical
Publication of CN109934864B publication Critical patent/CN109934864B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a residual network deep learning method for estimating the grabbing pose of a mechanical arm, which comprises the following steps: initializing the mechanical arm and adjusting its wrist camera so that the camera is located at a known height vertically above the XOY plane; acquiring a depth image of the object to be grabbed by the mechanical arm; mapping the depth image with a pre-trained improved GG-CNN model and outputting four 300 × 300 pixel grabbing information images, comprising the grabbing success rate, the grabbing angle cosine value, the grabbing angle sine value and the grabbing width; further acquiring the grabbing angle and width information at the position with the highest success rate; and obtaining, through coordinate transformation of the grabbing information taken from the grabbing success rate image, the grabbing angle and grabbing width of the target object in the mechanical arm base coordinate system. In the improved GG-CNN model, a residual network is built from residual building modules, which enhances the fitting ability and learning capacity of the convolutional neural network, so that the generated grabbing pose is more precise.

Description

Residual error network deep learning method for mechanical arm grabbing pose estimation
Technical Field
The invention belongs to the field of information control technology, and particularly relates to a residual network deep learning method for estimating the grabbing pose of a mechanical arm.
Background
In recent years, vision-based mechanical arm grabbing has become a hotspot of current research. Generally, accurate target detection and positioning must be achieved before a grabbing action can be performed. Traditional target detection is usually static detection of a single target, and it is affected by changes in object shape, size and viewing angle and by changes in external illumination, so the extracted features generalize poorly and the detection is not robust. The development of deep learning algorithms has advanced the task of target detection and localization. It is generally accepted in the research community that deep networks work better than shallow networks, but a network cannot be made usefully deeper simply by stacking layers: deep networks are difficult to train because of the vanishing gradient problem. In 2015, the concept of the residual network (ResNet) was proposed to solve this accuracy degradation problem, and extremely deep residual networks obtained very good results on the ImageNet classification dataset.
The combination of mechanical arm visual grabbing and deep learning is the main direction of current research on mechanical arm grabbing. Recently, researchers have proposed studying the optimal grabbing pose of objects by constructing a grabbing-generation convolutional neural network (GG-CNN), in which the pixels of the input depth image correspond to the pixels of the output grabbing information images, so as to predict the optimal grabbing pose of complex objects. However, GG-CNN pursues recognition and grabbing speed to such an extent that the recognition accuracy of the neural network is reduced, which limits the application of this network model to mechanical arm grabbing.
Therefore, how to improve the recognition accuracy of the GG-CNN applied to the mechanical arm grabbing pose estimation becomes a problem to be solved at present.
Disclosure of Invention
The invention aims to provide a residual error network deep learning method for mechanical arm grabbing pose estimation, which can effectively improve the generation precision of the optimal grabbing pose of a mechanical arm and enables a GG-CNN model to have higher practicability in the field of high-precision grabbing.
In order to achieve the purpose, the invention adopts the main technical scheme that:
the invention provides a residual error network deep learning method for estimating a grabbing pose of a mechanical arm, which comprises the following steps:
S1, initializing the mechanical arm, and adjusting the mechanical arm so that the wrist camera is located at a known height vertically above the XOY plane;
s2, obtaining a depth image of an object to be grabbed by the mechanical arm;
S3, cutting out the central part of the depth image to obtain an object depth image of 300 × 300 pixels;
s4, mapping the object depth image by adopting a pre-trained improved GG-CNN model, and outputting four grabbing information images of 300 x 300 pixels, wherein the grabbing information images comprise a grabbing success rate, a grabbing angle cosine value, a grabbing angle sine value and a grabbing width;
S5, selecting the pixel point with the highest value in the grabbing success rate image, and reading the corresponding pixel points in the grabbing angle cosine value, grabbing angle sine value and grabbing width information images, so as to obtain the grabbing angle and width information with the highest grabbing success rate as the grabbing information;
s5, acquiring the grabbing angle and the grabbing width of the target object under a mechanical arm base coordinate system (Cartesian coordinate system) through coordinate transformation of a wrist camera, a mechanical arm wrist and a mechanical arm base according to the grabbing information acquired from the grabbing success rate image;
s6, inputting grabbing information, and controlling the mechanical arm to grab (namely outputting the grabbing position, angle and width of the target object to be grabbed after coordinate transformation so as to control the mechanical arm to grab the target object);
the improved GG-CNN model is characterized in that a residual error network is built in the existing GG-CNN model by building a residual error module, and the fitting effect and the learning capacity of a convolutional neural network are enhanced, so that the grabbing accuracy of the grabbing pose generated by the improved GG-CNN model is higher, the improved GG-CNN model is more sensitive to the change of the position and the shape of an object, and the improved GG-CNN model has practical application value.
Optionally, before step S1, the method comprises:
S0-1, creating, based on an existing data set, a first data set G_train for training the inputs and outputs of the improved GG-CNN model; the first data set comprises images marked with positive grabbing information and images marked with negative grabbing information, and the images in the first data set carry a plurality of marked grabbing frames;
s0-2, improving the existing GG-CNN model by constructing a residual error module and constructing a residual error network so as to construct an improved GG-CNN model and ensure that the sizes of input and output images of the improved GG-CNN model are unchanged;
S0-3, training the residual-improved GG-CNN model using the first data set G_train to obtain the trained improved GG-CNN model.
Optionally, the GG-CNN model refined by residual error comprises:
a convolution part, a deconvolution part and an output part;
the convolution portion includes ten residual modules,
wherein the first residual module comprises: 1 convolution residual module with pooling layer, the parameters in the convolution residual module including: 4 filters with step size of 3 × 3;
the second residual module includes: 5 identity residual modules, wherein the parameters of the identity residual modules comprise: 4 filters with step size of 1 × 1;
the third residual module includes: 1 convolution residual module with pooling layer, the parameters in the convolution residual module including: 8 filters with step size of 2 × 2;
the fourth residual module includes: 5 identity residual modules, wherein the parameters of the identity residual modules comprise: 8 filters with step size of 1 × 1;
the fifth residual module includes: 1 convolution residual module with pooling layer, the parameters in the convolution residual module including: 16 filters with step size of 2 × 2;
the sixth residual module includes: 5 identity residual modules, wherein the parameters of the identity residual modules comprise: 16 filters with step size of 1 × 1;
the seventh residual module includes: 1 convolution residual module with pooling layer, the parameters in the convolution residual module including: 32 filters with step size of 5 × 5;
the eighth residual module includes: 5 identity residual modules, wherein the parameters of the identity residual modules comprise: 32 filters with step size of 1 × 1;
the ninth residual module includes: 1 convolution residual module with pooling layer, the parameters in the convolution residual module including: 64 filters with step size of 1 × 1;
the tenth residual module includes: 5 identity residual modules, wherein the parameters of the identity residual modules comprise: 64 filters with step size of 1 × 1;
the deconvolution part comprises 5 deconvolution layers with different parameters;
the number of filters of the first deconvolution layer is 64, the size of each filter is 3 × 3, and the step size is 1 × 1;
the number of filters of the second deconvolution layer is 32, the size of each filter is 5 × 5, and the step size is 5 × 5;
the number of filters of the third deconvolution layer is 16, the size of each filter is 5 × 5, and the step size is 2 × 2;
the number of filters of the fourth deconvolution layer is 8, the size of each filter is 7 × 7, and the step size is 2 × 2;
the number of the fifth deconvolution layer filters is 4, the size of each filter is 9 × 9, and the step size is 3 × 3;
the output part comprises four linearly mapped convolution layers, each convolution layer comprises 1 filter, and the four linearly mapped convolution layers sequentially and respectively map and output the grabbing success rate, the cosine value of the grabbing angle, the sine value of the grabbing angle and the grabbing width.
Optionally, in step S0-3, the following cross-over ratio formula is used to measure the capture accuracy of the GG-CNN network improved by the residual error;
intersection-over-union formula:
IoU(C, G) = |C ∩ G| / |C ∪ G|
wherein C and G respectively represent two known areas, and the intersection-over-union is computed as the ratio of the intersection to the union of the two areas.
Optionally, the depth image in S2 is I ∈ R^(H×W), where H is the height and W is the width, and the grabbing description in the depth image is:
g̃ = (o, φ̃, ω̃, q)
wherein o = (u, v) is the position coordinate of the pixel with the highest grabbing success rate, φ̃ is the rotation angle in the camera reference frame, and ω̃ is the grabbing width in image coordinates; through the coordinate transformation of the mechanical arm, the grabbing g̃ in image space is converted into the grabbing g in world coordinates:
g = T_RC(T_CI(g̃))
where T_RC is the coordinate transformation from the camera coordinate system to the mechanical arm coordinate system, and T_CI is the calibration transformation based on the camera internal parameters and the hand-eye position between the mechanical arm and the camera;
the output image in S4 is represented as: G = (Φ, W, Q) ∈ R^(3×H×W)
where Φ, W and Q are each ∈ R^(H×W), representing the grabbing angle, the grabbing width and the grabbing success rate respectively; the grabbing angle Φ is split into a grabbing angle cosine value and a grabbing angle sine value, and at the coordinate o with the highest grabbing success rate, Φ, W and Q contain the corresponding values of φ̃, ω̃ and q;
in step S4, the pre-trained improved GG-CNN model performs the mapping on the object depth image, specifically: G = M(I);
the optimal grabbing pose in image space is determined from G:
g̃_best = max_Q G
specifically, from the output grabbing information G, the pixel with the largest grabbing success rate q in the Q image is selected, and its coordinate o is used to look up Φ and W in the output G, giving the position, angle and width of the optimal grabbing pose;
further, the optimal grabbing pose g_best in world coordinates is calculated through
g_best = T_RC(T_CI(g̃_best)).
Optionally, the processing procedure of each residual module of the convolution part includes:
each residual module comprises a main path and an auxiliary path;
the auxiliary path consists of two alternatives: a path that applies pooling and convolution operations, and a shortcut path with no operation;
specifically, the main path comprises:
1) the input data X is first regularized, then passes through an activation layer using the ReLU activation function, and is finally output to the next layer through a filter and convolution layer;
2) the output of the previous layer is regularized, passes through an activation layer using the ReLU activation function, and finally passes through a filter and convolution layer to output F(X);
the auxiliary path comprises:
1) when the module's pooling parameter is true: the input data X passes through a maximum pooling layer, then through a convolution layer with a filter size of 5 × 5, the given number of filters and a step size of 1 × 1, and W(X) is output;
2) when the module's pooling parameter is false: X is output directly without any operation;
the outputs of the main path and the selected auxiliary path are added to form the overall output H(X) of the residual module function.
The invention has the beneficial effects that:
compared with the prior art, the method provided by the invention can improve the generation precision of the optimal grabbing pose of the mechanical arm, so that the improved GG-CNN model in the method provided by the invention has higher practicability in the field of high-precision grabbing.
That is to say, the method first constructs a convolution residual module, builds a residual network by stacking multiple layers of the residual module, thereby deepening the convolutional neural network, and uses this residual structure as the main part of the improved GG-CNN. The invention improves the GG-CNN model, increases the accuracy of generating the optimal grabbing pose of the mechanical arm, and makes the network model more practical in the field of high-accuracy grabbing.
Drawings
FIG. 1 is a flow chart of the residual network deep learning method for mechanical arm grabbing pose estimation of the invention;
FIG. 2 is a schematic diagram of a Cartesian space and image space depiction of the present application;
FIG. 3 is a schematic diagram of the prior-art Cornell University grabbing dataset;
FIG. 4 is a schematic diagram of a training data set generation process in the present application;
FIG. 5 is a schematic diagram of a GG-CNN structure in the prior art;
FIG. 6 is a schematic diagram of a portion of the structure used in constructing the residual module of the present application;
FIG. 7 is a schematic diagram of constructing an identity residual block in the present application;
FIG. 8 is a diagram of a convolutional residual block in the present application;
FIG. 9 is a diagram of residual block functions in the present application;
FIG. 10 is a block diagram of the GG-CNN model improved by residual error in the present application;
FIG. 11 compares the accuracy curves of the models of FIG. 5 and FIG. 10;
FIG. 12 compares the output effects of the models before and after the improvement, i.e., of the models of FIG. 5 and FIG. 10.
Detailed Description
For the purpose of better explaining the present invention and to facilitate understanding, the present invention will be described in detail by way of specific embodiments with reference to the accompanying drawings.
The problem of autonomous grasping of mechanical arms is an important problem in the field of robot research. Aiming at the problem of optimal grabbing pose, the method and the system endow the mechanical arm with vision and combine a deep learning algorithm to realize the intellectualization of mechanical arm grabbing.
In the application, the grabbing-generation convolutional neural network (GG-CNN) is improved by adopting the idea of residual networks: a convolution residual module (shown in FIG. 9) is first built, a residual network is built by stacking multiple layers of the residual module, the depth of the convolutional neural network is deepened, and this residual network is used as the main part of the improved GG-CNN. In this way the GG-CNN is improved through a deep residual network, and the accuracy of the optimal grabbing pose generation model of the mechanical arm is improved. Experimental results show that the accuracy of the GG-CNN model improved with the residual network reaches 88%, far higher than the 72% accuracy of the original model; the accuracy of predicting the optimal grabbing pose of the mechanical arm with this model is therefore greatly improved, and the model has scientific research significance and application value in the field of mechanical arm visual grabbing.
Fig. 1 illustrates a method provided by an embodiment of the present invention, which may include the following steps:
S1, initializing the mechanical arm, and adjusting the mechanical arm so that the wrist camera is located at a known height vertically above the XOY plane.
In the present embodiment, the description is made with reference to the wrist camera of the robot arm, but in practical application, the wrist camera is not limited thereto, and any camera located at the upper portion of the robot arm and used in cooperation with the robot arm may be used.
S2, obtaining a depth image of the object to be grabbed by the mechanical arm.
And S3, cutting out the central part of the depth image to obtain an object depth image of 300 × 300 pixels.
The present embodiment does not limit the manner of cropping the depth image, but needs to retain a major portion of the target object of the depth image.
S4, mapping the object depth image by adopting a pre-trained improved GG-CNN model, and outputting four grabbing information images of 300 x 300 pixels, wherein the grabbing information images comprise a grabbing success rate, a grabbing angle cosine value, a grabbing angle sine value and a grabbing width;
S5, selecting the pixel point with the highest value in the grabbing success rate image, and reading the corresponding pixel points in the grabbing angle cosine value, grabbing angle sine value and grabbing width information images, so as to obtain the grabbing angle and width information with the highest grabbing success rate as the grabbing information;
s5, acquiring the grabbing angle and the grabbing width of the target object under a mechanical arm base coordinate system (Cartesian coordinate system) through coordinate transformation of a wrist camera, a mechanical arm wrist and a mechanical arm base according to the grabbing information acquired from the grabbing success rate image;
s6, inputting grabbing information, and controlling the mechanical arm to grab (namely outputting grabbing positions, angles and widths of the target object to be grabbed after coordinate transformation so as to control the mechanical arm to grab the target object);
the improved GG-CNN model is characterized in that a residual error network is built in the existing GG-CNN model through building a residual error module, and the fitting function and the learning capability of a convolutional neural network are enhanced, so that the grabbing accuracy of the grabbing pose generated by the improved GG-CNN model is higher, the grabbing pose is more sensitive to the change of the position and the shape of an object, and the improved GG-CNN model has practical application value.
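Purely for illustration, the steps S2–S6 above can be chained together along the following lines in Python; the helper names select_best_grasp and image_grasp_to_robot are not from the patent and are sketched further below, the assumed output ordering of the model (Q, cos 2Φ, sin 2Φ, W) is likewise an assumption, and calib stands for the camera intrinsics and hand-eye calibration.

```python
# Illustrative sketch of steps S2-S6; not the patent's own implementation.
# Assumptions: `model` is a trained improved GG-CNN (Keras) whose four
# 300x300 outputs are ordered (Q, cos2phi, sin2phi, W); `select_best_grasp`
# and `image_grasp_to_robot` are hypothetical helpers sketched further below;
# `calib` bundles the camera intrinsics and hand-eye calibration.

def plan_grasp(model, depth_image, calib):
    # S3: crop the central 300x300 region of the depth image
    h, w = depth_image.shape
    top, left = (h - 300) // 2, (w - 300) // 2
    crop = depth_image[top:top + 300, left:left + 300]

    # S4: map the cropped depth image to the four grasp-information images
    q, cos_2phi, sin_2phi, width = model.predict(crop[None, :, :, None])

    # S5: pixel with the highest grasp success rate, plus its angle and width
    (v, u), angle, grasp_width = select_best_grasp(
        q[0, ..., 0], cos_2phi[0, ..., 0], sin_2phi[0, ..., 0], width[0, ..., 0])

    # S5/S6: transform the image-space grasp into the mechanical arm base frame
    position = image_grasp_to_robot(u + left, v + top, crop[v, u], calib)
    return position, angle, grasp_width
```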
For a better understanding of the aspects of the present application, the following description is made with reference to the accompanying drawings.
1. Gripping scheme based on GG-CNN
1.1 defining grab parameters and transformations
The present application studies the problem of detecting and grabbing unknown objects perpendicular to a plane, as shown in FIG. 2, with a depth camera acquiring depth images of the given scene.
The grabbing is performed perpendicular to the XOY plane (i.e. the plane of the mechanical arm base coordinate system, referred to as the robot coordinate system), and in this embodiment a grab may be defined as:
g = (p, φ, ω, q)
Using these pose parameters a grabbing action can be determined: the position p = (x, y, z) is the center of the gripper in Cartesian coordinates, the pose comprises the rotation angle φ of the end effector about the z-axis and the required width ω, and the grabbing success rate q represents the probability of a successful grab.
The internal parameters of the camera used in the present application are known, whereby a depth image I ∈ R^(H×W) of height H and width W is acquired, and grabbing is detected on the depth image I. The grabbing description in the image I is:
g̃ = (o, φ̃, ω̃, q)
wherein o = (u, v) is the position coordinate of the pixel with the highest grabbing success rate, φ̃ is the rotation angle in the camera reference frame (i.e. the aforementioned wrist camera of the mechanical arm), and ω̃ is the grabbing width in image coordinates. Through the coordinate transformation of the mechanical arm, the grabbing g̃ in image space can be converted into the grabbing g in world coordinates:
g = T_RC(T_CI(g̃))
T_RC is the coordinate transformation from the camera coordinate system to the mechanical arm coordinate system, and T_CI is the calibration transformation, based on the camera internal parameters and the hand-eye position between the mechanical arm and the camera, that converts from 2D image coordinates to the 3D camera coordinate system.
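As a concrete illustration of g = T_RC(T_CI(g̃)), the following sketch back-projects the pixel o = (u, v) at the measured depth into the camera frame using a pinhole model and then applies a 4 × 4 camera-to-robot matrix obtained from hand-eye calibration; the pinhole back-projection and the variable names are assumptions made for illustration, not details taken from the patent.

```python
# Minimal sketch of g = T_RC(T_CI(g_tilde)): pixel + measured depth -> camera
# frame -> mechanical arm base frame. Assumes pinhole intrinsics
# (fx, fy, cx, cy) and a 4x4 homogeneous camera-to-robot transform T_RC from
# hand-eye calibration; these names are illustrative, not from the patent.
import numpy as np

def image_grasp_to_robot(u, v, depth, calib):
    fx, fy, cx, cy, T_RC = calib
    # T_CI: back-project the pixel (u, v) at the measured depth into the
    # 3D camera coordinate system
    p_cam = np.array([(u - cx) * depth / fx,
                      (v - cy) * depth / fy,
                      depth,
                      1.0])
    # T_RC: express the grasp point in the mechanical arm base coordinate system
    p_robot = T_RC.dot(p_cam)
    return p_robot[:3]
```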
In addition, a group of grabbings in image space is referred to as a grabbing map, which is represented as
G = (Φ, W, Q) ∈ R^(3×H×W)
where Φ, W and Q are each ∈ R^(H×W), representing the grabbing angle, the grabbing width and the grabbing success rate respectively; the grabbing angle Φ is split into a grabbing angle cosine value and a grabbing angle sine value, and at the coordinate o with the highest grabbing success rate, Φ, W and Q contain the corresponding values of φ̃, ω̃ and q.
In an ideal case, the grab value of each pixel in the depth image I may be directly calculated, instead of randomly sampling the input image. To this end, the function M in the depth image (or called mapping M/mapping function M) is defined as the transformation from the input depth image to the capture information image:
G=M(I)
The optimal grabbing pose in image space can be calculated from G:
g̃_best = max_Q G
Specifically, from the output grabbing information G, the pixel with the largest grabbing success rate q in the Q image is selected, and its coordinate o is used to look up Φ and W in the output G, giving the position, angle and width of the optimal grabbing pose. The optimal grabbing pose g_best in world coordinates is then calculated by the equation
g_best = T_RC(T_CI(g̃_best)).
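The selection of g̃_best can be written compactly with numpy, as in the sketch below (the select_best_grasp helper assumed earlier); decoding the angle from the sine/cosine images and rescaling the width by 150 follow the output definitions and the 1/150 training-time scaling described later in this text, and the code itself is an illustration rather than the patent's implementation.

```python
# Sketch of selecting the best image-space grasp from the grasp map G.
# Inputs are the four output images; the x150 rescaling assumes the 1/150
# width encoding used when the training targets are built (see below).
import numpy as np

def select_best_grasp(q, cos_2phi, sin_2phi, width):
    # coordinate o = (u, v) of the pixel with the highest success rate q
    v, u = np.unravel_index(np.argmax(q), q.shape)
    # grasp angle: the network predicts cos(2*phi) and sin(2*phi)
    angle = 0.5 * np.arctan2(sin_2phi[v, u], cos_2phi[v, u])
    # grasp width in image pixels (undo the [0, 1] scaling)
    grasp_width = 150.0 * width[v, u]
    return (v, u), angle, grasp_width
```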
1.2 neural network approximate mapping relationship
It is to be understood that the following detailed description describes how the improved GG-CNN is used to determine the mapping function M.
The functional mapping M: I → G is approximated using the grabbing-generation convolutional neural network (GG-CNN). The neural network is denoted M_λ, where λ are the weights obtained by training.
It can be shown that M_λ(I) = (Q_λ, Φ_λ, W_λ) ≈ M(I). Using an L2 loss function L, the network is learned and trained on the training-set inputs I_train and the corresponding outputs G_train as follows:
λ = argmin_λ L(G_train, M_λ(I_train))
where G_train is the set of grabbing parameters at the Cartesian points p corresponding to each pixel o.
The grab graph G represents a triplet of images: Φ, W, and Q. These parameters are expressed as follows:
q is an image describing the success rate of the grabbing performed at each point (u, v). This value is a scalar in the range of [0,1], where values close to 1 indicate a higher success rate of grabbing.
Φ is an image describing the angle of the grab performed at each point. Since an antipodal grab is symmetric around ±π/2 radians, the angle lies in the range [−π/2, π/2].
W is an image describing the end-effector width of the grab performed at each point. To maintain depth invariance, the value of W is kept in the range [0, 150] pixels; it can be converted to a physical measurement using the depth camera parameters and the measured depth.
1.3 Construction and training of GG-CNN
None of the existing datasets meets the training requirements of GG-CNN; in order to train the GG-CNN model, a dataset conforming to the inputs and outputs of GG-CNN is created from the Cornell University grabbing dataset (shown in FIG. 3). The Cornell University grabbing dataset contains 885 RGB-D images of real objects, with 5110 grabs labeled "positive" and 2909 labeled "negative". Although this is a relatively small grabbing dataset compared to some newer synthetic datasets, it best meets the pixel-wise grabbing requirements of the present application, since each image provides multiple labeled grabbing frames.
Random cropping, zooming and rotation are used to augment the Cornell University grabbing dataset, creating a set G_train of 8840 depth images and associated grabbing maps that effectively incorporates 51,100 grabbing examples.
The Cornell University grabbing dataset represents the object to be grabbed as a grabbing rectangle in pixel coordinates, thereby calibrating the position and rotation angle of the end effector. To move from the grabbing-rectangle representation to the image-based representation G, the central third of each grabbing rectangle is selected as the graspable image region, which corresponds to the position of the center of the end effector; every other region is assumed not to be a valid grab. The dataset generation process is shown in FIG. 4.
Grabbing success rate Q: whether each pixel in the Cornell University grabbing dataset belongs to a valid grab is treated as a binary label; the graspable regions of Q_train are set to 1 and all other pixels to 0.
Rotation angle Φ: the angle of each grabbing rectangle in the range [−π/2, π/2] is computed and the corresponding region of Φ_train is set. Using the raw angle directly would cause discontinuities and overly large values near ±π/2, so the angle is decomposed into two vector components on the unit circle, yielding values in the range [−1, 1]; since the antipodal grab is symmetric around ±π/2 radians, the two components sin(2Φ_train) and cos(2Φ_train) are used, which provide a unique value for Φ_train ∈ [−π/2, π/2].
Grabbing width W: similarly to the angle, the width (up to a maximum value) of each grabbing rectangle, representing the width of the gripper, is computed and the corresponding region of W_train is set. During training, W_train is scaled down by a factor of 1/150 so that its values lie in [0, 1]. The physical width of the end effector can be computed using the camera parameters and the measured depth.
Input depth image: since the Cornell University grabbing dataset was captured with a real camera, it already contains real sensor noise, so no noise needs to be added. The depth image is inpainted (repaired) using OpenCV to remove invalid values, and the mean value of each depth image is subtracted, centering its values at 0 to provide depth invariance.
Through the above definitions and operations, a dataset for training the GG-CNN model is generated from the Cornell University grabbing dataset.
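For illustration, one training sample could be encoded roughly as follows, following the Q, Φ and W definitions above; the grasp-rectangle objects (with angle and width attributes) and the center_third_mask helper are hypothetical, and OpenCV inpainting stands in for the invalid-value repair mentioned in the text.

```python
# Sketch of building one training sample (depth input plus Q, cos2phi,
# sin2phi, W targets) from labelled grasp rectangles, following the
# encodings described above. The rectangle objects (with .angle and .width)
# and center_third_mask() are hypothetical helpers, not from the patent.
import numpy as np
import cv2

def encode_sample(depth, grasp_rects, shape=(300, 300)):
    q = np.zeros(shape, dtype=np.float32)
    cos_2phi = np.zeros(shape, dtype=np.float32)
    sin_2phi = np.zeros(shape, dtype=np.float32)
    width = np.zeros(shape, dtype=np.float32)

    for rect in grasp_rects:                       # labelled "positive" rectangles
        mask = center_third_mask(rect, shape)      # graspable region: central third
        q[mask] = 1.0                              # success-rate label
        cos_2phi[mask] = np.cos(2.0 * rect.angle)  # angle encoded on the unit circle
        sin_2phi[mask] = np.sin(2.0 * rect.angle)
        width[mask] = rect.width / 150.0           # width scaled into [0, 1]

    # depth input: inpaint invalid values, then zero-centre for depth invariance
    invalid = (depth == 0).astype(np.uint8)
    depth = cv2.inpaint(depth.astype(np.float32), invalid, 3, cv2.INPAINT_NS)
    depth = depth - depth.mean()
    return depth, (q, cos_2phi, sin_2phi, width)
```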
The prior-art GG-CNN uses the function model M_λ(I) = (Q_λ, Φ_λ, W_λ) to generate an approximate grabbing information image G_λ directly from an input depth image I: a 300 × 300 depth image is taken as input, and the grabbing information images are finally obtained through three convolution layers and three deconvolution layers. The complete GG-CNN structure is shown in FIG. 5.
Since the GG-CNN shown in FIG. 5 has limited recognition and grabbing accuracy, the present application improves the GG-CNN structure of FIG. 5, as shown in FIG. 10; the improvement process is as follows.
2 improved GG-CNN model based on residual error network
Firstly, the idea of the residual error network is introduced, secondly, two basic modules (such as an identity residual block and a convolution residual block) are explained, and finally, the residual error module is constructed by combining the two basic modules, and the residual error network is constructed by utilizing the residual error module, wherein the structure is shown in fig. 10.
2.1 residual error network
The residual network borrows the cross-layer connection idea of the Highway Network and improves on it. By constructing shortcut connections in a residual block, the input X is transmitted directly to the output as an initial result, and the output becomes
H(X) = F(X) + X
When F(X) = 0, then H(X) = X, i.e. an identity mapping. ResNet thus changes the learning objective: instead of learning the complete output, the network learns the difference between the target value H(X) and X, the so-called residual:
F(X) = H(X) − X
The objective of subsequent training is therefore to drive the residual towards 0, so that the accuracy does not decrease as the network deepens.
The residual skip structure breaks the convention of traditional neural networks that the output of layer n−1 can only be used as the input of layer n, allowing the output of a layer to skip several layers and serve directly as the input of a later layer. Its significance is that it offers a new direction for the difficult problem that stacking many layers can cause the accuracy of the whole learning model to degrade rather than improve.
In ResNet (residual network), the shortcut links allow the gradient to propagate back to the layers further ahead, fig. 6 (a) shows the main path of the neural network, fig. 6 (b) adds a shortcut link to the main path, and by stacking these ResNet modules, a very deep neural network can be constructed.
Two main types of modules are used in ResNet (i.e., identity and convolution blocks), the choice of identity and convolution blocks depending largely on whether the input/output sizes are the same or different. If they are the same, then the identity residual block is used, otherwise the convolution residual block is used.
(1) Identity residual block
The identity residual block is a standard block used in ResNet, corresponding to the case where the input and output have the same dimensions.
The auxiliary path is a shortcut connection, and the convolutional layers constitute the main path. In FIG. 7, convolution and ReLU activation operations are also performed, and Batch Normalization is added to increase the training speed and prevent overfitting.
(2) Convolution residual block
The convolutional residual block of ResNet is another type of residual block that can be used when the input and output sizes do not match, as shown in fig. 8.
The convolutional layer in the shortcut path is used to resize the input X to different sizes in order to match the output sizes of the shortcut path and the main path.
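For reference, a generic ResNet-style convolutional residual block of the kind shown in FIG. 8 can be written in a few lines of Keras; the filter counts and kernel sizes below are illustrative only and are not the parameters of the module actually used in this application, which is described in section 2.2.

```python
# Generic ResNet-style convolutional residual block (cf. FIG. 8) in tf.keras.
# Filter counts and kernel sizes are illustrative; the module actually used
# in this application is the one sketched in section 2.2.
from tensorflow.keras import layers

def conv_residual_block(x, filters, strides=2):
    # main path: Conv -> BN -> ReLU -> Conv -> BN
    y = layers.Conv2D(filters, 3, strides=strides, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.BatchNormalization()(y)

    # shortcut path: a 1x1 convolution resizes X so the two paths match
    shortcut = layers.Conv2D(filters, 1, strides=strides, padding='same')(x)
    shortcut = layers.BatchNormalization()(shortcut)

    # add the two paths and apply the final activation
    return layers.Activation('relu')(layers.Add()([y, shortcut]))
```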
2.2 improving GG-CNN by introducing residual network
In the application, the idea of the residual error network is introduced into the GG-CNN, and the construction of a deeper neural network model is performed by constructing the residual error module, so that the accuracy of the gripping pose generated by the GG-CNN model is improved, and a better mechanical arm optimal gripping pose generation network is obtained. The constructed residual module structure is shown in fig. 9.
In the present application, the constructed residual module is divided into two major paths, namely a main path and an auxiliary path, wherein the auxiliary path is composed of two paths, namely a path adopting pooling and convolution operation and a shortcut path without operation.
For better illustration, assume that the input is X, and to distinguish the outputs of each path, they are named F (X), W (X), and H (X), respectively, and this section is mainly to explain a single residual block.
The operations on the main path include:
1) Referring to FIG. 9, the input X is first regularized, then passes through an activation layer using the ReLU activation function, and finally through a convolution layer whose number of filters is filters/2 (where filters is the input parameter of the module function) with a step size of 1 × 1, and is output to the next layer; the filter size is 3 × 3;
2) The output of the previous layer is regularized, passes through an activation layer using the ReLU activation function, and finally through a convolution layer whose number of filters is filters and whose step size is strides (where filters and strides are input parameters of the module function), and F(X) is output; the filter size is 5 × 5.
The operations on the auxiliary path include:
1) When the module function's pooling parameter is true: the input X passes through a maximum pooling layer of size strides (where strides is the input parameter of the module function), then through a convolution layer with filters filters and a step size of 1 × 1, and W(X) is output; the filter size is 5 × 5;
2) When the module function's pooling parameter is false: X is output directly without any operation.
The outputs of the main path and the selected auxiliary path are added to form the overall output H(X) of the residual module function.
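Read literally, the module described above could be interpreted in Keras roughly as follows; the pre-activation ordering (regularization → ReLU → convolution), the filters/2 first convolution, the 5 × 5 kernels and the pooled auxiliary path all follow the text, but this remains an interpretation of the description rather than the inventors' own code.

```python
# Rough Keras interpretation of the residual module described above.
# Main path: BN -> ReLU -> Conv(filters/2, 3x3, stride 1), then
# BN -> ReLU -> Conv(filters, 5x5, stride `strides`).
# Auxiliary path: MaxPool(strides) -> Conv(filters, 5x5, stride 1) when
# pooling is true, otherwise the unmodified input X.
# This is an interpretation of the text, not the inventors' own code.
from tensorflow.keras import layers

def residual_module(x, filters, strides=1, pooling=False):
    # main path
    y = layers.BatchNormalization()(x)
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(filters // 2, 3, strides=1, padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    f_x = layers.Conv2D(filters, 5, strides=strides, padding='same')(y)

    # auxiliary path
    if pooling:
        w_x = layers.MaxPooling2D(pool_size=strides, strides=strides,
                                  padding='same')(x)
        w_x = layers.Conv2D(filters, 5, strides=1, padding='same')(w_x)
    else:
        w_x = x  # shortcut: pass X through unchanged

    # overall output H(X) = F(X) + W(X) (or F(X) + X)
    return layers.Add()([f_x, w_x])
```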
In the application, the GG-CNN model is improved using the residual modules constructed above; on the premise that the original input and output sizes remain unchanged, an intermediate structure is built by stacking the residual modules. The model structure is shown in FIG. 10.
Specifically, the GG-CNN network improved by residual shown in fig. 10 includes: a convolution part, a deconvolution part and an output part;
the convolution portion includes ten residual modules,
wherein the first residual module comprises: 1 convolution residual module with pooling layer, the parameters in the convolution residual module including: 4 filters with step size of 3 × 3;
the second residual module includes: 5 identity residual modules, wherein the parameters of the identity residual modules comprise: 4 filters with step size of 1 × 1;
the third residual module includes: 1 convolution residual module with pooling layer, the parameters in the convolution residual module including: 8 filters with step size of 2 × 2;
the fourth residual module includes: 5 identity residual modules, wherein the parameters of the identity residual modules comprise: 8 filters with step size of 1 × 1;
the fifth residual module includes: 1 convolution residual module with pooling layer, the parameters in the convolution residual module including: 16 filters with step size of 2 × 2;
the sixth residual module includes: 5 identity residual modules, wherein the parameters of the identity residual modules comprise: 16 filters with step size of 1 × 1;
the seventh residual module includes: 1 convolution residual module with pooling layer, the parameters in the convolution residual module including: 32 filters with step size of 5 × 5;
the eighth residual module includes: 5 identity residual modules, wherein the parameters of the identity residual modules comprise: 32 filters with step size of 1 × 1;
the ninth residual module includes: 1 convolution residual module with pooling layer, the parameters in the convolution residual module including: 64 filters with step size of 1 × 1;
the tenth residual module includes: 5 identity residual modules, wherein the parameters of the identity residual modules comprise: 64 filters with step size of 1 × 1;
the deconvolution part comprises 5 deconvolution layers with different parameters;
the number of filters of the first deconvolution layer is 64, the size of each filter is 3 × 3, and the step size is 1 × 1;
the number of filters of the second deconvolution layer is 32, the size of each filter is 5 × 5, and the step size is 5 × 5;
the number of filters of the third deconvolution layer is 16, the size of each filter is 5 × 5, and the step size is 2 × 2;
the number of filters of the fourth deconvolution layer is 8, the size of each filter is 7 × 7, and the step size is 2 × 2;
the number of the fifth deconvolution layer filters is 4, the size of each filter is 9 × 9, and the step size is 3 × 3;
the output part comprises four linearly mapped convolution layers, each convolution layer comprises 1 filter, and the four linearly mapped convolution layers sequentially and respectively map and output the grabbing success rate, the cosine value of the grabbing angle, the sine value of the grabbing angle and the grabbing width.
That is, in this embodiment the output of the residual portion is transformed by the deconvolution layers to obtain the grabbing map G required in this application; the deconvolution output is linearly activated and mapped to the output-layer grabbing position (success rate) image, the grabbing angle map Φ formed by the sine and cosine images of the angle, and the grabbing width image W, forming the residual-improved GG-CNN network of the present application.
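Putting the pieces together, the structure of FIG. 10 could be assembled along the following lines using the residual_module sketch above; the ordering of the four output heads and the use of 'same' padding are assumptions made for illustration, while the filter counts, kernel sizes and strides follow the parameter lists given in the text.

```python
# Sketch of the residual-improved GG-CNN of FIG. 10, assembled from the
# residual_module sketch above. Filter counts, kernel sizes and strides
# follow the parameter lists in the text; output-head ordering and 'same'
# padding are illustrative assumptions.
from tensorflow.keras import layers, Model

def build_improved_ggcnn(input_shape=(300, 300, 1)):
    inputs = layers.Input(shape=input_shape)
    x = inputs

    # convolution part: ten residual stages
    # (filters, stride of the pooled convolutional module)
    for filters, strides in [(4, 3), (8, 2), (16, 2), (32, 5), (64, 1)]:
        x = residual_module(x, filters, strides=strides, pooling=True)
        for _ in range(5):                       # five identity residual modules
            x = residual_module(x, filters, strides=1, pooling=False)

    # deconvolution part: five transposed convolutions
    for filters, kernel, strides in [(64, 3, 1), (32, 5, 5), (16, 5, 2),
                                     (8, 7, 2), (4, 9, 3)]:
        x = layers.Conv2DTranspose(filters, kernel, strides=strides,
                                   padding='same')(x)

    # output part: four linearly mapped single-filter convolution layers
    q = layers.Conv2D(1, 1, activation='linear', name='q')(x)
    cos_2phi = layers.Conv2D(1, 1, activation='linear', name='cos_2phi')(x)
    sin_2phi = layers.Conv2D(1, 1, activation='linear', name='sin_2phi')(x)
    width = layers.Conv2D(1, 1, activation='linear', name='width')(x)
    return Model(inputs, [q, cos_2phi, sin_2phi, width])
```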
3 results and analysis of the experiment
In the application, the residual-network-improved GG-CNN model is used in the mechanical arm grabbing simulation experiments. The experimental environment is an Ubuntu 16.04 system, the pose generation algorithm and the grabbing algorithm are programmed in Python 2, a laboratory server graphics card (GTX 1080) is used to accelerate the training process, and multiple improvement experiments are carried out.
In the training and testing of the network model, the accuracy of the model is measured using the concept of the intersection-over-union (IoU) of target detection areas. The intersection-over-union is defined as
IoU(C, G) = |C ∩ G| / |C ∪ G|
and the ratio of the intersection to the union of the grabbing frame generated by the network and the marked grabbing frame is taken as the accuracy of the grabbing generated by the network in this application.
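When the two grabbing frames are represented as binary masks of equal size, the intersection-over-union above reduces to a few lines of numpy, as in the minimal sketch below.

```python
# Minimal IoU between two grasp regions given as boolean masks of equal shape.
import numpy as np

def iou(region_c, region_g):
    intersection = np.logical_and(region_c, region_g).sum()
    union = np.logical_or(region_c, region_g).sum()
    return float(intersection) / union if union > 0 else 0.0
```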
Experiments, improvement and optimization are carried out on the original GG-CNN network parameters. The accuracy of the GG-CNN network is improved by adjusting the optimizer type, the learning rate, the regularization parameters, the batch size, the loss function, the activation function and the number of layers of the neural network. After multiple experiments, the Adam optimizer is selected with learning-rate decay, the batch size is set to 32, MSE (mean squared error) is adopted as the loss function and ReLU as the activation function, and a deep residual network is constructed by stacking multiple layers of the constructed residual modules.
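Under those choices, the training setup could look roughly as follows in Keras; the initial learning rate and decay schedule are illustrative assumptions, while the Adam optimizer, learning-rate decay, batch size of 32 and MSE loss come from the text.

```python
# Training-configuration sketch reflecting the choices above: Adam optimizer
# with learning-rate decay, batch size 32, MSE loss on the four output images.
# The initial learning rate and decay schedule are illustrative assumptions.
import tensorflow as tf

model = build_improved_ggcnn()   # from the sketch above
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.95)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=schedule),
              loss='mse')        # the same MSE loss on all four outputs

# depth_train: (N, 300, 300, 1); targets: four (N, 300, 300, 1) arrays
# model.fit(depth_train, [q_train, cos_train, sin_train, w_train],
#           batch_size=32, epochs=100)
```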
As shown in FIG. 11, the accuracy curves of the residual-network-improved GG-CNN model of FIG. 10 and of the original GG-CNN model of FIG. 5 both improve gradually as the number of epochs increases, and after 100 epochs of training the accuracy of the models before and after the improvement is essentially stable. Comparing the two accuracy curves, it can clearly be seen that the accuracy of the model before the improvement stabilizes at about 71%, while the accuracy of the improved model finally stabilizes at about 88%.
The GG-CNN model improved by the residual error network improves the pose generation accuracy by 17%, which shows that the deep residual error network is built by utilizing the multilayer residual error modules, a deeper grasping generation convolutional neural network model is built, the accuracy of the GG-CNN model can be effectively improved, and the more accurate optimal grasping pose of the mechanical arm is obtained.
In order to test the effect of the grabbing pose generation network before and after the improvement, a dataset conforming to the GG-CNN inputs and outputs was created from the Cornell University grabbing dataset. The RGB-D images of real objects in the Cornell dataset are displayed together with their "positive grab" and "negative grab" marks, where the marked graspable poses are represented by rectangular frames, and the whole RGB image is placed at the upper-left position; the depth image corresponding to the dataset is taken as output, with a light grey rectangular frame representing a marked graspable pose and a dark grey rectangular frame representing the graspable pose generated by the trained neural network, and the whole depth image is placed at the upper-right position; each pixel in the grabbing width and grabbing angle images output by the trained network has a corresponding grabbing parameter value, the grabbing width image is placed at the lower-left position and the grabbing angle image at the lower-right position. The effects before and after the network improvement are thus displayed as groups of four images, using two objects (object 1 and object 2), as shown in FIG. 12. In FIG. 12, (a) and (b) are the output poses generated for object 1 and object 2 before the improvement, and (c) and (d) are the output poses generated for object 1 and object 2 after the improvement.
Comparing the outputs of the network models before and after the improvement, first observe the dark grey rectangular frames generated by the grabbing-generation convolutional neural network in the depth images. For object 1, the grabbing frame generated before the improvement is too narrow to allow an actual grab, whereas the grabbing frame generated by the improved GG-CNN model has a suitable width and a position that meets the grabbing requirement; for object 2, the positions of the grabbing frames generated before and after the improvement both meet the actual requirement, with good effect. Observing the grabbing width and angle images output by the network, for both object 1 and object 2 the distribution of graspable pixels in the images generated by the improved model is more consistent with the pixel distribution of the actual object's depth image, and the grabbing width and angle values are closer to reality. The colors of the output grabbing information images are more distinct, indicating that the improved network model is more sensitive to differences in object size, shape and position and better reflects changes in the grabbing information.
The GG-CNN model is improved by constructing the residual error network, so that the accuracy of model generation and grabbing pose is obviously improved, and the grabbing effect is obviously improved.
In the prior art, the GG-CNN model pursues calculation speed, adopts an over-simple neural network structure, reduces the magnitude of neural network parameters, and sacrifices the capture accuracy of a part of network models. According to the method, a residual block function suitable for a network model is constructed by adopting the idea of a residual network, the structure of the GG-CNN model is reconstructed, the accuracy of predicting the optimal grabbing pose of the mechanical arm by the model is greatly improved, although the deeper network means the increase of the calculation time, the grabbing with high quality and high precision is still an important requirement in actual grabbing, and the method has a certain application value in certain fields with higher precision requirements.
It should be understood that the above description of specific embodiments of the present invention is only intended to illustrate the technical solutions and features of the present invention and to enable those skilled in the art to understand and implement the invention, but the invention is not limited to the above specific embodiments. All changes and modifications that fall within the scope of the appended claims are intended to be embraced therein.

Claims (7)

1. A residual error network deep learning method for estimating a grabbing pose of a mechanical arm is characterized by comprising the following steps:
S2, acquiring a depth image, collected by a wrist camera of the initialized mechanical arm, of the target object to be grabbed, wherein the end of the mechanical arm is adjusted so that the wrist camera is located at a preset height vertically above the XOY plane;
s3, preprocessing the acquired depth image to obtain an object depth image with 300 x 300 pixels;
s4, mapping the object depth image by adopting a pre-trained improved GG-CNN model, and outputting four captured information images with the pixels of 300 x 300, wherein the captured information images comprise a capturing success rate, a capturing angle cosine value, a capturing angle sine value and a capturing width;
S5, selecting the pixel point with the highest value in the grabbing success rate image, and reading the corresponding pixel points in the grabbing angle cosine value, grabbing angle sine value and grabbing width information images, so as to obtain the grabbing angle and width information with the highest grabbing success rate as the grabbing information;
s5, the obtained grabbing information is subjected to coordinate transformation of a wrist camera and then coordinate transformation between a wrist of the mechanical arm and the base, and finally the grabbing angle and the grabbing width of the target object to be grabbed under a mechanical arm base coordinate system are obtained;
the improved GG-CNN model is characterized in that a residual error network is built in the existing GG-CNN model by building a residual error module, and the fitting effect and the learning capacity of a convolutional neural network are enhanced.
2. Method according to claim 1, characterized in that, before step S2, it comprises:
S0-1, creating, based on an existing data set, a first data set G_train for training the inputs and outputs of the improved GG-CNN model; the first data set comprises images marked with positive grabbing information and images marked with negative grabbing information, and the images in the first data set carry a plurality of marked grabbing frames;
s0-2, improving the existing GG-CNN model by constructing a residual error module and constructing a residual error network so as to construct an improved GG-CNN model and ensure that the sizes of input and output images of the improved GG-CNN model are unchanged;
S0-3, training the residual-improved GG-CNN model using the first data set G_train to obtain the trained improved GG-CNN model.
3. The method of claim 2, wherein the GG-CNN model refined by residuals comprises:
a convolution part, a deconvolution part and an output part;
the convolution portion includes ten residual modules,
wherein the first residual module comprises: 1 convolution residual module with pooling layer, the parameters in the convolution residual module including: 4 filters with step size of 3 × 3;
the second residual module includes: 5 identity residual modules, wherein the parameters in the identity residual modules comprise: 4 filters with step size of 1 × 1;
the third residual module includes: 1 convolution residual module with pooling layer, the parameters in the convolution residual module including: 8 filters with step size of 2 × 2;
the fourth residual module includes: 5 identity residual modules, wherein the parameters of the identity residual modules comprise: 8 filters with step size of 1 × 1;
the fifth residual module includes: 1 convolution residual module with pooling layer, the parameters in the convolution residual module including: 16 filters with step size of 2 × 2;
the sixth residual module includes: 5 identity residual modules, wherein the parameters of the identity residual modules comprise: 16 filters with step size of 1 × 1;
the seventh residual module includes: 1 convolution residual module with pooling layer, the parameters in the convolution residual module including: 32 filters with step size of 5 × 5;
the eighth residual module includes: 5 identity residual modules, wherein the parameters of the identity residual modules comprise: 32 filters with step size of 1 × 1;
the ninth residual module includes: 1 convolution residual module with pooling layer, the parameters in the convolution residual module including: 64 filters with step size of 1 × 1;
the tenth residual module includes: 5 identity residual modules, wherein the parameters of the identity residual modules comprise: 64 filters with step size of 1 × 1;
the deconvolution part comprises 5 deconvolution layers with different parameters;
the number of filters of the first deconvolution layer is 64, the size of each filter is 3 × 3, and the step size is 1 × 1;
the number of filters of the second deconvolution layer is 32, the size of each filter is 5 × 5, and the step size is 5 × 5;
the number of filters of the third deconvolution layer is 16, the size of each filter is 5 × 5, and the step size is 2 × 2;
the number of filters of the fourth deconvolution layer is 8, the size of each filter is 7 × 7, and the step size is 2 × 2;
the number of the fifth deconvolution layer filters is 4, the size of each filter is 9 × 9, and the step size is 3 × 3;
the output part comprises four linearly mapped convolution layers, each convolution layer comprises 1 filter, and the four linearly mapped convolution layers sequentially and respectively map and output the grabbing success rate, the cosine value of the grabbing angle, the sine value of the grabbing angle and the grabbing width.
4. A method according to claim 3, characterized in that in step S0-3, the following intersection-over-union formula is used to measure the grabbing accuracy of the residual-improved GG-CNN network;
intersection-over-union formula:
IoU(C, G) = |C ∩ G| / |C ∪ G|
wherein C and G respectively represent two known areas, and the intersection-over-union is computed as the ratio of the intersection to the union of the two areas.
5. The method of claim 1,
the depth image in S2 is I ∈ R^(H×W), where H is the height and W is the width, and the grabbing description in the depth image is:
g̃ = (o, φ̃, ω̃, q)
wherein o = (u, v) is the position coordinate of the pixel with the highest grabbing success rate, φ̃ is the rotation angle in the camera reference frame, and ω̃ is the grabbing width in image coordinates; through the coordinate transformation of the mechanical arm, the grabbing g̃ in image space is converted into the grabbing g in world coordinates:
g = T_RC(T_CI(g̃))
where T_RC is the coordinate transformation from the camera coordinate system to the mechanical arm coordinate system, and T_CI is the calibration transformation based on the camera internal parameters and the hand-eye position between the mechanical arm and the camera;
the output image in S4 is represented as: G = (Φ, W, Q) ∈ R^(3×H×W)
where Φ, W and Q are each ∈ R^(H×W), representing the grabbing angle, the grabbing width and the grabbing success rate respectively, wherein the grabbing angle Φ is split into a grabbing angle cosine value and a grabbing angle sine value, and at the coordinate o with the highest grabbing success rate, Φ, W and Q contain the corresponding values of φ̃, ω̃ and q;
in step S4, the pre-trained improved GG-CNN model performs the mapping on the object depth image, specifically: G = M(I);
the optimal grabbing pose in image space is determined from G:
g̃_best = max_Q G
specifically, from the output grabbing information G, the pixel with the largest grabbing success rate q in the Q image is selected, and its coordinate o is used to look up Φ and W in the output G, giving the position, angle and width of the optimal grabbing pose;
further, the optimal grabbing pose g_best in world coordinates is calculated through
g_best = T_RC(T_CI(g̃_best)).
6. The method of claim 3,
the processing procedure of each residual module of the convolution part comprises the following steps:
each residual error module comprises a main path and an auxiliary path;
the auxiliary path is one of two alternatives: a path that applies pooling and convolution operations, or a shortcut path that applies no operation;
specifically, the main path includes:
1) the input data X is first subjected to a regularization operation, then passes through an activation layer using a ReLU activation function, and is finally passed through a convolution layer of filters to the next layer;
2) the output of the previous layer is again regularized, passed through an activation layer using a ReLU activation function, and finally passed through a convolution layer of filters to output F(X);
the auxiliary path includes:
1) when the pooling parameter of the module is true: the input data X passes through a maximum pooling layer, then through a convolution layer with a filter size of 5 × 5, a given number of filters and a step size of 1 × 1, and W(X) is output;
2) when the pooling parameter of the module is false: X is output directly without any operation;
the output of the main path and the output of the selected auxiliary path are added to form the overall output H(X) of the residual module.
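An illustrative sketch of one such residual building module, interpreting the regularization operation as batch normalization; the main-path kernel sizes, the channel counts and the pooling window are assumptions, and the sketch keeps all feature maps at the same resolution so the two paths can be added directly:

```python
import torch
import torch.nn as nn


class ResidualModule(nn.Module):
    """Sketch of a residual building module: main path F(X) plus an auxiliary
    path that is either max-pool + 5x5 convolution (W(X)) or an identity shortcut."""

    def __init__(self, channels: int, pooling: bool):
        super().__init__()
        # main path: (regularization -> ReLU -> convolution) twice, producing F(X);
        # 3x3 kernels and equal channel counts are assumptions
        self.main = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        if pooling:
            # auxiliary path when the pooling parameter is true:
            # max pooling, then a 5x5 convolution with stride 1 (pooling window assumed)
            self.aux = nn.Sequential(
                nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
                nn.Conv2d(channels, channels, kernel_size=5, stride=1, padding=2),
            )
        else:
            # auxiliary path when the pooling parameter is false: identity shortcut
            self.aux = nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.main(x) + self.aux(x)  # H(X) = F(X) + W(X), or F(X) + X
```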
7. The method according to any one of claims 1 to 6, characterized in that before the step S2, it further comprises the following step S1:
S1, initializing the mechanical arm, and adjusting the mechanical arm so that the wrist camera is located at a preset height above the vertical X0Y plane;
correspondingly, after step S5, step S6 is also included:
and S6, outputting the grabbing position, angle and width information of the target object after coordinate transformation, so as to control the mechanical arm to grab the target object.
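Finally, an illustrative sketch of how steps S1 to S6 fit together; `camera`, `model`, `robot` and `image_to_base` are hypothetical interfaces standing in for the wrist camera, the trained improved GG-CNN, the arm controller and the coordinate transformation, and none of these names are defined by the patent:

```python
import numpy as np


def run_grasp_cycle(camera, model, robot, image_to_base):
    """Hypothetical end-to-end grabbing cycle following steps S1-S6."""
    robot.move_camera_to_start_height()             # S1: wrist camera above the X0Y plane (hypothetical call)
    depth = camera.read_depth()                     # S2: depth image of the object to be grabbed
    q, cos_m, sin_m, width = model(depth)           # S3/S4: four grabbing-information maps, G = M(I)
    v, u = np.unravel_index(np.argmax(q), q.shape)  # S5: pixel with the highest grabbing success rate
    phi = np.arctan2(sin_m[v, u], cos_m[v, u])      # grabbing angle at that pixel
    pos, ang, w = image_to_base((u, v), phi, width[v, u])  # S6: transform to the base coordinate system
    robot.grasp(position=pos, angle=ang, width=w)   # S6: execute the grab (hypothetical call)
```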
CN201910192296.4A 2019-03-14 2019-03-14 Residual error network deep learning method for mechanical arm grabbing pose estimation Active CN109934864B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910192296.4A CN109934864B (en) 2019-03-14 2019-03-14 Residual error network deep learning method for mechanical arm grabbing pose estimation

Publications (2)

Publication Number Publication Date
CN109934864A CN109934864A (en) 2019-06-25
CN109934864B (en) 2023-01-20

Family

ID=66987254

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910192296.4A Active CN109934864B (en) 2019-03-14 2019-03-14 Residual error network deep learning method for mechanical arm grabbing pose estimation

Country Status (1)

Country Link
CN (1) CN109934864B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110633738B (en) * 2019-08-30 2021-12-10 杭州电子科技大学 Rapid classification method for industrial part images
CN111015676B (en) * 2019-12-16 2023-04-28 中国科学院深圳先进技术研究院 Grabbing learning control method, system, robot and medium based on hand-free eye calibration
CN113021333A (en) * 2019-12-25 2021-06-25 沈阳新松机器人自动化股份有限公司 Object grabbing method and system and terminal equipment
CN111127548B (en) * 2019-12-25 2023-11-24 深圳市商汤科技有限公司 Grabbing position detection model training method, grabbing position detection method and grabbing position detection device
CN113362353A (en) * 2020-03-04 2021-09-07 上海分众软件技术有限公司 Method for identifying advertising player frame by utilizing synthesis training picture
CN111444917A (en) * 2020-03-30 2020-07-24 合肥京东方显示技术有限公司 License plate character recognition method and device, electronic equipment and storage medium
CN112437349B (en) * 2020-11-10 2022-09-23 杭州时趣信息技术有限公司 Video stream recommendation method and related device
CN112734727A (en) * 2021-01-11 2021-04-30 安徽理工大学 Apple picking method based on improved deep neural network
CN113269112A (en) * 2021-06-03 2021-08-17 梅卡曼德(北京)机器人科技有限公司 Method and device for identifying capture area, electronic equipment and storage medium
CN113327295A (en) * 2021-06-18 2021-08-31 华南理工大学 Robot rapid grabbing method based on cascade full convolution neural network
CN113799138A (en) * 2021-10-09 2021-12-17 中山大学 Mechanical arm grabbing method for generating convolutional neural network based on grabbing
CN114155294A (en) * 2021-10-25 2022-03-08 东北大学 Engineering machinery working device pose estimation method based on deep learning
CN114241247B (en) * 2021-12-28 2023-03-07 国网浙江省电力有限公司电力科学研究院 Transformer substation safety helmet identification method and system based on deep residual error network
CN115026836B (en) * 2022-07-21 2023-03-24 深圳市华成工业控制股份有限公司 Control method, device and equipment of five-axis manipulator and storage medium
CN115319739A (en) * 2022-08-02 2022-11-11 中国科学院沈阳自动化研究所 Workpiece grabbing method based on visual mechanical arm
CN115070781B (en) * 2022-08-24 2022-12-13 绿盛环保材料(集团)有限公司 Object grabbing method and two-mechanical-arm cooperation system
CN117732827A (en) * 2024-01-10 2024-03-22 深圳市林科超声波洗净设备有限公司 Battery shell cleaning line feeding and discharging control system and method based on robot

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015153739A1 (en) * 2014-04-01 2015-10-08 University Of South Florida Systems and methods for planning a robot grasp based upon a demonstrated grasp
US10089575B1 (en) * 2015-05-27 2018-10-02 X Development Llc Determining grasping parameters for grasping of an object by a robot grasping end effector
CN106874914A (en) * 2017-01-12 2017-06-20 华南理工大学 A kind of industrial machinery arm visual spatial attention method based on depth convolutional neural networks
CN108491880A (en) * 2018-03-23 2018-09-04 西安电子科技大学 Object classification based on neural network and position and orientation estimation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant