CN114782347A - Mechanical arm grabbing parameter estimation method based on attention mechanism generation type network - Google Patents

Mechanical arm grabbing parameter estimation method based on attention mechanism generation type network

Info

Publication number
CN114782347A
Authority
CN
China
Prior art keywords
grabbing
image
dimensional
mechanical arm
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210387024.1A
Other languages
Chinese (zh)
Inventor
杨宇翔
邢玉虎
全嘉勉
高明裕
何志伟
董哲康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202210387024.1A
Publication of CN114782347A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06T 7/0004 Industrial image inspection
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J 9/00 Programme-controlled manipulators
    • B25J 9/16 Programme controls
    • B25J 9/1656 Programme controls characterised by programming, planning systems for manipulators
    • B25J 9/1664 Programme controls characterised by motion, path, trajectory planning
    • B25J 9/1679 Programme controls characterised by the tasks executed
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/10021 Stereoscopic video; Stereoscopic image sequence
    • G06T 2207/10024 Color image

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mechanical Engineering (AREA)
  • Robotics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a mechanical arm grabbing parameter estimation method based on an attention-mechanism generative network. In the method, an RGB-D camera captures images of the work scene, which are input into a trained attention-based generative network to obtain mechanical arm grabbing parameters such as grabbing quality, grabbing angle, grabbing width and grabbing priority; the remaining grabbing parameters are then filtered by the grabbing priority, yielding better mechanical arm grabbing parameters in complex multi-object environments. The method not only gives the mechanical arm good autonomous grabbing capability in complex stacking environments, but the grabbing-priority estimation also deepens the vision system's perception of relevant information in such environments and strengthens the overall system's handling of multi-dimensional data, thereby improving the grabbing precision of the mechanical arm in complex stacking environments.

Description

Mechanical arm grabbing parameter estimation method based on attention mechanism generation type network
Technical Field
The invention belongs to the field of mechanical arm grabbing control, and particularly relates to a mechanical arm grabbing parameter estimation method based on an attention mechanism generating network.
Background
At present, in research on autonomous grasping by mechanical arms, the grasping technology for a single object in a simple scene is relatively mature; in practice, however, multiple objects are often stacked in a disordered manner in complex environments, which poses a greater challenge to autonomous grasping technology. The invention provides a mechanical arm grabbing parameter estimation method based on an attention-mechanism generative network, which effectively estimates the mechanical arm grabbing parameters through the attention-mechanism generative network, deepens the vision system's perception of relevant information in complex environments, improves the fusion of multi-channel information, realizes grasping tasks for various objects in complex environments, and thereby addresses the problem of autonomous grasping by the mechanical arm in multi-object stacking environments.
Disclosure of Invention
Aiming at the problem of autonomous grabbing of the mechanical arm in a complex stacking scene, the invention provides a mechanical arm grabbing parameter estimation method based on an attention mechanism generating network, so that the grabbing precision of autonomous grabbing of the mechanical arm in a complex stacking environment is improved.
In order to achieve the above purpose, the invention adopts the following main technical solution:
S1: use an RGB-D camera to obtain the work scene image of the mechanical arm in the current state, including an RGB image I_rgb, a depth image I_depth, and a work scene reference coordinate system;
S2: input each work scene image I into the trained attention-based generative neural network to generate a predicted two-dimensional image group containing the motion instruction vector, where the predicted two-dimensional image group contains at least one two-dimensional grabbing quality image G_θ, one two-dimensional grabbing angle image A_θ, one two-dimensional grabbing width image W_θ and one two-dimensional grabbing priority image O_θ, which respectively carry the grabbing success rate, grabbing angle, jaw opening width and grabbing order information for the mechanical arm when grabbing an object;
S3: sort the pixel values of the two-dimensional grabbing quality image G_θ in the predicted two-dimensional image group and select the n pixel points with the largest pixel values; these pixel points give the predictions with the highest grabbing success rate. According to the pixel coordinates of these predictions, the corresponding grabbing angle predictions, grabbing width predictions and grabbing order predictions are read from the two-dimensional grabbing angle image A_θ, the two-dimensional grabbing width image W_θ and the two-dimensional grabbing priority image O_θ, where p_n denotes the pixel coordinate of the n-th pixel when the pixel values of G_θ are ranked from largest to smallest;
S4: sort the grabbing order predictions and select the prediction with the highest grabbing priority; the grabbing information corresponding to its pixel coordinate, i.e. the grabbing quality, angle, width and order at that point, is the optimal motion instruction vector;
And S5, analyzing the obtained optimal motion instruction vector to obtain the grabbing coordinate, the grabbing angle and the grabbing width of the target object to be grabbed under the base coordinate system of the mechanical arm, namely the grabbing parameters of the mechanical arm.
Preferably, step S2 includes:
S21: train the attention-based generative neural network on an existing data set;
S22: preprocess each work scene image to obtain a 300 × 300 pixel work scene image;
S23: input the preprocessed image into the trained attention-based generative neural network;
S24: the attention-based generative neural network outputs the predicted two-dimensional image group containing the motion instruction vector, i.e. the predicted grabbing parameters including the grabbing success probability.
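By way of illustration only, a minimal Python sketch of the preprocessing in step S22 is given below; the center-crop location, the normalization constants and the helper name preprocess are assumptions, since the disclosure only states that the work scene images are preprocessed to 300 × 300 pixels:

```python
import numpy as np

def preprocess(rgb: np.ndarray, depth: np.ndarray, size: int = 300) -> np.ndarray:
    """Center-crop RGB and depth to size x size and fuse them into one 4-channel array.

    The crop position and normalization scheme are illustrative assumptions.
    """
    h, w = depth.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    rgb_c = rgb[top:top + size, left:left + size].astype(np.float32) / 255.0
    depth_c = depth[top:top + size, left:left + size].astype(np.float32)
    depth_c = (depth_c - depth_c.mean()) / (depth_c.std() + 1e-6)
    fused = np.concatenate([rgb_c, depth_c[..., None]], axis=-1)  # (size, size, 4)
    # Transpose to channels-first (4, size, size) before feeding a channels-first network.
    return fused.transpose(2, 0, 1)
```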
Preferably, step S3 includes:
S31: sort the pixel values of the two-dimensional grabbing quality image in the predicted two-dimensional image group; each pixel value in the two-dimensional grabbing quality image represents the success rate of a grasp by the mechanical arm centered on that point;
S32: select the coordinates of the n largest grabbing success rate predictions as candidate grabbing center coordinates;
S33: from the predicted two-dimensional image group, obtain the grabbing angle pixel value, grabbing width pixel value and grabbing priority pixel value corresponding to each grabbing center coordinate;
S34: parse the grabbing angle, grabbing width and grabbing order information from the grabbing angle, grabbing width and grabbing priority pixel values.
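For illustration, a NumPy sketch of steps S31 to S34 is given below, assuming the network outputs are 300 × 300 arrays named quality, angle, width and priority, and that a larger priority value means the object should be grasped earlier; these names and the sign convention are assumptions, not part of the disclosure:

```python
import numpy as np

def select_grasp(quality, angle, width, priority, n: int = 10):
    """Pick the n highest-quality pixels, then keep the candidate with the best priority."""
    flat_idx = np.argpartition(quality.ravel(), -n)[-n:]        # n largest quality values
    ys, xs = np.unravel_index(flat_idx, quality.shape)
    candidates = [
        (priority[y, x], quality[y, x], angle[y, x], width[y, x], (y, x))
        for y, x in zip(ys, xs)
    ]
    best = max(candidates, key=lambda c: c[0])                  # highest grabbing priority
    prio, q, a, w, (y, x) = best
    return {"center_px": (int(x), int(y)), "quality": float(q),
            "angle": float(a), "width": float(w), "priority": float(prio)}
```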
Preferably, step S5 includes:
S51: parse the obtained optimal motion instruction vector;
S52: convert the parsed data into the wrist camera coordinate system;
S53: transform the coordinates from the camera coordinate system into the mechanical arm base coordinate system;
S54: input the coordinates obtained in the mechanical arm base coordinate system, together with the grabbing width and grabbing angle information, into the mechanical arm control system to perform the grasp.
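A hedged sketch of the coordinate chain in steps S51 to S54 follows: the selected grasp pixel is back-projected into the wrist camera frame using the camera intrinsics and the measured depth, then mapped into the mechanical arm base frame by a hand-eye transform. The intrinsic matrix K and the 4 × 4 transform T_base_cam come from the user's own calibration and are not specified in the disclosure:

```python
import numpy as np

def pixel_to_base(u: float, v: float, depth_m: float,
                  K: np.ndarray, T_base_cam: np.ndarray) -> np.ndarray:
    """Back-project pixel (u, v) with depth depth_m (metres) and express it in the base frame."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # 3-D point in the wrist camera coordinate system (homogeneous coordinates).
    p_cam = np.array([(u - cx) * depth_m / fx, (v - cy) * depth_m / fy, depth_m, 1.0])
    # Homogeneous transform from the camera frame to the mechanical arm base frame.
    p_base = T_base_cam @ p_cam
    return p_base[:3]
```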
Preferably, the training method of the attention-based generative neural network comprises the following steps:
S01: based on an existing data set, create a data set G_train for training the network; the G_train data set comprises work scene images, valid grasp box information and segmentation images containing only the topmost object, where the G_train segmentation image containing only the topmost object is used as the two-dimensional grabbing priority image;
S02: map the valid grasp box information onto 300 × 300 two-dimensional images to obtain the two-dimensional grabbing quality image, two-dimensional grabbing angle image and two-dimensional grabbing width image, and combine them with the G_train segmentation image containing only the topmost object to construct the two-dimensional image group;
S03: construct the attention-based generative neural network by building the attention mechanism module;
S04: train the attention-based generative neural network with the data set G_train and the two-dimensional image groups, taking RGB-D images without grasp information as input and outputting two-dimensional image groups containing grasp information, to obtain the trained attention-based generative neural network.
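By way of illustration, a minimal PyTorch training-step sketch for S04 is shown below, assuming the network returns the four predicted maps and that a pixel-wise mean-squared-error loss against the constructed two-dimensional image group is used; the loss choice and variable names are assumptions, as the disclosure does not specify them:

```python
import torch.nn.functional as F

def train_step(model, optimizer, rgbd, q_gt, angle_gt, width_gt, prio_gt):
    """One optimisation step: predict the four 2-D maps and regress them toward the targets."""
    model.train()
    optimizer.zero_grad()
    q, angle, width, prio = model(rgbd)              # each tensor: (B, 1, 300, 300)
    loss = (F.mse_loss(q, q_gt) + F.mse_loss(angle, angle_gt)
            + F.mse_loss(width, width_gt) + F.mse_loss(prio, prio_gt))
    loss.backward()
    optimizer.step()
    return loss.item()
```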
Preferably, the attention-based generative neural network comprises:
a feature extraction part, an attention mechanism part and a generation network part;
a feature extraction section:
The feature extraction network consists of one convolution layer with a 9 × 9 kernel and two convolution layers with 4 × 4 kernels; at this stage, each convolution layer is followed by a Batch Normalization layer and a Rectified Linear Unit activation layer;
The cropped 300 × 300 RGB image I_rgb and depth image I_depth are fused to obtain the fused feature map I_fusion; I_fusion is input into the feature extraction network, and feature extraction yields the feature map I_output1;
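A PyTorch sketch of this feature extraction part is given below for illustration; the channel counts, strides and padding are assumptions, since the disclosure only specifies the kernel sizes and the Batch Normalization / ReLU layers:

```python
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """One 9x9 convolution followed by two 4x4 convolutions, each with BatchNorm and ReLU."""
    def __init__(self, in_channels: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=9, stride=1, padding=4),
            nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)  # e.g. (B, 4, 300, 300) -> (B, 128, 75, 75)
```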
Attention mechanism part:
The attention mechanism network consists of five attention modules, where each module consists of a residual part, a Squeeze part and an Excitation part;
The residual part is divided into a direct mapping and a residual mapping. The direct mapping applies a 1 × 1 convolution kernel to I_output1 to obtain the direct mapping result h(I_output1). The residual mapping consists of two convolution layers with 3 × 3 kernels, each followed by a Batch Normalization layer, with a Rectified Linear Unit activation layer after the first Batch Normalization layer; I_output1 yields R(I_output1) after the residual mapping;
The Squeeze part is realized by introducing Global Average Pooling, whose role is to obtain the global information embedding, i.e. the feature vector, of each channel of the feature map. Let u_c be a feature map of size W × H with C channels; the feature map after Squeeze is z_c:

z_c = F_sq(u_c) = (1 / (W × H)) Σ_{i=1..W} Σ_{j=1..H} u_c(i, j)    (1)

The Excitation part learns the weight of each channel from z_c and is formed by a gate mechanism of two fully connected layers. The gating unit s_c is a feature vector of size 1 × 1 with C channels, and s_c is computed as:

s_c = F_ex(z_c, w) = σ(g(z, w)) = σ(w_2 δ(w_1 z_c))    (2)

where σ is the sigmoid activation function, δ is the ReLU activation function, w_1 and w_2 are the weights of the two fully connected layers, and γ is the number of nodes of the hidden layer;
The obtained s_c is multiplied channel-wise with u_c to obtain the recalibrated feature x̃_c:

x̃_c = F_scale(u_c, s_c) = s_c · u_c    (3)

R(I_output1) is passed through the Squeeze part and the Excitation part in turn to obtain the recalibrated feature, which is then spliced with the direct-mapping result h(I_output1) from the residual part to obtain the output I_output2 of the attention module;
Passing I_output1 through the five serially connected attention modules yields the output of the attention mechanism part.
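For illustration, a PyTorch sketch of one attention module is given below: a 1 × 1 direct mapping h(·), a residual mapping R(·) of two 3 × 3 convolutions with Batch Normalization and a ReLU after the first, and a Squeeze-and-Excitation gate implementing formulas (1) to (3). The hidden size γ and the use of addition to combine the gated feature with h(·) are assumptions:

```python
import torch.nn as nn

class AttentionModule(nn.Module):
    """Residual mapping + 1x1 direct mapping + Squeeze-and-Excitation channel gating."""
    def __init__(self, channels: int = 128, gamma: int = 16):
        super().__init__()
        self.direct = nn.Conv2d(channels, channels, kernel_size=1)      # h(.)
        self.residual = nn.Sequential(                                   # R(.)
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.squeeze = nn.AdaptiveAvgPool2d(1)                           # formula (1)
        self.excite = nn.Sequential(                                      # formula (2)
            nn.Linear(channels, gamma), nn.ReLU(inplace=True),
            nn.Linear(gamma, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        r = self.residual(x)
        b, c, _, _ = r.shape
        s = self.excite(self.squeeze(r).view(b, c)).view(b, c, 1, 1)
        return self.direct(x) + r * s                                     # formula (3) + h(x)
```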
Generation network part:
The generation network consists of two deconvolution layers with 4 × 4 kernels and one deconvolution layer with a 9 × 9 kernel, where each of the two 4 × 4 deconvolution layers is followed by a Batch Normalization layer and a Rectified Linear Unit activation layer;
The output of the attention mechanism part is input into the generation network to obtain the predicted two-dimensional image group containing the motion instruction vector.
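A PyTorch sketch of the generation part is shown below for illustration; producing the four output maps from a single 4-channel deconvolution and the intermediate channel counts are assumptions:

```python
import torch.nn as nn

class GenerationHead(nn.Module):
    """Two 4x4 deconvolutions (BatchNorm + ReLU) followed by one 9x9 deconvolution."""
    def __init__(self, in_channels: int = 128):
        super().__init__()
        self.up = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 64, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 4, kernel_size=9, stride=1, padding=4),
        )

    def forward(self, x):
        maps = self.up(x)                       # (B, 4, 300, 300)
        # Split into grabbing quality, angle, width and priority maps.
        return maps[:, 0:1], maps[:, 1:2], maps[:, 2:3], maps[:, 3:4]
```

Under these assumptions, the full generative network would simply chain the feature extraction part, the five attention modules and this generation head in sequence.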
The invention has the following beneficial effects:
the invention provides a mechanical arm grabbing parameter estimation method based on an attention mechanism generating network, which can grab various unknown objects in a complex unstructured environment. The sensing capability of the mechanical arm in a complex environment is improved by fusing multiple information channels such as a color image, a depth image and the like; the construction of the lightweight attention generation network ensures the real-time property of the mechanical arm when the mechanical arm grabs the object; the establishment of the grabbing priority improves the effective grabbing precision of the mechanical arm.
Drawings
FIG. 1 is a diagram of the structure of the attention-based generative neural network according to an embodiment of the present invention;
fig. 2 is a frame diagram of a robot grasping parameter learning system based on an attention mechanism generating network according to an embodiment of the present invention.
Detailed Description
For a better understanding of the present invention, the invention is now described in detail by way of example with reference to the accompanying drawing, FIG. 1. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit it. The present invention is described in further detail through the following embodiments:
S1: acquire the work scene image of the mechanical arm in the current state using the RGB-D camera, where the work scene image comprises an RGB image I_rgb, a depth image I_depth, and a work scene reference coordinate system;
S2: input each work scene image I in turn into the trained attention-based generative neural network to generate the predicted two-dimensional image group containing the motion instruction vector, where the two-dimensional image group contains at least one two-dimensional grabbing quality image G_θ, one two-dimensional grabbing angle image A_θ, one two-dimensional grabbing width image W_θ and one two-dimensional grabbing priority image O_θ, which respectively carry the grabbing success rate, grabbing angle, jaw opening width and grabbing order information for the mechanical arm when grabbing an object;
the attention mechanism-based generating neural network comprises a feature extraction part, an attention mechanism part and a generating network part;
a feature extraction section:
The feature extraction network consists of one convolution layer with a 9 × 9 kernel and two convolution layers with 4 × 4 kernels; at this stage, each convolution layer is followed by a Batch Normalization layer and a Rectified Linear Unit activation layer;
The cropped 300 × 300 RGB image I_rgb and depth image I_depth are fused to obtain the fused feature map I_fusion; I_fusion is input into the feature extraction network, and feature extraction yields the feature map I_output1;
Attention mechanism part:
The attention mechanism network consists of five attention modules, where each module consists of a residual part, a Squeeze part and an Excitation part;
The residual part is divided into a direct mapping and a residual mapping. The direct mapping applies a 1 × 1 convolution kernel to I_output1 to obtain the direct mapping result h(I_output1). The residual mapping consists of two convolution layers with 3 × 3 kernels, each followed by a Batch Normalization layer, with a Rectified Linear Unit activation layer after the first Batch Normalization layer; I_output1 yields R(I_output1) after the residual mapping;
The Squeeze part is realized by introducing Global Average Pooling (GAP), whose role is to obtain the global information embedding, i.e. the feature vector, of each channel of the feature map. Let u_c be a feature map of size W × H with C channels; the feature map after Squeeze is z_c:

z_c = F_sq(u_c) = (1 / (W × H)) Σ_{i=1..W} Σ_{j=1..H} u_c(i, j)    (1)

The Excitation part learns the weight of each channel from z_c and is formed by a gate mechanism of two fully connected layers. The gating unit s_c is a feature vector of size 1 × 1 with C channels, and s_c is computed as:

s_c = F_ex(z_c, w) = σ(g(z, w)) = σ(w_2 δ(w_1 z_c))    (2)

where σ is the sigmoid activation function, δ is the ReLU activation function, w_1 and w_2 are the weights of the two fully connected layers, and γ is the number of nodes of the hidden layer;
The obtained s_c is multiplied channel-wise with u_c to obtain the recalibrated feature x̃_c:

x̃_c = F_scale(u_c, s_c) = s_c · u_c    (3)

R(I_output1) is passed through the Squeeze part and the Excitation part in turn to obtain the recalibrated feature, which is then spliced with the direct-mapping result h(I_output1) from the residual part to obtain the output I_output2 of the attention module;
Passing I_output1 through the five serially connected attention modules yields the output of the attention mechanism part.
Generation network part:
The generation network consists of two deconvolution layers with 4 × 4 kernels and one deconvolution layer with a 9 × 9 kernel, where each of the two 4 × 4 deconvolution layers is followed by a Batch Normalization layer and a Rectified Linear Unit activation layer;
The output of the attention mechanism part is input into the generation network to obtain the predicted two-dimensional image group containing the motion instruction vector.
S3: sort the pixel values of the two-dimensional grabbing quality image G_θ in the predicted two-dimensional image group and select the ten pixel points with the largest pixel values; these pixel points give the predictions with the highest grabbing success rate. According to the pixel coordinates of these predictions, the corresponding grabbing angle predictions, grabbing width predictions and grabbing order predictions are read from the two-dimensional grabbing angle image A_θ, the two-dimensional grabbing width image W_θ and the two-dimensional grabbing priority image O_θ.
S4: sort the grabbing order predictions and select the prediction with the highest grabbing priority; the grabbing information corresponding to its pixel coordinate, i.e. the grabbing quality, angle, width and order at that point, is the optimal motion instruction vector.
And S5, analyzing the obtained optimal motion instruction vector, and obtaining the grabbing coordinate, the grabbing angle and the grabbing width of the target object to be grabbed in the mechanical arm base coordinate system, namely the mechanical arm grabbing parameters, through coordinate transformation of the wrist camera and coordinate transformation between the mechanical arm wrist and the base after analysis.
S6 repeats steps S1-S5 until all objects are grabbed.
As a specific preference for implementing the technical solution of the present invention, before step S2 the method comprises:
S01: based on an existing data set, create a data set G_train for training the network; the G_train data set comprises work scene images, valid grasp box information and segmentation images containing only the topmost object;
S02: map the valid grasp box information onto 300 × 300 two-dimensional images to obtain the two-dimensional grabbing quality image, two-dimensional grabbing angle image and two-dimensional grabbing width image, and combine them with the G_train segmentation image containing only the topmost object to construct the two-dimensional image group;
S03: construct the attention-based generative neural network by building the attention mechanism module;
S04: train the attention-based generative neural network with the data set G_train and the two-dimensional image groups, taking RGB-D images without grasp information as input and outputting two-dimensional image groups containing grasp information, to obtain the trained attention-based generative neural network.
As a specific preferred implementation of the technical solution of the present invention, as shown in fig. 2, a robot grasping parameter learning system based on an attention mechanism generative network includes:
Offline learning: the attention-based generative neural network is trained continuously on the data set G_train, thereby obtaining a mechanical arm grabbing parameter prediction model;
Online learning: the work scene image under actual conditions is acquired through on-site perception of the actual work scene and input into the mechanical arm grabbing parameter prediction model to obtain the grabbing parameters of the mechanical arm in the actual scene, thereby realizing grasping by the mechanical arm in the actual scene.
It should be understood that the above description of specific embodiments of the present invention is only intended to illustrate the technical route and features of the present invention and to enable those skilled in the art to understand and implement it, but the present invention is not limited to the above specific embodiments. All changes and modifications falling within the scope of the appended claims are intended to be embraced therein.

Claims (6)

1. The mechanical arm grabbing parameter estimation method based on the attention mechanism generation type network is characterized by comprising the following steps:
S1: acquire the work scene image of the mechanical arm in the current state using the RGB-D camera, where the work scene image comprises an RGB image I_rgb, a depth image I_depth, and a work scene reference coordinate system;
S2: input each work scene image I into the trained attention-based generative neural network to generate a predicted two-dimensional image group containing the motion instruction vector, where the predicted two-dimensional image group contains at least one two-dimensional grabbing quality image G_θ, one two-dimensional grabbing angle image A_θ, one two-dimensional grabbing width image W_θ and one two-dimensional grabbing priority image O_θ, which respectively carry the grabbing success rate, grabbing angle, jaw opening width and grabbing order information for the mechanical arm when grabbing an object;
S3: sort the pixel values of the two-dimensional grabbing quality image G_θ in the predicted two-dimensional image group and select the n pixel points with the largest pixel values; these pixel points give the predictions with the highest grabbing success rate. According to the pixel coordinates of these predictions, the corresponding grabbing angle predictions, grabbing width predictions and grabbing order predictions are read from the two-dimensional grabbing angle image A_θ, the two-dimensional grabbing width image W_θ and the two-dimensional grabbing priority image O_θ, where p_n denotes the pixel coordinate of the n-th pixel when the pixel values of G_θ are ranked from largest to smallest;
S4: sort the grabbing order predictions and select the prediction with the highest grabbing priority; the grabbing information corresponding to its pixel coordinate, i.e. the grabbing quality, angle, width and order at that point, is the optimal motion instruction vector;
And S5, analyzing the obtained optimal motion instruction vector to obtain the grabbing coordinates, the grabbing angle and the grabbing width of the target object to be grabbed under the mechanical arm base coordinate system, namely the grabbing parameters of the mechanical arm.
2. The method for estimating manipulator grasping parameters based on attention mechanism generating network according to claim 1, wherein step S2 comprises:
S21: training the attention-based generative neural network on an existing data set;
S22: preprocessing each work scene image to obtain a 300 × 300 pixel work scene image;
S23: inputting the preprocessed image into the trained attention-based generative neural network;
S24: the attention-based generative neural network outputting the predicted two-dimensional image group containing the motion instruction vector, i.e. the predicted grabbing parameters including the grabbing success probability.
3. The method for estimating manipulator grasping parameters based on an attention mechanism generating network according to claim 1 or 2, wherein step S3 includes:
S31: sorting the pixel values of the two-dimensional grabbing quality image in the predicted two-dimensional image group, where each pixel value in the two-dimensional grabbing quality image represents the success rate of a grasp by the mechanical arm centered on that point;
S32: selecting the coordinates of the n largest grabbing success rate predictions as grabbing center coordinates;
S33: obtaining, from the predicted two-dimensional image group, the grabbing angle pixel value, grabbing width pixel value and grabbing priority pixel value corresponding to each grabbing center coordinate;
S34: parsing the grabbing angle, grabbing width and grabbing order information from the grabbing angle, grabbing width and grabbing priority pixel values.
4. The method for estimating grabbing parameters of a mechanical arm based on an attention mechanism generating network as claimed in any one of claims 1 to 3, wherein step S5 includes:
S51: parsing the obtained optimal motion instruction vector;
S52: converting the parsed data into the wrist camera coordinate system;
S53: transforming the coordinates from the camera coordinate system into the mechanical arm base coordinate system;
S54: inputting the coordinates obtained in the mechanical arm base coordinate system, together with the grabbing width and grabbing angle information, into the mechanical arm control system to perform the grasp.
5. The mechanical arm grabbing parameter estimation method based on the attention mechanism generation network as claimed in claim 2, wherein: the attention mechanism-based training method of the generative neural network comprises the following steps:
S01: based on an existing data set, creating a data set G_train for training the network; the G_train data set comprises work scene images, valid grasp box information and segmentation images containing only the topmost object, where the G_train segmentation image containing only the topmost object serves as the two-dimensional grabbing priority image;
S02: mapping the valid grasp box information onto 300 × 300 two-dimensional images to obtain the two-dimensional grabbing quality image, two-dimensional grabbing angle image and two-dimensional grabbing width image, and combining them with the G_train segmentation image containing only the topmost object to construct the two-dimensional image group;
S03: constructing the attention-based generative neural network by building the attention mechanism module;
S04: training the attention-based generative neural network with the data set G_train and the two-dimensional image groups, taking RGB-D images without grasp information as input and outputting two-dimensional image groups containing grasp information, to obtain the trained attention-based generative neural network.
6. The method for estimating manipulator grasping parameters based on attention mechanism generating network according to claim 5, wherein the attention mechanism generating neural network comprises:
a feature extraction part, an attention mechanism part and a generation network part;
a feature extraction part:
The feature extraction network consists of one convolution layer with a 9 × 9 kernel and two convolution layers with 4 × 4 kernels; at this stage, each convolution layer is followed by a Batch Normalization layer and a Rectified Linear Unit activation layer;
The cropped 300 × 300 RGB image I_rgb and depth image I_depth are fused to obtain the fused feature map I_fusion; I_fusion is input into the feature extraction network, and feature extraction yields the feature map I_output1;
Attention mechanism part:
The attention mechanism network consists of five attention modules, where each module consists of a residual part, a Squeeze part and an Excitation part;
The residual part is divided into a direct mapping and a residual mapping. The direct mapping applies a 1 × 1 convolution kernel to I_output1 to obtain the direct mapping result h(I_output1). The residual mapping consists of two convolution layers with 3 × 3 kernels, each followed by a Batch Normalization layer, with a Rectified Linear Unit activation layer after the first Batch Normalization layer; I_output1 yields R(I_output1) after the residual mapping;
The Squeeze part is realized by introducing Global Average Pooling, whose role is to obtain the global information embedding, i.e. the feature vector, of each channel of the feature map. Let u_c be a feature map of size W × H with C channels; the feature map after Squeeze is z_c:

z_c = F_sq(u_c) = (1 / (W × H)) Σ_{i=1..W} Σ_{j=1..H} u_c(i, j)    (1)

The Excitation part learns the weight of each channel from z_c, formed by a gate mechanism of two fully connected layers. The gating unit s_c is a feature vector of size 1 × 1 with C channels, and s_c is computed as:

s_c = F_ex(z_c, w) = σ(g(z, w)) = σ(w_2 δ(w_1 z_c))    (2)

where σ is the sigmoid activation function, δ is the ReLU activation function, w_1 and w_2 are the weights of the two fully connected layers, and γ is the number of nodes of the hidden layer;
The obtained s_c is multiplied channel-wise with u_c to obtain the recalibrated feature x̃_c:

x̃_c = F_scale(u_c, s_c) = s_c · u_c    (3)

R(I_output1) is passed through the Squeeze part and the Excitation part in turn to obtain the recalibrated feature, which is then spliced with the direct-mapping result h(I_output1) from the residual part to obtain the output I_output2 of the attention module;
Passing I_output1 through the five serially connected attention modules yields the output of the attention mechanism part;
Generation network part:
The generation network consists of two deconvolution layers with 4 × 4 kernels and one deconvolution layer with a 9 × 9 kernel, where each of the two 4 × 4 deconvolution layers is followed by a Batch Normalization layer and a Rectified Linear Unit activation layer;
The output of the attention mechanism part is input into the generation network to obtain the predicted two-dimensional image group containing the motion instruction vector.
Application CN202210387024.1A, filed 2022-04-13 (priority date 2022-04-13): Mechanical arm grabbing parameter estimation method based on attention mechanism generation type network. Status: Pending. Publication: CN114782347A (en).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210387024.1A CN114782347A (en) 2022-04-13 2022-04-13 Mechanical arm grabbing parameter estimation method based on attention mechanism generation type network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210387024.1A CN114782347A (en) 2022-04-13 2022-04-13 Mechanical arm grabbing parameter estimation method based on attention mechanism generation type network

Publications (1)

Publication Number Publication Date
CN114782347A true CN114782347A (en) 2022-07-22

Family

ID=82429469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210387024.1A Pending CN114782347A (en) 2022-04-13 2022-04-13 Mechanical arm grabbing parameter estimation method based on attention mechanism generation type network

Country Status (1)

Country Link
CN (1) CN114782347A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115972198A (en) * 2022-12-05 2023-04-18 无锡宇辉信息技术有限公司 Mechanical arm visual grabbing method and device under incomplete information condition
CN115972198B (en) * 2022-12-05 2023-10-10 无锡宇辉信息技术有限公司 Mechanical arm visual grabbing method and device under incomplete information condition
CN117549307A (en) * 2023-12-15 2024-02-13 安徽大学 Robot vision grabbing method and system in unstructured environment
CN117549307B (en) * 2023-12-15 2024-04-16 安徽大学 Robot vision grabbing method and system in unstructured environment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination